Spark of Genius
File this post under “late to the game”, but I just completed a project where I used Apache Spark for the first time and I’m blown away. Here’s my experience.
No Cluster Needed
Perhaps it was my bias from working with Apache Impala a few years back, but I just assumed that Spark was going to need Hadoop set up on a cluster of servers. I didn’t want to spend my time getting all that set up just to play around with Spark, so I never bothered with it before.
Turns out, Spark has a rather robust single-machine setup. Even better, there's an R library that took care of all the setup for me.
sparklyr Makes Spark Simple
The R package sparklyr made my foray into Spark dead simple. The package happily installed Spark for me and provided functions to easily start and stop a Spark instance from within my R scripts.
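The whole lifecycle really is just a few lines. Here's a minimal sketch of that flow; the Spark version string is an assumption, so substitute whichever release you want:

```r
library(sparklyr)

# One-time: download and install a local Spark distribution
# (the version here is only an example)
spark_install(version = "3.5")

# Start a local Spark instance and connect to it
sc <- spark_connect(master = "local")

# ... load tables and work with Spark here ...

# Shut the instance down when the script is done
spark_disconnect(sc)
```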
Pro tip: by default, sparklyr limits Spark to a single core when it starts up an instance. You can switch to multiple cores pretty easily, and it makes a world of difference in performance.
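Here's roughly what that looks like; this is a sketch, and the core count and driver memory are arbitrary values you'd tune for your own machine:

```r
library(sparklyr)

# Build a config that overrides the local-mode defaults
conf <- spark_config()
conf$`sparklyr.cores.local` <- 4              # request 4 cores instead of 1
conf$`sparklyr.shell.driver-memory` <- "8G"   # extra driver memory rarely hurts

sc <- spark_connect(master = "local", config = conf)
```

Passing "local[*]" as the master is another way to ask Spark for every available core.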
dplyr and Spark Are a Powerful Combination
sparklyr gave me access to the tables I loaded into Spark. dplyr gave me the ability to manipulate and query those tables via dbplyr.
dplyr is amazing. Rather than hand-writing Spark SQL, dplyr gave me a set of functions for joining tables, adding WHERE clauses, and shaping the columns that came back from Spark.
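To make that concrete, here's a small sketch of the pattern. The nycflights13 tables are stand-ins for my actual data (purely for illustration); the point is that dbplyr translates the verbs into Spark SQL, and nothing executes until collect() pulls the result back into R:

```r
library(sparklyr)
library(dplyr)

sc <- spark_connect(master = "local")

# Copy two local data frames into Spark as tables
flights_tbl  <- copy_to(sc, nycflights13::flights, "flights")
airlines_tbl <- copy_to(sc, nycflights13::airlines, "airlines")

delays <- flights_tbl %>%
  inner_join(airlines_tbl, by = "carrier") %>%               # join tables
  filter(dep_delay > 15) %>%                                 # becomes a WHERE clause
  group_by(name) %>%
  summarise(mean_delay = mean(dep_delay, na.rm = TRUE)) %>%  # shape the columns returned
  collect()                                                  # run the query, pull results into R

spark_disconnect(sc)
```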
Great Performance
My project was to explore replacing an existing part of our data pipeline. With Spark, our processing time dropped from days to hours.
More Spark in the Future
After this successful venture into Spark territory, I’m pretty sure I’ll be employing Spark in future projects.