File this post under “late to the game”, but I just completed a project where I used Apache Spark for the first time and I’m blown away. Here’s my experience.
No Cluster Needed
Perhaps it was my bias from working with Apache Impala a few years back, but I just assumed that Spark was going to need Hadoop set up on a cluster of servers. I didn’t want to spend my time getting all that set up just to play around with Spark, so I never bothered with it before.
Turns out, Spark has a rather robust single-machine setup. Even better, there’s an R library that took care of all the setup for me.
sparklyr Makes Spark Simple
The R package sparklyr made my foray into Spark dead simple. The package happily installed Spark for me and provided functions to easily start and stop a Spark instance from within my R scripts.
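For anyone curious, the basic workflow looks roughly like this (a minimal sketch using sparklyr's `spark_install()`, `spark_connect()`, and `spark_disconnect()` functions; the `local` master is what tells Spark to run on a single machine):

```r
library(sparklyr)

# One-time: download and install a local copy of Spark
spark_install()

# Start a local, single-machine Spark instance and get a connection handle
sc <- spark_connect(master = "local")

# ... do work against sc ...

# Shut the instance down when finished
spark_disconnect(sc)
```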
Pro tip: by default, sparklyr limits Spark to a single core when it starts up an instance. You can change this to multiple cores pretty easily, and it makes a world of difference in performance.
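One way to bump the core count (a sketch using sparklyr's `spark_config()` mechanism; the count of 4 is just an example, and passing `master = "local[4]"` directly is another option):

```r
library(sparklyr)

# Ask Spark's local mode to use 4 cores instead of the default
conf <- spark_config()
conf$`sparklyr.cores.local` <- 4

sc <- spark_connect(master = "local", config = conf)
```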
dplyr and Spark Is a Powerful Combination
dplyr is amazing. Rather than hand-writing Spark SQL, dplyr provides a set of functions that let me join tables, add where clauses, and manipulate the columns returned from Spark.
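To illustrate the kind of pipeline I mean (a sketch using the built-in `mtcars` data set rather than my real tables; `copy_to()` moves a data frame into Spark and `collect()` pulls the result back into R):

```r
library(sparklyr)
library(dplyr)

sc <- spark_connect(master = "local")

# Copy a built-in data frame into Spark as a table
cars_tbl <- copy_to(sc, mtcars, "cars")

# dplyr verbs are translated to Spark SQL behind the scenes
result <- cars_tbl %>%
  filter(cyl > 4) %>%                              # becomes a WHERE clause
  group_by(cyl) %>%
  summarise(avg_mpg = mean(mpg, na.rm = TRUE)) %>% # aggregation runs in Spark
  collect()                                        # bring the result into R

spark_disconnect(sc)
```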
My project was to explore replacing an existing part of our data pipeline. Using Spark, our processing time went from days to hours.
More Spark in the Future
After this successful venture into Spark territory, I’m pretty sure I’ll be employing Spark in future projects.