Every last Friday of the month we organise an internal meetup at Sourcelabs. Normally we pick one subject and dive into it together. But since a lot of colleagues were on holiday this time, we decided to do it a bit differently: we all picked our own subject and worked on it for the day. This blog post is about the subject that Daniël worked on that day.
Apache Spark is a framework for building big data pipelines. It supports multiple programming languages, but Kotlin is not officially supported, so I decided to use Java instead. The first challenge was to get a basic Hello World example working. Spark works by starting a so-called Spark session, which boots up the framework and acts as an interface through which you use its functionality. The main object type that Spark works with is an RDD, a Resilient Distributed Dataset. The benefit of an RDD is that you can treat it as a single dataset, while in the background the data can be distributed over multiple workers in a cluster, allowing parallelisation and other optimisation techniques. The Hello World example creates a list of strings and maps the list to an RDD. A simple count operation is performed and the result is printed.
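A minimal sketch of that Hello World in Java could look like the snippet below. The class name and the word list are my own choices; the structure (a local Spark session, a list mapped to an RDD, a count) follows the example described above.

```java
import java.util.Arrays;
import java.util.List;

import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.sql.SparkSession;

public class HelloSpark {
    public static void main(String[] args) {
        // Start a Spark session; "local[*]" runs Spark on the local machine
        // using all available cores.
        SparkSession spark = SparkSession.builder()
                .appName("HelloSpark")
                .master("local[*]")
                .getOrCreate();
        JavaSparkContext sc = new JavaSparkContext(spark.sparkContext());

        // Map a plain Java list to an RDD and perform a simple count on it.
        List<String> words = Arrays.asList("Hello", "World", "from", "Spark");
        JavaRDD<String> rdd = sc.parallelize(words);
        System.out.println("Number of elements: " + rdd.count());

        spark.stop();
    }
}
```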
Next, I wanted to use some larger and more interesting data and perform queries on it. The International Movie Database (IMDb) provides a CSV with basic information about movies. Spark SQL can be used to read the file into a Dataset object and perform SQL queries on it, with very few lines of code. The result can be found here.
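A rough sketch of this approach is shown below. The file name and the column names are placeholders (the actual IMDb export uses its own column names), but the pattern of reading a CSV into a Dataset, registering it as a view, and querying it with SQL is the one described above.

```java
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

public class ImdbQuery {
    public static void main(String[] args) {
        SparkSession spark = SparkSession.builder()
                .appName("ImdbQuery")
                .master("local[*]")
                .getOrCreate();

        // Read the CSV, letting Spark infer the schema from the header row.
        // "imdb_movies.csv" is a placeholder for the downloaded file.
        Dataset<Row> movies = spark.read()
                .option("header", "true")
                .option("inferSchema", "true")
                .csv("imdb_movies.csv");

        // Register the Dataset as a temporary view so it can be queried with SQL.
        movies.createOrReplaceTempView("movies");

        // Hypothetical column names, purely to illustrate the query style.
        spark.sql("SELECT title, year FROM movies WHERE year > 2010 ORDER BY title")
                .show(20);

        spark.stop();
    }
}
```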
It is perfectly fine to learn to work with Apache Spark on your local machine, but the real benefit comes when it runs on a cluster with multiple nodes. Google Cloud Dataproc provides exactly this: the cluster is managed and installed for you. Apache Spark does not have its own storage system and is therefore often used in conjunction with Hadoop. This is also the case in Dataproc. In AWS, S3 can be used instead. To learn about Dataproc, I followed this Qwiklabs lesson. Qwiklabs is a very convenient platform to learn about Google Cloud tooling: a course provides you with temporary cloud credentials and resources. The course discussed how to interact with the Dataproc cluster, how to load a dataset about flight information, and how to use a logistic regression model to make predictions about flight delays.
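The snippet below is not the Qwiklabs lab code, just a sketch of how a logistic regression model could be trained on a flight dataset with Spark MLlib in Java. The storage path, the label column "delayed" and the feature columns are all assumptions made for illustration.

```java
import org.apache.spark.ml.classification.LogisticRegression;
import org.apache.spark.ml.feature.VectorAssembler;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

public class FlightDelays {
    public static void main(String[] args) {
        SparkSession spark = SparkSession.builder()
                .appName("FlightDelays")
                .getOrCreate();

        // Hypothetical flight dataset on Cloud Storage with a 0/1 label column
        // "delayed" and a few numeric feature columns.
        Dataset<Row> flights = spark.read()
                .option("header", "true")
                .option("inferSchema", "true")
                .csv("gs://my-bucket/flights.csv");

        // MLlib expects the features combined into a single vector column.
        VectorAssembler assembler = new VectorAssembler()
                .setInputCols(new String[]{"departureDelay", "distance"})
                .setOutputCol("features");
        Dataset<Row> training = assembler.transform(flights);

        // Train a logistic regression model to predict the "delayed" label.
        LogisticRegression lr = new LogisticRegression()
                .setLabelCol("delayed")
                .setFeaturesCol("features");
        lr.fit(training);

        spark.stop();
    }
}
```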
Based on one day of hacking, Apache Spark seems like an easy-to-learn framework for creating big data pipelines, especially when running on a managed cluster in Google Cloud with Dataproc. If your organisation needs to perform operations on large datasets (files, database tables or anything else), I would recommend giving it a try.