Apache Spark Tutorial | Fast Data Architecture Series

Continuing the Fast Data Architecture Series, this article focuses on Apache Spark. In this Apache Spark tutorial we will learn what Spark is and why it is important for a Fast Data Architecture. We will install Spark on our Mesos cluster and run a sample Spark application.

Fast Data Series Articles

  1. Installing Apache Mesos 1.6.0 on Ubuntu 18.04
  2. Kafka Tutorial for Fast Data Architecture
  3. Kafka Python Tutorial for Fast Data Architecture
  4. Apache Spark Tutorial | Fast Data Architecture Series

Video Introduction

Check out my Apache Spark Tutorial Video on YouTube:

What is Apache Spark?

Apache Spark is a unified computing engine and a collection of libraries that help data scientists analyze big data.  Unified means that Spark aims to support many different data analysis tasks, ranging from SQL queries to machine learning and graph processing.  Before Spark, Hadoop MapReduce was the dominant player in data analysis platforms.  Spark was developed to remedy some of the issues identified with Hadoop MapReduce.

There are several components to Spark that make up the unified computing engine model, as outlined in the Apache Spark documentation.

Spark SQL – This is a library that allows data scientists to analyze structured data using simple SQL queries.

Spark Streaming – A library that processes streaming data in near real-time using micro-batches.

MLlib – A machine learning library for Spark

GraphX – A library that adds graph processing functionality to Spark

We will cover these in much more detail in future articles.  For now, we will install Apache Spark on the Mesos cluster that we set up in previous articles in the Fast Data Architecture Series.

Apache Spark Architecture

Spark consists of a driver program that manages the Spark application.  Spark driver programs can be written in many languages, including Python, Scala, Java, and R.  The driver program splits the application into tasks and schedules those tasks to run on executors.  You can run Spark applications on many cluster managers, including Apache Mesos, Kubernetes, or Spark's standalone mode.  In this Apache Spark tutorial we will deploy the Spark driver program to a Mesos cluster and run an example application that comes with Spark to test it.

The Spark driver program schedules work on Spark executors, and the executors actually carry out the work of the Spark application.  In our Mesos environment these executors are scheduled on Mesos nodes and are short-lived: they are created, carry out their assigned tasks, report their status back to the driver, and are then destroyed.
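The division of labor described above can be sketched in plain Python, without a cluster. This is only a rough local analogy: the "driver" partitions the work and a thread pool plays the role of the executors (real Spark executors are separate processes scheduled by the cluster manager, not threads).

```python
# A rough, local analogy of the Spark driver/executor split.
# The "driver" partitions the job; a thread pool plays the executors.
from concurrent.futures import ThreadPoolExecutor

def executor_task(partition):
    # Each "executor" computes a partial result for its partition.
    return sum(x * x for x in partition)

def driver(data, num_executors=4):
    # The "driver" splits the data into partitions, schedules them on
    # the "executors", and aggregates the partial results it gets back.
    size = max(1, len(data) // num_executors)
    partitions = [data[i:i + size] for i in range(0, len(data), size)]
    with ThreadPoolExecutor(max_workers=num_executors) as pool:
        return sum(pool.map(executor_task, partitions))

if __name__ == "__main__":
    print(driver(list(range(10))))  # sum of squares 0..9 = 285
```

The important idea is the shape of the interaction: the driver holds the plan and the final result, while the workers only ever see their own partition.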

Install Apache Spark

Run the following commands as root on your Mesos masters and on all of your Mesos slaves.  You will also want to install Spark on your local development system.

# wget http://apache.claz.org/spark/spark-2.3.1/spark-2.3.1-bin-hadoop2.7.tgz
# mkdir -p /usr/local/spark && tar -xvf spark-2.3.1-bin-hadoop2.7.tgz -C /usr/local/spark --strip-components=1
# chown root:root -R /usr/local/spark/
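Optionally, you can add Spark's bin directory to your PATH so that spark-submit and spark-class can be run from anywhere. This is a convenience only, and the sketch below assumes the /usr/local/spark install location used above:

```shell
# /etc/profile.d/spark.sh — make the Spark binaries available on the PATH
export SPARK_HOME=/usr/local/spark
export PATH="$PATH:$SPARK_HOME/bin"
```

Log out and back in (or source the file) for the change to take effect.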


This gives us a binary installation of Apache Spark that we can use to deploy Spark applications on Mesos.  In the next section, we will create a SystemD service that runs the Spark cluster dispatcher.

Create a SystemD Service Definition

When we run Spark in cluster mode on a Mesos cluster, we need the Spark Dispatcher running; it schedules our Spark applications on Mesos.  In this section we will create a SystemD service definition that we will use to manage the Spark Dispatcher as a service.

On one of your Mesos Masters create a new file /etc/systemd/system/spark.service and add the following contents:

[Unit]
Description=Spark Dispatcher Service
After=mesos-master.service
Requires=mesos-master.service

[Service]
Environment=MESOS_NATIVE_JAVA_LIBRARY=/usr/local/lib/libmesos.so
ExecStart=/usr/local/spark/bin/spark-class org.apache.spark.deploy.mesos.MesosClusterDispatcher --master mesos://192.168.1.30:5050

[Install]
WantedBy=multi-user.target


This file configures the Spark dispatcher service to start after the mesos-master service.  Also notice that we specify the IP address and port of our Mesos master; be sure to change these to match your own Mesos master.  Now we can reload systemd, start the service, and enable it at boot:

# systemctl daemon-reload
# systemctl start spark.service
# systemctl enable spark.service


You can make sure it is running using this command:

# systemctl status spark.service


If everything is working correctly, the service will be reported as active (running).  The next part of our Apache Spark tutorial is to test our Spark deployment!

Testing Spark

Now that we have our Spark Dispatcher service running on our Mesos cluster, we can test it by running an example job.  We will use an example application that ships with Spark and calculates Pi for us.

$ /usr/local/spark/bin/spark-submit --name SparkPiTestApp --class org.apache.spark.examples.SparkPi --master mesos://192.168.1.30:7077 --deploy-mode cluster --executor-memory 1G --total-executor-cores 30 /usr/local/spark/examples/jars/spark-examples_2.11-2.3.1.jar 100


You will see that our example is scheduled:

2018-07-11 16:44:12 WARN  NativeCodeLoader:62 - Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
2018-07-11 16:44:12 INFO  RestSubmissionClient:54 - Submitting a request to launch an application in mesos://192.168.1.30:7077.
2018-07-11 16:44:13 INFO  RestSubmissionClient:54 - Submission successfully created as driver-20180711164412-0001. Polling submission state...
2018-07-11 16:44:13 INFO  RestSubmissionClient:54 - Submitting a request for the status of submission driver-20180711164412-0001 in mesos://192.168.1.30:7077.
2018-07-11 16:44:13 INFO  RestSubmissionClient:54 - State of driver driver-20180711164412-0001 is now RUNNING.
2018-07-11 16:44:13 INFO  RestSubmissionClient:54 - Server responded with CreateSubmissionResponse:
{
  "action" : "CreateSubmissionResponse",
  "serverSparkVersion" : "2.3.1",
  "submissionId" : "driver-20180711164412-0001",
  "success" : true
}
2018-07-11 16:44:13 INFO  ShutdownHookManager:54 - Shutdown hook called
2018-07-11 16:44:13 INFO  ShutdownHookManager:54 - Deleting directory /tmp/spark-4edf5319-8ff1-45bc-b4ad-56b1291f4125


To see the output, we need to look at the sandbox for our job in Mesos.  Go to your Mesos web interface at http://{mesos-ip}:5050.  You should see a task named Driver for SparkPiTestApp under Completed Tasks; this is our job.

Apache Spark Tutorial - Spark Dispatcher Deployment

Click on the Sandbox link for our job, then click on the stdout link to see the logging for our application.  You will see that it calculated Pi for us:

2018-07-11 16:44:20 INFO  DAGScheduler:54 - ResultStage 0 (reduce at SparkPi.scala:38) finished in 3.485 s
2018-07-11 16:44:20 INFO  DAGScheduler:54 - Job 0 finished: reduce at SparkPi.scala:38, took 3.551479 s
Pi is roughly 3.1418855141885516
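The SparkPi example arrives at that number with a simple Monte Carlo method: it throws random points at the unit square and counts how many land inside the quarter circle, since that fraction approaches pi/4. The core calculation can be sketched in plain Python (the real example distributes the dart-throwing across executors; the function name and seed below are our own, for illustration only):

```python
import random

def estimate_pi(num_samples, seed=42):
    # Throw random points at the unit square; the fraction landing
    # inside the quarter circle (x^2 + y^2 <= 1) approaches pi/4.
    rng = random.Random(seed)
    inside = 0
    for _ in range(num_samples):
        x, y = rng.random(), rng.random()
        if x * x + y * y <= 1.0:
            inside += 1
    return 4.0 * inside / num_samples

if __name__ == "__main__":
    print(estimate_pi(100_000))  # prints an estimate close to 3.14159
```

The `100` argument passed to spark-examples on the spark-submit line plays a similar role to `num_samples` here: it controls how many partitions of random samples are generated, so a larger value yields a slightly more accurate estimate.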


Conclusion

This Apache Spark tutorial simply demonstrated how to get Apache Spark installed.  The true power of Spark lies in the APIs it provides for writing powerful analytical applications that process your raw data and produce meaningful results you can use to make real-world business decisions.  Don’t miss the next several articles, where we cover how to write Spark applications using the Python API and continue our exploration of the SMACK Stack.  If you haven’t already, please sign up for my weekly newsletter so you get updates when I release new articles.  Thanks for reading this tutorial.  If you liked it or hated it, then please leave a comment below.