Big Data Python: 3 Big Data Analytics Tools

In this post, we will discuss 3 awesome big data Python tools to increase your big data programming skills using production data.


In this article, I assume that you are running Python in its own environment using virtualenv, pyenv, or some other variant.

The examples in this article make use of IPython, so make sure you have it installed if you want to follow along.

$ mkdir python-big-data
$ cd python-big-data
$ virtualenv ../venvs/python-big-data
$ source ../venvs/python-big-data/bin/activate
$ pip install ipython
$ pip install pandas
$ pip install pyspark
$ pip install scikit-learn
$ pip install scipy

Now let’s get some data to play around with.

Python Data

I will use some sample data throughout the examples in this article.

The Python Data that we will be using are actual production logs from this website, collected over a couple of days.  This data isn’t technically big data yet because it is only about 2 MB in size, but it will work great for our purposes.

I have to beef up my infrastructure a bit in order to get big-data-sized samples (> 1 TB).

To get the sample data you can use git to pull it down from my public GitHub repo: admintome/access-log-data

$ git clone

The data is a simple CSV file, so each line represents an individual log entry with the fields separated by commas:

2018-08-01 17:10,'www2','www_access',' - - [01/Aug/2018:17:10:15 +0000] "GET /wp-content/uploads/2018/07/spark-mesos-job-complete-1024x634.png HTTP/1.0" 200 151587 "" "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_12_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/67.0.3396.99 Safari/537.36"'

Here is the schema for a log line:

Sample Big Data Schema
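
To make the four fields concrete, here is a minimal sketch that splits the sample line above using only the Python standard library.  The field names are the ones used later when loading the file with Pandas.  The regular expression for the request part of the log entry is my own assumption about the format, not something shipped with the data.

```python
import csv
import re

# The sample log line shown above, copied as-is.
line = """2018-08-01 17:10,'www2','www_access',' - - [01/Aug/2018:17:10:15 +0000] "GET /wp-content/uploads/2018/07/spark-mesos-job-complete-1024x634.png HTTP/1.0" 200 151587 "" "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_12_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/67.0.3396.99 Safari/537.36"'"""

# Fields are quoted with single quotes, so the csv reader has to be told.
datetime_str, source, log_type, log = next(csv.reader([line], quotechar="'"))

# Hypothetical pattern for the request portion of an access-log entry.
LOG_PATTERN = re.compile(
    r'\[(?P<timestamp>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<path>\S+) (?P<protocol>[^"]+)" '
    r'(?P<status>\d{3}) (?P<size>\d+)'
)
match = LOG_PATTERN.search(log)
request = match.groupdict() if match else {}

print(datetime_str, source, log_type)
print(request)
```

Note that the user agent contains a comma ("KHTML, like Gecko"), which is why the quote character matters: without it the line would be split into too many fields.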

Now that we have the data we are going to use, let’s check out 3 big data Python tools.

Because of the complexity of the many operations that can be performed on data, this article will focus on demonstrating how to load our data and get a small sample of the data.

For each tool listed, I will give links to find out more information.

Python Pandas

The first tool we will discuss is Python Pandas.  As its website states, Pandas is an open source Python Data Analysis Library.

Let’s fire up IPython and do some operations on our sample data.

import pandas as pd

headers = ["datetime", "source", "type", "log"]
df = pd.read_csv('access_logs_parsed.csv', quotechar="'", names=headers)

After about a second it should respond back with:

[6844 rows x 4 columns]

In [3]: 

As you can see, we have about 7000 rows of data, and Pandas found four columns, which matches the schema described above.

Pandas created a DataFrame object representing our CSV file automatically!  Let’s check out a sample of the data imported with the head() function.

In [11]: df.head()
           datetime source        type                                                log
0  2018-08-01 17:10   www2  www_access - - [01/Aug/2018:17:10:15 +0000]...
1  2018-08-01 17:10   www2  www_access - - [01/Aug/2018:17:10:15 +000...
2  2018-08-01 17:10   www2  www_access - - [01/Aug/2018:17:10:22 +000...
3  2018-08-01 17:10   www2  www_access - - [01/Aug/2018:17:10:50 +0000]...
4  2018-08-01 17:11   www2  www_access - - [01/Aug/2018:17:11:11 +0000]...

There is a ton you can do with Python Pandas and Big Data.  Python alone is great for munging your data and getting it prepared.  Now with Pandas you can do data analytics in Python as well.  Data scientists typically use Python Pandas together with IPython to interactively analyze huge data sets and gain meaningful business intelligence from that data.  Check out the Pandas website for more information.
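
As one small example of that kind of analysis, the sketch below pulls the HTTP method, path, status code, and response size out of the log column.  To keep it self-contained it reads the single sample line shown earlier from memory rather than from the CSV file, and the extraction pattern is an assumption about the log format.

```python
import io

import pandas as pd

# The sample line from earlier in the article, read from memory.
raw = """2018-08-01 17:10,'www2','www_access',' - - [01/Aug/2018:17:10:15 +0000] "GET /wp-content/uploads/2018/07/spark-mesos-job-complete-1024x634.png HTTP/1.0" 200 151587 "" "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_12_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/67.0.3396.99 Safari/537.36"'\n"""

headers = ["datetime", "source", "type", "log"]
df = pd.read_csv(io.StringIO(raw), quotechar="'", names=headers)

# Hypothetical pattern; each named group becomes a column in the result.
pattern = r'"(?P<method>[A-Z]+) (?P<path>\S+) [^"]*" (?P<status>\d{3}) (?P<size>\d+)'
request = df["log"].str.extract(pattern)
request["size"] = request["size"].astype(int)

print(request)
# Count how often each status code appears.
print(request["status"].value_counts())
```

Run against the full file, the same two calls would give a per-status breakdown of every request in the logs.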


PySpark

The next tool we will talk about is PySpark.  This is a library from the Apache Spark project for Big Data Analytics.

PySpark gives us a lot of functionality for Analyzing Big Data in Python.  It comes with its own shell that you can run from the command line.

$ pyspark

This loads the pyspark shell.

(python-big-data) [email protected]:~/Development/access-log-data$ pyspark
Python 3.6.5 (default, Apr  1 2018, 05:46:30) 
[GCC 7.3.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
2018-08-03 18:13:38 WARN  Utils:66 - Your hostname, admintome resolves to a loopback address:; using instead (on interface enp0s3)
2018-08-03 18:13:38 WARN  Utils:66 - Set SPARK_LOCAL_IP if you need to bind to another address
2018-08-03 18:13:39 WARN  NativeCodeLoader:62 - Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /__ / .__/\_,_/_/ /_/\_\   version 2.3.1

Using Python version 3.6.5 (default, Apr  1 2018 05:46:30)
SparkSession available as 'spark'.

When you start the shell, you also get a web GUI for checking the status of your jobs.  Simply browse to http://localhost:4040 and you will get the PySpark Web GUI.

PySpark - Python tool for big data

Let’s use the PySpark Shell to load our sample data.

dataframe = spark.read.format("csv").option("header","false").option("mode","DROPMALFORMED").option("quote","'").load("access_logs.csv")
dataframe.show()

PySpark will give us a sample of the DataFrame that was created.

|             _c0| _c1|       _c2|                 _c3|
|2018-08-01 17:10|www2|www_access| - -...|
|2018-08-01 17:10|www2|www_access| -...|
|2018-08-01 17:10|www2|www_access| -...|
|2018-08-01 17:10|www2|www_access| - -...|
|2018-08-01 17:11|www2|www_access| - -...|
|2018-08-01 17:11|www2|www_access| - -...|
|2018-08-01 17:11|www2|www_access| - -...|
|2018-08-01 17:12|www2|www_access| - - [...|
|2018-08-01 17:12|www2|www_access| - -...|
|2018-08-01 17:12|www2|www_access| - - [...|
|2018-08-01 17:12|www2|www_access| - -...|
|2018-08-01 17:14|www2|www_access| - -...|
|2018-08-01 17:14|www2|www_access| - -...|
|2018-08-01 17:14|www2|www_access| - - ...|
|2018-08-01 17:15|www2|www_access| - - ...|
|2018-08-01 17:18|www2|www_access| - - [...|
|2018-08-01 17:18|www2|www_access| - ...|
|2018-08-01 17:19|www2|www_access| - - [...|
|2018-08-01 17:19|www2|www_access| - -...|
|2018-08-01 17:19|www2|www_access| - - ...|
only showing top 20 rows

Again we can see that there are four columns in our DataFrame, which matches our schema.  A Spark DataFrame is a distributed collection of data organized into named columns and can be thought of like a database table or Excel spreadsheet.

Now on to our last tool.

Python SciKit-Learn

Any discussion on big data will invariably lead to a discussion about Machine Learning.  And luckily for us Python developers we have plenty of options to make use of Machine Learning algorithms.

Without going into too much detail on Machine Learning, we need to get some data to perform learning on.  The sample data I have provided in this article doesn’t work well as-is because it is not numerical data.  We would need to manipulate the data and convert it into a numerical format, which is beyond the scope of this article.  For example, we could group the log entries by time to get a DataFrame with two columns, the current minute and the number of logs in that minute:

| 2018-08-01 17:10 | 4 |
| 2018-08-01 17:11 | 1 |

With our data in this form we can perform Machine Learning to predict the number of visitors we are likely to get in a future time.  But like I mentioned, that is outside of the scope of this article.
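
Just the counting step can be sketched with Pandas in a few lines.  This is only an illustration under some assumptions: the rows are built in memory, the timestamps are the five shown in the head() output earlier, and the log values are placeholders rather than real entries.

```python
import io

import pandas as pd

# Timestamps from the head() output earlier; log text is a placeholder.
sample = io.StringIO(
    "2018-08-01 17:10,'www2','www_access','placeholder 1'\n"
    "2018-08-01 17:10,'www2','www_access','placeholder 2'\n"
    "2018-08-01 17:10,'www2','www_access','placeholder 3'\n"
    "2018-08-01 17:10,'www2','www_access','placeholder 4'\n"
    "2018-08-01 17:11,'www2','www_access','placeholder 5'\n"
)
headers = ["datetime", "source", "type", "log"]
df = pd.read_csv(sample, quotechar="'", names=headers)

# One row per minute, with the number of log entries in that minute.
per_minute = df.groupby("datetime").size().reset_index(name="count")
print(per_minute)
```

The count column is numerical, so a table like this is the kind of input a learning algorithm could work with.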

Luckily for us, SciKit-Learn comes with some sample data sets!  Let’s load some sample data and see what we can do.

In [1]: from sklearn import datasets

In [2]: iris = datasets.load_iris()

In [3]: digits = datasets.load_digits()

In [4]: print(digits.data)
[[ 0.  0.  5. ...  0.  0.  0.]
 [ 0.  0.  0. ... 10.  0.  0.]
 [ 0.  0.  0. ... 16.  9.  0.]
 [ 0.  0.  1. ...  6.  0.  0.]
 [ 0.  0.  2. ... 12.  0.  0.]
 [ 0.  0. 10. ... 12.  1.  0.]]

This loads two datasets that are commonly used with classification machine learning algorithms.
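
To give an idea of what classification looks like, here is a minimal sketch that trains a support vector classifier on the digits dataset and checks it against images it has not seen.  The choice of classifier, the gamma value, and the 75/25 split are my own assumptions for illustration, not a tuned setup.

```python
from sklearn import datasets, svm
from sklearn.model_selection import train_test_split

digits = datasets.load_digits()

# Hold back a quarter of the images so we can test on unseen data.
X_train, X_test, y_train, y_test = train_test_split(
    digits.data, digits.target, test_size=0.25, random_state=0
)

# Support vector classifier; gamma is picked by hand here.
clf = svm.SVC(gamma=0.001)
clf.fit(X_train, y_train)

# Fraction of held-out digits that were classified correctly.
accuracy = clf.score(X_test, y_test)
print(f"accuracy on held-out digits: {accuracy:.3f}")
```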

Check out the SciKit-Learn Basic Tutorial for more information.


Given these three Python Big Data tools, Python is a major player in the Big Data game along with R and Scala.
