Kafka Tutorial for Fast Data Architecture

In this installment of my Fast Data Architecture series I will be giving you a Kafka tutorial.  We will install Kafka on our Mesos cluster using Marathon and create a couple of topics.

Fast Data Series Articles

  1. Installing Apache Mesos 1.6.0 on Ubuntu 18.04
  2. Kafka Tutorial for Fast Data Architecture
  3. Kafka Python Tutorial for Fast Data Architecture
  4. Apache Spark Tutorial | Fast Data Architecture Series

This is the second article in my Fast Data Architecture series where we are implementing a SMACK stack (Spark, Mesos, Akka, Cassandra, and Kafka) in order to build a Big Data infrastructure.

Kafka Tutorial

This article assumes that you already have a Mesos cluster up and running.  If you don't, then follow Installing Apache Mesos 1.6.0 on Ubuntu 18.04 and come back when you are done.

We will begin with what Kafka is and how it works.  Then we will walk through installing Marathon to run our Kafka Scheduler and configuring Kafka to run on your Mesos cluster.

What is Kafka?

Kafka is a pub/sub solution, that is, a publish/subscribe messaging provider.  Applications publish messages to a topic, and other applications subscribe to the topic and read the messages.

[Diagram: Kafka Message Flow]

The diagram above shows that we have a Kafka topic.  Applications on smartphones or websites can publish data to the topic; these are examples of publishers.  The big data analytics applications on the right subscribe to the topic and receive the messages in the order that the messages were received from the publishers; these applications are called subscribers.

Kafka stores the streams of data (topics) safely in a distributed, replicated, and fault-tolerant manner.  This means that it makes use of clustering, as we will see in the next section.

Each message in the topic has an offset that identifies the message.  Messages in a topic are immutable in that a publisher or subscriber can't delete them.  A big data analytics application can reference any message by its offset and as a result can start reading messages at any point in the topic.
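
To make that concrete: once the stock Kafka tools are built (we do that in the testing section below), a consumer can jump to any offset it likes.  A quick sketch, assuming a broker endpoint and topic like the ones we create later in this article:

$ bin/kafka-console-consumer.sh --bootstrap-server mslave1.admintome.lab:31000 --topic admintome --partition 0 --offset 5

This starts reading partition 0 of the topic at offset 5 instead of at the beginning or the end.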

Kafka Architecture

As mentioned above, Kafka is a distributed system and therefore can run on several nodes.  Each Kafka instance running on a node is called a Kafka Broker.  Brokers are managed by the Kafka Scheduler.

Kafka keeps track of everything using Zookeeper which is good news for us because we already configured Zookeeper when we installed Mesos in the last installment of this series!
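
Once a broker is running you can see this for yourself with the zookeeper-shell tool that ships with the Kafka binaries we build in the testing section below.  A sketch, assuming the stock znode layout:

$ bin/zookeeper-shell.sh 192.168.1.30:2181 ls /brokers/ids

This lists the IDs of the brokers currently registered in Zookeeper.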

Lastly, Kafka has an API that is used to manage the cluster.  It is through this API that we will create topics, partitions and more.

Speaking of partitions: a topic is split into one or more partitions, and Kafka distributes those partitions (and their replicas) across different Kafka brokers.  This is what gives Kafka its distributed and replicated features.  In the diagram above we could have a topic spread across the two brokers, with each broker holding some of the topic's partitions.
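
To illustrate, with the stock Kafka tools (which we build in the testing section below) you could create a topic with two partitions, each replicated on both brokers.  A sketch; the topic name and counts are just examples:

$ bin/kafka-topics.sh --create --topic example --partitions 2 --replication-factor 2 --zookeeper 192.168.1.30:2181,192.168.1.31:2181,192.168.1.32:2181

With two brokers, each broker typically leads one partition and holds a replica of the other, so losing a single broker loses no data.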

That covers the big rocks of what Kafka does and how it does it.  Now let's move on and get our hands dirty by implementing a Kafka cluster!  But first we need to install Marathon.

Marathon Installation

Marathon is a long-running task scheduler for Mesos.  Most jobs in Mesos are designed to start up, do their thing, and then be destroyed.  Our Kafka Scheduler needs to be a long-running process so that it can schedule Kafka Brokers and handle the API messaging.

Luckily for us, installing Marathon on your cluster is a pretty easy process.  Run the following commands on one of your Mesos masters:

$ curl -O http://downloads.mesosphere.com/marathon/v1.5.1/marathon-1.5.1.tgz
$ tar xzf marathon-1.5.1.tgz

This will download the latest version of Marathon (as of this writing) and untar the file.  Move the files to a more accessible location:

$ sudo mv marathon-1.5.1 /usr/local/marathon

Next, we will need to create a SystemD service definition so that we can run the service.  Create a new file called /etc/systemd/system/marathon.service and add the following contents:

[Unit]
Description=Marathon Service
After=mesos-master.service
Requires=mesos-master.service

[Service]
Environment=MESOS_NATIVE_JAVA_LIBRARY=/usr/local/lib/libmesos.so
ExecStart=/usr/local/marathon/bin/marathon --master zk://192.168.1.30:2181,192.168.1.31:2181,192.168.1.32:2181/mesos --zk zk://192.168.1.30:2181,192.168.1.31:2181,192.168.1.32:2181/marathon

[Install]
WantedBy=multi-user.target

Make sure that if the IPs of your Mesos masters are different, you update the IPs above to match.  Also notice that in the [Service] section we set an environment variable, MESOS_NATIVE_JAVA_LIBRARY, to point to where we have our libmesos.so.  Update that path if yours is in a different location for some reason.

You should now be able to start and enable your Marathon SystemD service.

# systemctl daemon-reload
# systemctl start marathon.service
# systemctl enable marathon.service
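
Before moving on, it is worth a quick sanity check that Marathon actually came up.  Nothing here is specific to this setup, just standard SystemD tooling:

# systemctl status marathon.service
# journalctl -u marathon.service -f

You should see the service active and Marathon connecting to Zookeeper in the log.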

Browse to the IP or hostname of the Mesos master you installed Marathon on using the following URL:

http://mesos1.admintome.lab:8080

You should see the Marathon GUI.

The last thing you will need to do to prepare Marathon to run our Kafka Scheduler is to ensure that you can run Docker containers on your Mesos Slaves.  Fortunately, I have an article for that: MESOS DOCKER CONTAINER CONFIGURATION.  Follow the instructions in that article in order to enable Docker containerization on your Mesos Slaves; the short version is sketched below.
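
For reference, on Ubuntu with the Mesosphere packages this usually boils down to enabling the Docker containerizer on each slave and restarting it.  Consider this a sketch of what that article covers, not a substitute for it:

# echo 'docker,mesos' > /etc/mesos-slave/containerizers
# systemctl restart mesos-slave.service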

Congratulations!  You now have Marathon installed and we are ready to get Kafka installed and configured.

Kafka Installation

Now that Marathon is ready to go we can begin installing Kafka.  The first step is to grab the mesos/kafka source from GitHub:

# git clone https://github.com/mesos/kafka.git
# cd kafka

In order to run the Kafka Scheduler on Marathon we will need to build a Docker container.  The project provides everything you need for this in the src/docker directory.

# cd src/docker
# ./build-image.sh

Once complete you will have a Docker image named {username}/kafka-mesos ready on your local system.

Now all you have to do is push it to your Docker repository and you are ready.  If you don't want to go through all that, I have a copy on my Docker Hub:

https://hub.docker.com/r/admintome/kafka-mesos/
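
If you would rather publish your own build, the push is the standard Docker workflow.  The repository name below is a placeholder; substitute your own Docker Hub username:

$ docker login
$ docker tag {username}/kafka-mesos yourhubuser/kafka-mesos
$ docker push yourhubuser/kafka-mesos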

Just run the following command on all your Mesos slaves (after you have Docker installed, if you followed my other article, MESOS DOCKER CONTAINER CONFIGURATION):

# docker pull admintome/kafka-mesos

Or, if you built your own container, be sure to pull it on all your Mesos Slaves.  Next, in Marathon, click on the Create Application button.  The New Application dialog will pop up.  Click the selector for JSON Mode:

Replace all the JSON there with the JSON below:

{
  "id": "kafka-mesos-scheduler",
  "cmd": "./kafka-mesos.sh scheduler --master=zk://192.168.1.30:2181,192.168.1.31:2181,192.168.1.32:2181/mesos --zk=192.168.1.30:2181,192.168.1.31:2181,192.168.1.32:2181 api=http://mslave2:7000 --storage=zk:/kafka-mesos",
  "cpus": 0.5,
  "mem": 256,
  "disk": 0,
  "instances": 1,
  "container": {
    "docker": {
      "image": "admintome/kafka-mesos"
    },
    "type": "DOCKER",
    "volumes": []
  },
  "networks": [
    {
      "mode": "host"
    }
  ],
  "portDefinitions": [
    {
      "port": 0,
      "protocol": "tcp",
      "name": null,
      "labels": null
    }
  ],
  "env": {},
  "healthChecks": [],
  "labels": {},
  "constraints": [
    [
      "hostname",
      "LIKE",
      "mslave2"
    ]
  ]
}

Keen observers will note the constraints section at the bottom.  I am telling Marathon to only schedule the Kafka Scheduler on mslave2.admintome.lab.  The reason is the --api parameter above it on the cmd line: I need to know what IP to use in order to talk to the Kafka API, so I hard-coded it here to the IP of mslave2.admintome.lab, which is 192.168.1.34.  There may be a way around this, and if you know that way then please comment.  I would love to be able to have multiple copies of the Kafka Scheduler running for redundancy, but I didn't want to take the time to set up a reverse proxy.

Finally, click on the Create Application button at the bottom of the dialog and you should see the Kafka Scheduler transition to a Running state.
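
If you prefer the command line to the GUI, you can POST the same JSON to Marathon's REST API.  A sketch, assuming you saved the JSON above to a file called kafka-scheduler.json:

$ curl -X POST http://mesos1.admintome.lab:8080/v2/apps -H 'Content-Type: application/json' -d @kafka-scheduler.json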

Next we will need to configure a broker.

Creating a Kafka Broker

We will be creating a Kafka Broker using the command line, which contacts the Kafka API.  Note that adding a broker only defines it; starting it comes next.  Run the command below (from the kafka directory we cloned earlier) to add a broker:

$ ./kafka-mesos.sh broker add 0 --api=http://mslave2.admintome.lab:7000
broker added:
  id: 0
  active: false
  state: stopped
  resources: cpus:1.00, mem:2048, heap:1024, port:auto
  jvm-options: 
  failover: delay:1m, max-delay:10m
  stickiness: period:10m
  metrics: 
    collected: 1970-01-01 00:00:00Z
    under-replicated-partitions: 0
    offline-partitions-count: 0
    is-active-controller: 0
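
Notice the default resources line in the output above.  If those defaults don't fit your slaves, the scheduler accepts resource flags when adding a broker.  A hedged sketch; verify the exact flag names for your version with ./kafka-mesos.sh help broker add:

$ ./kafka-mesos.sh broker add 1 --cpus 0.5 --mem 1024 --heap 512 --api=http://mslave2.admintome.lab:7000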

List all the brokers and you should see the one you just created.

$ ./kafka-mesos.sh broker list --api=http://mslave2.admintome.lab:7000
broker:
  id: 0
  active: false
  state: stopped
  resources: cpus:1.00, mem:2048, heap:1024, port:auto
  jvm-options: 
  failover: delay:1m, max-delay:10m
  stickiness: period:10m
  metrics: 
    collected: 1970-01-01 00:00:00Z
    under-replicated-partitions: 0
    offline-partitions-count: 0
    is-active-controller: 0

Next, we need to start the broker:

$ ./kafka-mesos.sh broker start 0 --api=http://mslave2.admintome.lab:7000
broker started:
  id: 0
  active: true
  state: running
  resources: cpus:1.00, mem:2048, heap:1024, port:auto
  jvm-options: 
  failover: delay:1m, max-delay:10m
  stickiness: period:10m, hostname:mslave1.admintome.lab
  task: 
    id: kafka-0-9abed3fe-9cc4-4b32-9dbe-9b86937684f3
    state: running
    endpoint: mslave1.admintome.lab:31000
  metrics: 
    collected: 2018-06-24 02:20:08Z
    under-replicated-partitions: 0
    offline-partitions-count: 0
    is-active-controller: 1

If you wanted to add multiple brokers, use a command line like the following:

$ ./kafka-mesos.sh broker add 0..2 --api=http://mslave2.admintome.lab:7000

That command will add three brokers (IDs 0 through 2); remember that added brokers still need to be started, as shown below.  For the curious (or brave) you can check out your new broker as a new process running in Mesos.
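
The same 0..2 range expression should work with the other broker subcommands, so you could then start all three at once.  A sketch:

$ ./kafka-mesos.sh broker start 0..2 --api=http://mslave2.admintome.lab:7000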

Creating a Topic

Next we need to create a topic.  Use the following command to create a topic:

$ ./kafka-mesos.sh topic add admintome --broker=0 --api=http://mslave2.admintome.lab:7000
topic added:
  name: admintome
  partitions: 0:[0]
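
You can verify the topic from the scheduler side too.  The mesos/kafka CLI also has a topic list subcommand mirroring broker list; a quick sketch:

$ ./kafka-mesos.sh topic list --api=http://mslave2.admintome.lab:7000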

Awesome!  We now have a new Kafka topic called ‘admintome’ that we can push messages to from a publisher and consume with a subscriber.  Usually, this would mean writing something in your favorite language to interact with your topic.  However, if you just want to test everything out, then keep reading.

Testing your Kafka topic

The Mesos-Kafka repository doesn't come with everything you will need to interact with Kafka.  You will still need to download the Kafka binaries in order to do things like test our new topic.  You can run the commands in this section on any system you want; I used my development system.  Just make sure you have Java installed.

$ wget https://github.com/apache/kafka/archive/1.1.0.tar.gz
$ tar -xzvf 1.1.0.tar.gz
$ cd kafka-1.1.0/

Now we need to build the code.  Kafka is written in Scala, so we need to install Scala first along with Gradle.

$ sudo apt install scala gradle

Now we need to build everything:

$ gradle
$ ./gradlew -PscalaVersion=2.11 releaseTarGz
$ sudo mkdir /usr/local/kafka
$ sudo tar -xvf core/build/distributions/kafka_2.11-1.1.0.tgz -C /usr/local/kafka

The great thing about Kafka is that it uses Zookeeper to keep track of everything.  This means we just need to point our commands to the Zookeeper URI and the Kafka tools below will be able to find everything!  Let’s list our topics as a demonstration.  We should see our admintome topic we created earlier:

$ sudo /usr/local/kafka/kafka_2.11-1.1.0/bin/kafka-topics.sh --list --zookeeper 192.168.1.30:2181,192.168.1.31:2181,192.168.1.32:2181
admintome

Awesome!  It shows us the topic that we created earlier, just as expected.  Next we need to create a publisher in order to push some messages to our topic.  Remember that broker list command we used in Mesos-Kafka?  That had a very important line that we will need here.  Run the command again.

broker:
  id: 0
  active: true
  state: running
  resources: cpus:1.00, mem:2048, heap:1024, port:auto
  jvm-options: 
  failover: delay:1m, max-delay:10m
  stickiness: period:10m, hostname:mslave1.admintome.lab
  task: 
    id: kafka-0-9abed3fe-9cc4-4b32-9dbe-9b86937684f3
    state: running
    endpoint: mslave1.admintome.lab:31000
  metrics: 
    collected: 2018-06-24 03:14:08Z
    under-replicated-partitions: 0
    offline-partitions-count: 0
    is-active-controller: 1

We are going to need the endpoint value there so that we can tell the kafka-console-producer script how to reach our Kafka Broker that we created.

Next, run the command below, replacing the --broker-list parameter with the endpoint value from above.  In my case it is mslave1.admintome.lab:31000.

$ sudo /usr/local/kafka/kafka_2.11-1.1.0/bin/kafka-console-producer.sh --broker-list mslave1.admintome.lab:31000 --topic admintome
>test
>im done

Hit CTRL-D when you are done.  We have just sent a couple of messages to our topic and they sit waiting for us to pull them.

Finally, let's run a consumer and pull our messages!

$ sudo /usr/local/kafka/kafka_2.11-1.1.0/bin/kafka-console-consumer.sh --bootstrap-server mslave1.admintome.lab:31000 --topic admintome --from-beginning
test
im done
^CProcessed a total of 2 messages

There you have it!  You pulled all the messages from your topic.

We now have the first piece of our Fast Data SMACK stack: a distributed way to ingest and consume data.  Our next step is to analyze that data so that we can begin to make informed business decisions with it.  The next couple of articles in this series will cover installing and configuring Apache Spark and touch on some data analytics.  Join my mailing list below to get my latest articles.  Thanks for reading, and I truly hope you have enjoyed it as much as I have enjoyed writing it.  Please leave a comment below to let me know what you think.

Conclusion

I hope you have enjoyed this article; if so, please leave a comment below.  For more articles, please sign up for the AdminTome Blog below.  Also, please feel free to share the article with your friends using the buttons to the left.  Thanks again for reading this post.