Continuing our Fast Data Architecture series, we will install Cassandra on Ubuntu 18.04 and configure it to run as a SystemD service.
Fast Data Series Articles
- Installing Apache Mesos 1.6.0 on Ubuntu 18.04
- Kafka Tutorial for Fast Data Architecture
- Kafka Python Tutorial for Fast Data Architecture
- Apache Spark Tutorial | Fast Data Architecture Series
- Install Cassandra on Ubuntu 18.04
What is Cassandra?
Cassandra is a distributed database that is highly available and can store massive amounts of data. Large companies like Netflix, Apple, and eBay use it to manage their data while serving millions of requests a day.
Cassandra is very fault-tolerant. It can be scaled to hundreds or thousands of nodes, with data replicated automatically between them. Cassandra also supports replication across data centers, so even if you lose an entire data center your data will be safe. Best of all, Cassandra is decentralized, meaning that there is no single point of failure, and failed nodes can be replaced without any downtime.
Where does Apache Cassandra fit in our SMACK stack? Cassandra fills the role of long-term database storage. We use Spark to perform data analytics in memory, which is super fast, but we still need to store some long-term data on disk. This is where Cassandra comes in. Applications developed for our SMACK stack will use Cassandra to store data on disk.
Install Cassandra on Ubuntu 18.04
To install Cassandra on Ubuntu 18.04 we need to get some prerequisites out of the way first. For this exercise I created a virtual machine with 4 vCPUs, 1 GB of memory, and 200 GB of hard drive space. Since this is a lab system I wanted to keep it modest. Make sure that your server is fully updated.
# apt update && apt upgrade -y
Next, we will need to ensure we have Java 8 and Python 2.7 installed.
# apt install openjdk-8-jdk -y
# apt install python -y
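Before moving on, it can help to confirm that the expected runtimes are the ones actually found on the PATH. The following is a minimal sketch, assuming the version banners look like the usual openjdk version "1.8.0_..." and Python 2.7.x strings. The function names is_java8 and is_python27 are hypothetical helpers, not part of any package.

```shell
#!/bin/sh
# Hypothetical helpers: each takes a version banner as its only argument
# and succeeds when the banner matches the release this tutorial expects.

# Succeeds for banners such as: openjdk version "1.8.0_171"
is_java8() {
    case "$1" in
        *'version "1.8.'*) return 0 ;;
        *) return 1 ;;
    esac
}

# Succeeds for banners such as: Python 2.7.15
is_python27() {
    case "$1" in
        'Python 2.7.'*) return 0 ;;
        *) return 1 ;;
    esac
}

# Example (both tools print their banner to stderr, hence 2>&1):
#   is_java8 "$(java -version 2>&1 | head -n 1)" && echo "Java 8 found"
#   is_python27 "$(python --version 2>&1)" && echo "Python 2.7 found"
```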
In the next section, we will begin installing Cassandra.
We will download the latest version of Cassandra, which as of this writing is version 3.11.2:
# wget http://apache.claz.org/cassandra/3.11.2/apache-cassandra-3.11.2-bin.tar.gz
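Since the archive comes from a mirror over plain HTTP, it is worth checking it against a published checksum before unpacking it. The sketch below assumes a SHA-256 digest is available for the release you downloaded and that the sha256sum tool from GNU coreutils is installed. The function name verify_sha256 and the EXPECTED_DIGEST variable are hypothetical.

```shell
#!/bin/sh
# Hypothetical helper: compare a file's SHA-256 digest with an expected value.
#   $1 = file to check, $2 = expected digest in lowercase hex
verify_sha256() {
    file=$1
    expected=$2
    # sha256sum prints "<digest>  <filename>"; keep only the digest.
    actual=$(sha256sum "$file" | awk '{print $1}')
    [ "$actual" = "$expected" ]
}

# Example (EXPECTED_DIGEST is a placeholder; copy the real value from the
# checksum published alongside the release on the Apache site):
#   verify_sha256 apache-cassandra-3.11.2-bin.tar.gz "$EXPECTED_DIGEST" \
#       && echo "checksum OK" || echo "checksum MISMATCH"
```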
Untar the package and move it to a better home:
# tar -xzvf apache-cassandra-3.11.2-bin.tar.gz
# mv apache-cassandra-3.11.2 /usr/local/cassandra
Next we will create a user for Cassandra to run as.
Creating a Cassandra User
Cassandra should not run as root, so we will create a cassandra user and group and set the correct permissions:
# groupadd cassandra
# useradd -g cassandra cassandra
# chown root:cassandra -R /usr/local/cassandra/
# chmod g+w -R /usr/local/cassandra/
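The ownership and mode changes can be double-checked before starting the server. The following is a small sketch, assuming find is available with the -group and -perm tests (as it is on Ubuntu). The function name check_group_access is a hypothetical helper.

```shell
#!/bin/sh
# Hypothetical helper: succeed only if every file and directory under $1
# belongs to group $2 and has the group-write bit set.
check_group_access() {
    dir=$1
    group=$2
    # Collect every entry that violates either condition.
    # find itself fails (for example on an unknown group), so report that too.
    offenders=$(find "$dir" \( ! -group "$group" -o ! -perm -020 \) -print) || return 2
    if [ -n "$offenders" ]; then
        echo "unexpected ownership or mode:" >&2
        echo "$offenders" | head -n 5 >&2
        return 1
    fi
    return 0
}

# Example:
#   check_group_access /usr/local/cassandra cassandra && echo "permissions OK"
```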
We can test to make sure that our permissions are correct by starting Cassandra as our new cassandra user.
# su - cassandra
$ /usr/local/cassandra/bin/cassandra -f
You should see a bunch of logs scroll by ending with something like the line below:
INFO [main] 2018-07-22 16:00:13,807 CassandraDaemon.java:529 - Not starting RPC server as requested. Use JMX (StorageService->startRPCServer()) or nodetool (enablethrift) to start it
Open another terminal and log in as your normal user. Cassandra comes with a command-line client called cqlsh, which speaks CQL (the Cassandra Query Language). We can use it to connect to our server and verify that everything is working.
$ /usr/local/cassandra/bin/cqlsh localhost
Connected to Test Cluster at localhost:9042.
[cqlsh 5.0.1 | Cassandra 3.11.2 | CQL spec 3.4.4 | Native protocol v4]
Use HELP for help.
cqlsh>
Run the following CQL query:
cqlsh> select cluster_name, listen_address from system.local;

 cluster_name | listen_address
--------------+----------------
 Test Cluster |      127.0.0.1

(1 rows)
cqlsh>
This shows that the cqlsh command-line client successfully connected to your Cassandra database server and ran a query. Next, we will configure Cassandra to run as a SystemD service.
Cassandra SystemD Service
Create a new file /etc/systemd/system/cassandra.service and add the following contents:
[Unit]
Description=Cassandra Database Service
After=network-online.target
Requires=network-online.target

[Service]
User=cassandra
Group=cassandra
ExecStart=/usr/local/cassandra/bin/cassandra -f

[Install]
WantedBy=multi-user.target
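If you provision more than one node, creating the unit file by hand gets tedious. The following is a minimal sketch of a shell function that writes the same unit file to a path of your choice. The function name write_cassandra_unit and its optional arguments are hypothetical; the defaults match the user and install directory used in this article.

```shell
#!/bin/sh
# Hypothetical helper: write the unit file from this article to $1.
#   $2 = account and group to run as (default: cassandra)
#   $3 = Cassandra install directory (default: /usr/local/cassandra)
write_cassandra_unit() {
    dest=$1
    run_as=${2:-cassandra}
    install_dir=${3:-/usr/local/cassandra}
    cat > "$dest" <<EOF
[Unit]
Description=Cassandra Database Service
After=network-online.target
Requires=network-online.target

[Service]
User=$run_as
Group=$run_as
ExecStart=$install_dir/bin/cassandra -f

[Install]
WantedBy=multi-user.target
EOF
}

# Example (run as root):
#   write_cassandra_unit /etc/systemd/system/cassandra.service
```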
Notice that we configured the user and group as cassandra in the Service section. This will tell SystemD to run our service as the correct user that we configured earlier. Next we will start our new service and enable it to start on system boot.
# systemctl daemon-reload
# systemctl start cassandra.service
# systemctl enable cassandra.service
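A freshly started Cassandra node can take a little while before it accepts client connections, so a script that runs cqlsh right after starting the service may fail. The sketch below shows one way to wait for it: a generic retry loop that reruns a command until it succeeds. The function name retry_until and the attempt and delay values in the example are assumptions chosen for illustration.

```shell
#!/bin/sh
# Hypothetical helper: run a command until it succeeds.
#   $1 = maximum number of attempts
#   $2 = seconds to sleep between attempts
#   remaining arguments = the command to run
retry_until() {
    attempts=$1
    delay=$2
    shift 2
    i=1
    while [ "$i" -le "$attempts" ]; do
        if "$@" >/dev/null 2>&1; then
            return 0
        fi
        # Do not sleep after the final failed attempt.
        if [ "$i" -lt "$attempts" ]; then
            sleep "$delay"
        fi
        i=$((i + 1))
    done
    return 1
}

# Example: poll for up to about a minute until cqlsh can run a trivial query.
#   retry_until 30 2 /usr/local/cassandra/bin/cqlsh localhost \
#       -e 'select cluster_name from system.local;' && echo "Cassandra is up"
```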
To connect to our Cassandra server from other machines we need to configure it. We will be changing settings in the main Cassandra configuration file, located at /usr/local/cassandra/conf/cassandra.yaml. Open this file and make the following changes.
First, we will need to change the listening address for Cassandra (the listen_address setting). Set this to the IP address of your server. In my case this is 192.168.1.47.
Next, we need to set the RPC listen address (the rpc_address setting) so that remote connections from cqlsh will work. Again, we set this to the IP address of our server.
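The two address changes can also be scripted instead of made in an editor. The following is a rough sketch, assuming both settings appear as uncommented top-level lines in cassandra.yaml, as they do in the stock file. The function name set_yaml_value is a hypothetical helper, and it only handles simple values that contain no | or & characters.

```shell
#!/bin/sh
# Hypothetical helper: set a top-level "key: value" line in a YAML file.
# Only an uncommented line that starts with "key:" is changed.
#   $1 = file, $2 = key, $3 = new value (must not contain | or &)
set_yaml_value() {
    file=$1
    key=$2
    value=$3
    scratch=$(mktemp) || return 1
    # Write to a temporary file first, then copy the result back so the
    # original file keeps its owner and permissions.
    sed "s|^$key:.*|$key: $value|" "$file" > "$scratch" && cat "$scratch" > "$file"
    status=$?
    rm -f "$scratch"
    return "$status"
}

# Example, using the address from this article (keep a backup first):
#   conf=/usr/local/cassandra/conf/cassandra.yaml
#   cp "$conf" "$conf.bak"
#   set_yaml_value "$conf" listen_address 192.168.1.47
#   set_yaml_value "$conf" rpc_address 192.168.1.47
```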
Note: Don’t change the cluster name (cluster_name) at this point. We have already started the database, so Cassandra has saved the original name, and it will throw an exception and refuse to start if the configured name no longer matches.
Lastly, we need to update the seeds parameter:
seed_provider:
    # Addresses of hosts that are deemed contact points.
    # Cassandra nodes use this list of hosts to find each other and learn
    # the topology of the ring. You must change this if you are running
    # multiple nodes!
    - class_name: org.apache.cassandra.locator.SimpleSeedProvider
      parameters:
          # seeds is actually a comma-delimited list of addresses.
          # Ex: "<ip1>,<ip2>,<ip3>"
          - seeds: "192.168.1.47"
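The seeds entry is nested and indented, so it needs a slightly different pattern if you want to script the change. This is a sketch under the assumption that the file contains a single uncommented "- seeds:" line. The function name set_seeds is hypothetical.

```shell
#!/bin/sh
# Hypothetical helper: replace the seed list in a cassandra.yaml-style file.
#   $1 = path to the file, $2 = comma-separated seed addresses
set_seeds() {
    file=$1
    seeds=$2
    scratch=$(mktemp) || return 1
    # Match the "- seeds:" entry at any indentation and keep that indentation.
    # Comment lines are untouched because they start with "#", not "-".
    sed "s|^\([[:space:]]*- seeds:\).*|\1 \"$seeds\"|" "$file" > "$scratch" \
        && cat "$scratch" > "$file"
    status=$?
    rm -f "$scratch"
    return "$status"
}

# Example, using the single-node address from this article:
#   set_seeds /usr/local/cassandra/conf/cassandra.yaml "192.168.1.47"
```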
Save the file and exit.
Now restart your Cassandra service:
# systemctl restart cassandra
You are now able to connect to your Cassandra database remotely. Simply download the Cassandra package on your development system just like we did on the server, then connect by passing the hostname or IP address of the server to cqlsh. Here is an example that I ran in my lab:
~/Downloads/cassandra$ bin/cqlsh cassandra.admintome.lab
Connected to Test Cluster at cassandra.admintome.lab:9042.
[cqlsh 5.0.1 | Cassandra 3.11.2 | CQL spec 3.4.4 | Native protocol v4]
Use HELP for help.
cqlsh> select cluster_name, listen_address from system.local;

 cluster_name | listen_address
--------------+----------------
 Test Cluster |   192.168.1.47

(1 rows)
cqlsh>
Cassandra is now installed on our SMACK stack cluster and ready to be used. In the next post in my Fast Data Architecture series, we will write a PySpark application that takes the data we collected from Google Analytics, formats it, and saves it to our new Cassandra database so that we can begin performing data analytics on it.