Cassandra vs MongoDB: Which is better for Big Data?

(Last Updated On: August 10, 2018)

In this post we will compare Apache Cassandra and MongoDB.  Both systems are used to store big data, but they do it very differently.

If you are on the fence as to which database you want as the sink for your big data pipeline, then hopefully this post will give you a good idea of what Cassandra and MongoDB can do for you.

Cassandra vs MongoDB Introduction

We will compare Apache Cassandra and MongoDB to see which one fits your needs.

Both solutions store data for you, but they do it in very different ways.  Cassandra stores data using something very similar to database tables, and MongoDB stores data using “documents.”

We will start by showing the similarities between the two.

Apache Cassandra

Obviously, both systems store data for you so that’s our first similarity.

Both systems store data in a distributed manner.

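Cassandra distributes data using the PRIMARY KEY.  More precisely, the first part of the primary key, called the partition key, decides which partition a row belongs to.  If the primary key is just that one column, each primary key value gets its own partition, which means you can only store one row of data per partition.

That is not very useful if you have very large amounts of data consisting of thousands or millions of rows.  To accommodate this, Cassandra tables can have a CLUSTERING KEY as part of the primary key.  Its values must be unique within a partition, and that gives Cassandra the ability to store multiple rows per partition.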

Take for example, this diagram of a Cassandra table representing purchases at a retail store.

[Diagram: Cassandra vs MongoDB: Cassandra Data Modelling]

This shows that each store location has an ID and this becomes the PRIMARY KEY.  But remember that on its own this only gives us one row per store, because each PRIMARY KEY value is assigned its own partition.

Cassandra takes that primary key value, hashes it, and uses the result to pick the node that stores the data.  This is how Cassandra can look up values so quickly.

To get more than one row you need to have a CLUSTERING KEY that contains unique values.  Rows that share a store ID but have different clustering key values are stored together on the same node.
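To make the hashing step concrete, here is a minimal Python sketch.  It is not how Cassandra is implemented: the node names are made up, MD5 stands in for the real hash function (Cassandra's default partitioner uses Murmur3), and a simple modulo replaces Cassandra's token ranges.  It only illustrates the idea that the same key always hashes to the same node.

```python
import hashlib

NODES = ["node-a", "node-b", "node-c"]  # hypothetical node names


def node_for(partition_key):
    """Pick a node for a key by hashing it (illustration only)."""
    # Hash the key to a large integer. MD5 is just a stand-in hash here.
    digest = hashlib.md5(str(partition_key).encode("utf-8")).digest()
    token = int.from_bytes(digest[:8], "big")
    # Simplification: modulo over the node list instead of token ranges.
    return NODES[token % len(NODES)]


for store_id in (1, 2, 3, 4):
    print(store_id, "->", node_for(store_id))
```

Because the node is computed from the key itself, a lookup by key can go straight to the right node without searching the others.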
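As a rough mental model, you can picture a partition as a dictionary of rows keyed by the clustering key.  The Python sketch below assumes a hypothetical purchases table keyed by store_id (the partition) and purchase_id (the clustering key); both column names are invented for illustration and are not taken from the diagram.

```python
from collections import defaultdict

# Hypothetical in-memory model: store_id selects the partition,
# purchase_id (the clustering key) identifies a row inside it.
partitions = defaultdict(dict)


def insert(store_id, purchase_id, row):
    # Writing the same (store_id, purchase_id) pair twice overwrites the row,
    # which is why clustering values must be unique within a partition.
    partitions[store_id][purchase_id] = row


def rows_for_store(store_id):
    # Return the rows of one partition, ordered by clustering key.
    part = partitions.get(store_id, {})
    return [part[key] for key in sorted(part)]


insert(1, 2, {"item": "soda", "cost": 0.99})
insert(1, 1, {"item": "toothpaste", "cost": 4.99})
insert(1, 1, {"item": "toothpaste", "cost": 5.49})  # same key: overwrites
insert(2, 1, {"item": "soda", "cost": 0.99})

print(rows_for_store(1))
```

Store 1 ends up with two rows rather than three, because the second write to purchase 1 replaced the first.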

This makes data modelling a little more involved than you are probably used to.  You will find (as I did the hard way) that you can’t just create tables like you would in a normal SQL system and be able to query the data the same way.

Cassandra is the option I decided to go with as the data sink for my logging pipeline, which consists of Kafka, Cassandra and a Python application I wrote.

Next, we will see how MongoDB compares.

MongoDB

MongoDB is a NoSQL database system.  I have heard it called No SQL, Non-SQL and Non-relational SQL, but essentially what it means is that the data is not stored in relational tables.  In MongoDB's case it is stored as documents made up of key/value pairs.

MongoDB stores data in JSON-like objects that are called documents.

Here are two sample documents for our retail example above.

{
  "item": "toothpaste",
  "cost": 4.99
},
{
  "item": "soda",
  "cost": 0.99
}

Each document plays the role of a row.

A group of documents is called a collection, which plays the role of a table.

The really cool part about MongoDB is that, by default, there is no set schema.  You can have documents that don’t match the same structure in the same collection.
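Here is a small Python sketch of what that flexibility looks like in practice.  It models a collection as a plain list of dictionaries instead of talking to a real MongoDB server, and the extra fields ("coupon", "amount") are made up for illustration.

```python
# A hypothetical "purchases" collection modelled as a plain list of dicts.
purchases = [
    {"item": "toothpaste", "cost": 4.99},
    {"item": "soda", "cost": 0.99, "coupon": "SUMMER"},  # extra field
    {"item": "gift card", "amount": 25},  # no "cost" field at all
]


def total_cost(docs):
    # Nothing enforces a schema, so the reader has to tolerate
    # documents where a field is missing.
    return sum(doc.get("cost", 0) for doc in docs)


print(round(total_cost(purchases), 2))
```

The flip side of having no schema is visible in total_cost: the code that reads the data has to decide what a missing field means.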

Interacting with a MongoDB database from your favorite language is a breeze because most languages support JSON.  Each document read from MongoDB typically ends up as a native JSON-like structure in your program, such as a dictionary in Python.
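To show how natural a document feels in code, the sketch below uses only Python's standard json module on one of the sample documents.  It does not connect to MongoDB; it just shows the document being handled as an ordinary dictionary.

```python
import json

# One of the sample documents, as JSON text.
raw = '{"item": "toothpaste", "cost": 4.99}'

doc = json.loads(raw)  # JSON text -> Python dict
doc["cost"] = 5.49     # update a field like any other dict entry
doc["aisle"] = 7       # add a new field (hypothetical) with no schema change

print(json.dumps(doc))  # Python dict -> JSON text
```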

This makes it super easy to get started with MongoDB.

Just like Cassandra, MongoDB is also a distributed storage system.

MongoDB distributes the documents among the different nodes in the cluster using a SHARD KEY, which is very similar to the PRIMARY KEY of Cassandra outlined above.  It uses the SHARD KEY to know which node to store the data on.

The performance of a MongoDB cluster greatly depends on the shard key you select.  You can read more about this on the MongoDB Sharding Documentation.
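The Python sketch below illustrates why the choice matters.  It assumes range-based sharding on a numeric store_id with made-up chunk boundaries and shard names; MongoDB also supports hashed shard keys, which this sketch does not cover.  It is a toy model, not MongoDB's actual routing code.

```python
from bisect import bisect_right
from collections import Counter

# Hypothetical chunk boundaries on the shard key (store_id).
# shard-a holds keys below 100, shard-b holds 100 to 199,
# shard-c holds 200 and up.
BOUNDS = [100, 200]
SHARDS = ["shard-a", "shard-b", "shard-c"]


def shard_for(store_id):
    # Find which range the key falls into and return that shard.
    return SHARDS[bisect_right(BOUNDS, store_id)]


# Keys spread across the ranges are balanced over the shards...
spread = Counter(shard_for(k) for k in range(0, 300))
# ...but ever-increasing keys all land on the last shard.
increasing = Counter(shard_for(k) for k in range(200, 300))

print(dict(spread))
print(dict(increasing))
```

With ranges like these, a shard key that only ever grows (a timestamp or a counter, for example) sends every new write to the same shard, while a key whose values are spread out shares the load.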

Conclusion

As you can see, both systems can store your big data in a distributed manner, but they do it very differently.

Hopefully, this post helped to clear up which one is better for your situation.

 
