Data Consistency in Apache Cassandra — Part 1

Rajendra Uppal
Software Architecture
4 min readAug 24, 2017

--

For a quick introduction on what Apache Cassandra is, take a look here. Consistency is a significantly large topic to cover in one part. So I’ll be completing it in 3 parts. This first part defines consistency in general, write consistency, read consistency, consistency levels (CL), immediate, eventual and tunable consistency.

Consistency

The topic and concept of consistency is very important when you work with a distributed database like Cassandra. When you’re working with a database which runs on only one server, consistency is a non-issue. But when you’re running on multiple servers that can span multiple racks and multiple data centres, you can always run into issues where data on one server or data on one replica node is different from data on other replica node. So, what consistency technically means is that it refers to a situation where all the replica nodes have the exact same data at the exact same point in time.

Consistency Level (CL): is the number of replica nodes that must acknowledge a read or write request for the whole operation/query to be successful.

Write CL controls how many replica nodes must acknowledge that they received and wrote the partition.

Read CL controls how many replica nodes must send their most recent copy of partition to the coordinator.

Write Consistency

Write consistency means having consistent data (immediate or eventual) after your write query to your Cassandra cluster. You can tune the write consistency for performance (by setting the write CL as ONE) or immediate consistency for critical piece of data (by setting the write CL as ALL) Following is how it works:

  1. A client sends a write request to the coordinator.
  2. The coordinator forwards the write request (INSERT, UPDATE or DELETE) to all replica nodes whatever write CL you have set.
  3. The coordinator waits for n number of replica nodes to respond. n is set by the write CL.
  4. The coordinator sends the response back to the client.

Read Consistency

Read consistency refers to having same data on all replica nodes for any read request. Following is how it works:

  1. A client sends a read request to the coordinator.
  2. The coordinator forwards the read (SELECT) request to n number of replica nodes. n is set by the read CL.
  3. The coordinator waits for n number of replica nodes to respond.
  4. The coordinator then merges (finds out most recent copy of written data) the n number of responses to a single response and sends response to the client.

Read CL = ALL gives you immediate consistency as it reads data from all replica nodes and merges them, means keeps the most current data.

Read CL = ONE gives you benefit of speed, Cassandra only contacts one closest/fastest replica node, so throughput of the read request will be lower so performance will be higher. Also, it might so happen that 2 out of 3 replica nodes might be down or query might be failed and you will still get a result because CL = ONE, so you have highest availability. For all these benefits, the price you pay is lower consistency. So, your consistency guarantees are much lower.

Read CL = QUORUM (Cassandra contacts majority of the replica nodes) gives you a nice balance, it gives you high performance reads, good availability and good throughput.

Immediate Consistency vs. Eventual Consistency

So, with consistency in Cassandra, you have two core types of consistency. immediate consistency and eventual consistency.

Immediate consistency: is having the identical data on all replica nodes at any given point in time.

Eventual consistency: by controlling our read and write consistencies, we can allow our data to be different on our replica nodes, but our queries will still return the most correct version of the partition data.

What this means is that because we can choose between immediate and eventual consistency, we end up with a system that has tunable consistency. Tunable Consistency means that you can set the CL for each read and write request. So, Cassandra gives you a lot of control over how consistent your data is. You can allow some queries to be immediately consistent and other queries to be eventually consistent. That means, in your application, the data that requires immediate consistency, you can create your queries accordingly and the data for which immediate consistency is not required, you can optimize for performance and choose eventual consistency.

In next part 2, we will see how to achieve immediate and eventual consistency using different write and read consistency levels.

Comments and thoughts welcome. Cheers!

References:

  1. https://www.youtube.com/watch?v=hKLKpqY9UrY
  2. http://learn.exponential.io/p/cassandra-consistency

--

--

Rajendra Uppal
Software Architecture

Software Engineering Leader, Formerly at Microsoft, Adobe, Studied at IIT Delhi and IIT Kanpur. www.linkedin.com/in/rajendrauppal