Data Consistency in Apache Cassandra — Part 2

Published in

Software Architecture

3 min readAug 24, 2017

In part 1, I introduced the basics of consistency in general, write consistency, read consistency, consistency levels (CL), immediate, eventual and tunable consistency.

In this part 2, I will talk about how to achieve immediate and eventual consistency using different write and read consistency levels.

Source: http://guide.couchdb.org/draft/consistency.html

Cassandra has tunable consistency which means that not only on the database level, you can tune the immediate and eventual consistency of your data per query/operation by setting the read CL (consistency level) and write CL.

Immediate Consistency

Immediate Consistency is the ability of the database to always return the most recent data as we recall from part 1.

Immediate Consistency Formula

R + W > RF => immediate consistency

where R= number of nodes your data is read from.

W= number of nodes your data is written to.

RF= replication factor, how many nodes have exact same copy of your data.

Example: if you have replication factor as 3, i.e. 3 nodes are going to have exact same copy of your data, write CL as QUORUM, i.e. you are going to write to majority or 2 nodes and read CL as QUORUM, i.e. you are going to read from majority or 2 nodes, then you can see from the formula that you are going to have immediate consistency.

Eventual Consistency

Eventual Consistency is not having your data completely out of sync. EC is a situation where:

Cassandra cannot guarantee that all replica nodes will contain exact same copy of your data at the same time.
a query is NOT guaranteed to return the most recent data.

In practice, data is consistent on all nodes within a few milliseconds, you will get exact same data from your queries very often. But, you are NOT getting a guarantee. So, if you have an SLA that you need to meet for your application that requires that data is always fully consistent, then tune those queries for immediate consistency whereas for other queries where performance is more important, tune those queries for eventual consistency.

Achieving Immediate Consistency across Entire Cluster

There are 3 ways you can achieve IC across entire cluster:

1. Read optimized, set Write CL = ALL, Read CL = ONE

As you are setting write CL as ALL, you are going to write data to all nodes in your cluster which means you can read from any one of them, hence read CL = ONE and hence this configuration is read optimized.

2. Write optimized, set Write CL = ONE, Read CL = ALL

As you are writing to 1 node, this configuration is write optimized and as you are reading from all nodes, one is them is guaranteed to have the most recent data which will be found out by the coordinator and sent back to you.

3. Balanced, set Write CL = QUORUM, Read CL = QUORUM

QUORUM means you are writing/reading to/from simple majority number of the nodes in your cluster. This configuration is a great balance in this configuration. You get high performance reads and writes.

You’d never want to set write CL = ALL, read CL = ALL. Because it is redundant, you get no benefit of this configuration and you’re only going to reduce throughput, and performance as well as lower availability.

Achieving Immediate Consistency across local Data Center

Read optimized, set Write CL = ALL, Read CL = LOCAL_ONE
Write optimized, set Write CL = LOCAL_ONE, Read CL = ALL
Balanced, set Write CL = LOCAL_QUORUM, Read CL = LOCAL_QUORUM

If your cluster has multiple data centers in say A, B and C cities. Using this configuration, you can optimize for your users in a particular data center and their requests won’t go to other data centers.

Achieving Eventual Consistency

Entire Cluster, set Write CL = ONE, Read CL = ONE
Local Data Center, set Write CL = LOCAL_ONE, Read CL = LOCAL_ONE

This is configuration is performance optimized. As we know, data sync across all nodes is few milliseconds away, so you get great performance using this configuration. So, you give up few milliseconds of consistency here and get highest throughput, highest performance and highest availability.

In next part 3, we will go a bit deeper into each of the above configurations and will understand different consistency levels.

Comments and thoughts welcome. Cheers!