This blog post is part two of a multi-post series about using the Citus extension with Percona Distribution for PostgreSQL. You can read part one here: Scalable Solutions With Percona Distribution for PostgreSQL: Set Up Three PostgreSQL Database Instances.

Citus is an open-source extension for PostgreSQL that expands its capacity to manage databases of varying scales, ranging from individual nodes to extensively distributed database clusters. Citus maintains seamless integration with PostgreSQL.

Citus has many advantages, one of which is the “Sharding Technique” and this is what we are going to explore in this blog post. We will use the “Sharding Technique” with Percona Distribution for PostgreSQL.

This is how Citus defines sharding: “Sharding is a technique used in database systems and distributed computing to horizontally partition data across multiple servers or nodes. It involves breaking up a large database or dataset into smaller, more manageable parts called Shards. Each shard contains a subset of the data, and together, they form the complete dataset.”

postgresql sharding

Citus official website. “Distribute Tables”, URL

There are three types of tables in a Citus cluster, each serving a distinct purpose. We will be utilizing the Distributed Tables type.

“These appear to be normal tables to SQL statements but are horizontally partitioned across worker nodes.” – Citus

Percona Distribution for PostgreSQL is a collection of tools to assist you in managing your PostgreSQL database system: it installs PostgreSQL and complements it with a selection of extensions that enable solving essential practical tasks efficiently. Some of these extensions are HAProxy, Patroni, pgAudit, and much more.

The goal of this tutorial is to show a simplified example of how you might use Citus with Percona Distribution PostgreSQL 15.

Ready?

Let’s start with…

For this tutorial, I am using

  1. Operating System Ubuntu 20.04
  2. Percona Distribution for PostgreSQL 15 on Ubuntu

If you want to check the version of Percona PostgreSQL, you can use the ‘psql’ command-line utility for interacting with PostgreSQL databases. To do this, open a terminal and run the following SQL query: “SELECT version();” This will display the version of Percona PostgreSQL installed on your system.

You will have this as an output:

Installing Citus for Distributed PostgreSQL

Let’s install Citus version 12 in our three nodes of Percona Distribution for PostgreSQL.

Add Citus as a PostgreSQL library

Once Citus is installed, we will add “citus” as a shared library to preload when the PostgreSQL server starts. This should be done on all three nodes of Percona Distribution for PostgreSQL.

Restart Percona PostgreSQL to apply the changes.

Create the Citus extension

We will create and enable the Citus extension within the database. Make sure you run these commands with the “postgres” user, also on all three Percona Distribution for PostgreSQL nodes.

Now, let’s check if the Citus extension has been correctly created.

You will see an output similar to this:

Set up a multi-node Citus cluster

We start by setting the coordinator host. This is used in a Citus distributed database environment to set the IP address and port of the coordinator node. We will take the first node we created as a coordinator host.

This coordinator node manages the overall cluster and coordinates queries and data distribution among other nodes.

It is time to add each node to the Citus cluster. Make sure to run these commands with “postgres” user.

Check distributed worker nodes

Now, let’s list the distributed nodes in the cluster, including the main node.

Or you can also list the active Postgres Citus nodes in the cluster:

Once this is ready, we will log in to the main PostgreSQL and create our tables “users” and “events”

Creating the database

In this example, we create two tables:

“users” table with the following columns:

  • user_id: This column will be used for sharding.
  • user_name: Name of the user text type.
  • email:  Email of the user text type.

And  “events” table with the following columns:

  • event_id: A serial primary key to ensure unique event IDs. This column will be used for sharding.
  • user_id: This column is to register user’s id.
  • event_time: A timestamp to record when the event occurred.
  • data: A JSONB column for storing event-related data.

Creating distributed tables

Now, let’s distribute the “users” and “events” tables across shards placed locally or on the Postgres Citus nodes. In this case, we are distributing through the “user_id” field.

Insert data into our sharded tables

Now, insert data into the sharded table:

When you use the primary key (user_id) to create a distributed table in Citus, it means you’re effectively “hash-partitioning” the data based on the user_id column.  This means that events with the same user_id will be stored together on the same shard.

Citus takes the values in the ID column, applies a hash function to each value, and assigns each row to a shard based on the hash value. This process ensures that rows with similar ID values are distributed across multiple shards in a way that attempts to distribute the data evenly.

Testing and query analysis

Query the sharded table:

“SET citus.explain_all_tasks TO on”; is used to enable the “explain all tasks” feature in the Citus extension for PostgreSQL. This feature is particularly useful when you’re working with distributed queries in a Citus cluster.

“EXPLAIN ANALYZE”, is used to analyze and explain the execution plan of a SQL query in PostgreSQL. 

We can see all tasks, and it also shows all Postgres Citus nodes.

We will analyze the first task.

The output showcases the advantage of parallel processing. Citus demonstrated its ability to parallelize operations, distributing the workload across multiple nodes, in this case on the nodes on ports 5433 and 5434, resulting in remarkably fast query execution times. 

This was an example of how to use Citus to sharding in “events” table. The main advantage of sharding with Citus is its ability to scale PostgreSQL databases horizontally, spreading data across multiple nodes to accommodate growing data sets and high workloads.

Note: Sharding may not be necessary for all applications; it’s typically most beneficial when dealing with large-scale datasets and high-concurrency workloads.

Check more about Percona PostgreSQL and multi-node Citus, and visit our Percona Community Forum page in case you have questions. We are happy to help!

Percona Distribution for PostgreSQL provides the best and most critical enterprise components from the open-source community, in a single distribution, designed and tested to work together.

 

Download Percona Distribution for PostgreSQL Today!

Subscribe
Notify of
guest

0 Comments
Inline Feedbacks
View all comments