Scalable Solutions with Percona Distribution for PostgreSQL (Part 2): Using Citus

This blog post is part two of a multi-post series about using the Citus extension with Percona Distribution for PostgreSQL. You can read part one here: Scalable Solutions With Percona Distribution for PostgreSQL: Set Up Three PostgreSQL Database Instances.

Citus is an open-source extension for PostgreSQL that expands its capacity to manage databases of varying scales, ranging from individual nodes to extensively distributed database clusters. Citus maintains seamless integration with PostgreSQL.

Citus has many advantages, one of which is the “Sharding Technique” and this is what we are going to explore in this blog post. We will use the “Sharding Technique” with Percona Distribution for PostgreSQL.

This is how Citus defines sharding: “Sharding is a technique used in database systems and distributed computing to horizontally partition data across multiple servers or nodes. It involves breaking up a large database or dataset into smaller, more manageable parts called Shards. Each shard contains a subset of the data, and together, they form the complete dataset.”

postgresql sharding

Citus official website. “Distribute Tables”, URL

There are three types of tables in a Citus cluster, each serving a distinct purpose. We will be utilizing the Distributed Tables type.

“These appear to be normal tables to SQL statements but are horizontally partitioned across worker nodes.” – Citus

Percona Distribution for PostgreSQL is a collection of tools to assist you in managing your PostgreSQL database system: it installs PostgreSQL and complements it with a selection of extensions that enable solving essential practical tasks efficiently. Some of these extensions are HAProxy, Patroni, pgAudit, and much more.

The goal of this tutorial is to show a simplified example of how you might use Citus with Percona Distribution PostgreSQL 15.

Ready?

Let’s start with…

For this tutorial, I am using

Operating System Ubuntu 20.04
Percona Distribution for PostgreSQL 15 on Ubuntu
- We need three nodes of Percona Distribution for PostgreSQL. Check the “Scalable Solutions with Percona Distribution for PostgreSQL (Part 1)” blog post to set up three nodes.
- Citus-12.0
  - Citus usually installs PostgreSQL, but I already have Percona PostgreSQL installed, and in this blog, we will only install Citus.

If you want to check the version of Percona PostgreSQL, you can use the ‘psql’ command-line utility for interacting with PostgreSQL databases. To do this, open a terminal and run the following SQL query: “SELECT version();” This will display the version of Percona PostgreSQL installed on your system.

sudo su postgres
psql

1 2	sudo su postgres psql

You will have this as an output:

psql (15.4 - Percona Distribution)
Type "help" for help.

postgres=# SELECT VERSION();
                                                           version
------------------------------------------------------------------------------------------------------------------------------
 PostgreSQL 15.4 - Percona Distribution on x86_64-pc-linux-gnu, compiled by gcc (Ubuntu 9.4.0-1ubuntu1~20.04.1) 9.4.0, 64-bit
(1 row)

psql (15.4 - Percona Distribution)

Type "help" for help.

postgres=# SELECT VERSION();

version

------------------------------------------------------------------------------------------------------------------------------

PostgreSQL 15.4 - Percona Distribution on x86_64-pc-linux-gnu, compiled by gcc (Ubuntu 9.4.0-1ubuntu1~20.04.1) 9.4.0, 64-bit

(1 row)

Installing Citus for Distributed PostgreSQL

Let’s install Citus version 12 in our three nodes of Percona Distribution for PostgreSQL.

# Let's download the deb.sh script to install Citus
curl https://install.citusdata.com/community/deb.sh > add-citus-repo.sh

# Execute the script
sudo bash add-citus-repo.sh

# Install Citus package
sudo apt-get -y install postgresql-15-citus-12.0

# Let's download the deb.sh script to install Citus

curl https://install.citusdata.com/community/deb.sh > add-citus-repo.sh

# Execute the script

sudo bash add-citus-repo.sh

# Install Citus package

sudo apt-get -y install postgresql-15-citus-12.0

Add Citus as a PostgreSQL library

Once Citus is installed, we will add “citus” as a shared library to preload when the PostgreSQL server starts. This should be done on all three nodes of Percona Distribution for PostgreSQL.

sudo su postgres
psql
ALTER SYSTEM SET shared_preload_libraries=citus;

sudo su postgres

psql

ALTER SYSTEM SET shared_preload_libraries=citus;

Restart Percona PostgreSQL to apply the changes.

sudo systemctl restart postgresql

1	sudo systemctl restart postgresql

Create the Citus extension

We will create and enable the Citus extension within the database. Make sure you run these commands with the “postgres” user, also on all three Percona Distribution for PostgreSQL nodes.

sudo su postgres
psql
CREATE EXTENSION citus;

sudo su postgres

psql

CREATE EXTENSION citus;

Now, let’s check if the Citus extension has been correctly created.

# Checking Citus version
select citus_version();

1 2	# Checking Citus version select citus_version();

You will see an output similar to this:

                                           citus_version
----------------------------------------------------------------------------------------------------
 Citus 12.0.0 on x86_64-pc-linux-gnu, compiled by gcc (Ubuntu 9.4.0-1ubuntu1~20.04.1) 9.4.0, 64-bit
(1 row)

citus_version

----------------------------------------------------------------------------------------------------

Citus 12.0.0 on x86_64-pc-linux-gnu, compiled by gcc (Ubuntu 9.4.0-1ubuntu1~20.04.1) 9.4.0, 64-bit

(1 row)

Set up a multi-node Citus cluster

We start by setting the coordinator host. This is used in a Citus distributed database environment to set the IP address and port of the coordinator node. We will take the first node we created as a coordinator host.

This coordinator node manages the overall cluster and coordinates queries and data distribution among other nodes.

sudo su postgres
psql
SELECT citus_set_coordinator_host('172.31.88.41', 5432);

sudo su postgres

psql

SELECT citus_set_coordinator_host('172.31.88.41', 5432);

It is time to add each node to the Citus cluster. Make sure to run these commands with “postgres” user.

SELECT * from citus_add_node('172.31.86.26', 5433);
SELECT * from citus_add_node('172.31.89.45', 5434);

1 2	SELECT * from citus_add_node('172.31.86.26', 5433); SELECT * from citus_add_node('172.31.89.45', 5434);

Check distributed worker nodes

Now, let’s list the distributed nodes in the cluster, including the main node.

sudo su postgres
psql
postgres=# SELECT * FROM pg_dist_node;
 nodeid | groupid |   nodename    | nodeport | noderack | hasmetadata | isactive | noderole | nodecluster | metadatasynced | shouldhaveshards
--------+---------+---------------+----------+----------+-------------+----------+----------+-------------+----------------+------------------
      1 |       0 | 172.31.88.41 |     5432 | default  | t           | t        | primary  | default     | t              | f
      6 |       5 | 172.31.86.26 |     5433 | default  | t           | t        | primary  | default     | t              | t
      7 |       6 | 172.31.89.45 |     5434 | default  | t           | t        | primary  | default     | t              | t
(3 rows)

sudo su postgres

psql

postgres=# SELECT * FROM pg_dist_node;

--------+---------+---------------+----------+----------+-------------+----------+----------+-------------+----------------+------------------

1 | 0 | 172.31.88.41 | 5432 | default | t | t | primary | default | t | f

6 | 5 | 172.31.86.26 | 5433 | default | t | t | primary | default | t | t

7 | 6 | 172.31.89.45 | 5434 | default | t | t | primary | default | t | t

(3 rows)

Or you can also list the active Postgres Citus nodes in the cluster:

postgres=# SELECT * FROM citus_get_active_worker_nodes();
 node_name | node_port
-----------+-----------
 172.31.86.26 |      5433
 172.31.89.45 |      5434
(2 rows)

postgres=# SELECT * FROM citus_get_active_worker_nodes();

node_name | node_port

-----------+-----------

172.31.86.26 | 5433

172.31.89.45 | 5434

(2 rows)

Once this is ready, we will log in to the main PostgreSQL and create our tables “users” and “events”

Creating the database

```sqlsq
-- Create basic users table
CREATE TABLE users (
    user_id serial PRIMARY KEY,
    user_name text,
    email text
);
-- Create basic events table
CREATE TABLE events (
    event_id bigserial,
    user_id int,
    event_time timestamp,
    data jsonb,
    PRIMARY KEY (user_id, event_id)
);
```

```sqlsq

-- Create basic users table

CREATE TABLE users (

user_id serial PRIMARY KEY,

user_name text,

email text

);

-- Create basic events table

CREATE TABLE events (

event_id bigserial,

user_id int,

event_time timestamp,

data jsonb,

PRIMARY KEY (user_id, event_id)

);

```

In this example, we create two tables:

“users” table with the following columns:

user_id: This column will be used for sharding.
user_name: Name of the user text type.
email: Email of the user text type.

And “events” table with the following columns:

event_id: A serial primary key to ensure unique event IDs. This column will be used for sharding.
user_id: This column is to register user’s id.
event_time: A timestamp to record when the event occurred.
data: A JSONB column for storing event-related data.

Creating distributed tables

Now, let’s distribute the “users” and “events” tables across shards placed locally or on the Postgres Citus nodes. In this case, we are distributing through the “user_id” field.

SELECT create_distributed_table('users', 'user_id');
SELECT create_distributed_table('events', 'user_id');

1 2	SELECT create_distributed_table('users', 'user_id'); SELECT create_distributed_table('events', 'user_id');

Insert data into our sharded tables

Now, insert data into the sharded table:

-- Insert user data
INSERT INTO users (user_name, email)
VALUES
    ('user1', 'user1@example.com'),
    ('user2', 'user2@example.com'),
    ('user3', 'user2@example.com');

-- Insert event data
INSERT INTO events ( user_id, event_time, data)
VALUES
    (1, '2023-09-01 10:00:00', '{"temperature": 25.5, "humidity": 62}'),
    (2, '2023-09-01 11:30:00', '{"temperature": 22.0, "humidity": 56}'),
    (1, '2023-09-02 08:15:00', '{"temperature": 24.0, "humidity": 59}'),
    (3, '2023-09-08 07:06:00', '{"temperature": 25.8, "humidity": 66}'),
    (1, '2023-09-10 04:00:00', '{"temperature": 20.0, "humidity": 54}');
    (1, '2023-09-04 10:00:00', '{"temperature": 25.5, "humidity": 62}'),
    (2, '2023-09-05 11:30:00', '{"temperature": 22.0, "humidity": 50}'),
    (1, '2023-09-03 08:15:00', '{"temperature": 24.0, "humidity": 56}'),
    (3, '2023-09-06 07:06:00', '{"temperature": 25.8, "humidity": 65}'),
    (1, '2023-09-09 04:00:00', '{"temperature": 20.0, "humidity": 52}');

-- Insert user data

INSERT INTO users (user_name, email)

VALUES

('user1', '[email protected]'),

('user2', '[email protected]'),

('user3', '[email protected]');

-- Insert event data

INSERT INTO events ( user_id, event_time, data)

VALUES

(1, '2023-09-01 10:00:00', '{"temperature": 25.5, "humidity": 62}'),

(2, '2023-09-01 11:30:00', '{"temperature": 22.0, "humidity": 56}'),

(1, '2023-09-02 08:15:00', '{"temperature": 24.0, "humidity": 59}'),

(3, '2023-09-08 07:06:00', '{"temperature": 25.8, "humidity": 66}'),

(1, '2023-09-10 04:00:00', '{"temperature": 20.0, "humidity": 54}');

(1, '2023-09-04 10:00:00', '{"temperature": 25.5, "humidity": 62}'),

(2, '2023-09-05 11:30:00', '{"temperature": 22.0, "humidity": 50}'),

(1, '2023-09-03 08:15:00', '{"temperature": 24.0, "humidity": 56}'),

(3, '2023-09-06 07:06:00', '{"temperature": 25.8, "humidity": 65}'),

(1, '2023-09-09 04:00:00', '{"temperature": 20.0, "humidity": 52}');

When you use the primary key (user_id) to create a distributed table in Citus, it means you’re effectively “hash-partitioning” the data based on the user_id column. This means that events with the same user_id will be stored together on the same shard.

Citus takes the values in the ID column, applies a hash function to each value, and assigns each row to a shard based on the hash value. This process ensures that rows with similar ID values are distributed across multiple shards in a way that attempts to distribute the data evenly.

Testing and query analysis

Query the sharded table:

SET citus.explain_all_tasks TO on;
EXPLAIN ANALYZE SELECT * FROM events;

1 2	SET citus.explain_all_tasks TO on; EXPLAIN ANALYZE SELECT * FROM events;

“SET citus.explain_all_tasks TO on”; is used to enable the “explain all tasks” feature in the Citus extension for PostgreSQL. This feature is particularly useful when you’re working with distributed queries in a Citus cluster.

“EXPLAIN ANALYZE”, is used to analyze and explain the execution plan of a SQL query in PostgreSQL.

We can see all tasks, and it also shows all Postgres Citus nodes.

We will analyze the first task.

The output showcases the advantage of parallel processing. Citus demonstrated its ability to parallelize operations, distributing the workload across multiple nodes, in this case on the nodes on ports 5433 and 5434, resulting in remarkably fast query execution times.

This was an example of how to use Citus to sharding in “events” table. The main advantage of sharding with Citus is its ability to scale PostgreSQL databases horizontally, spreading data across multiple nodes to accommodate growing data sets and high workloads.

Note: Sharding may not be necessary for all applications; it’s typically most beneficial when dealing with large-scale datasets and high-concurrency workloads.

Check more about Percona PostgreSQL and multi-node Citus, and visit our Percona Community Forum page in case you have questions. We are happy to help!

Percona Distribution for PostgreSQL provides the best and most critical enterprise components from the open-source community, in a single distribution, designed and tested to work together.

Download Percona Distribution for PostgreSQL Today!

0 Comments

Inline Feedbacks

View all comments

MySQL 5.7
End of Life

Compare Percona to Leading Database Solutions

Software
Downloads

Product
Documentation

Resource Hub

Financial Services

Driving Database Success

Percona Blog

Percona Community Hub

Percona Events Hub

About Percona

Percona in the News

Our Customers

Our Partners

Careers

Contact Us

Scalable Solutions with Percona Distribution for PostgreSQL (Part 2): Using Citus

Installing Citus for Distributed PostgreSQL

Add Citus as a PostgreSQL library

Create the Citus extension

Set up a multi-node Citus cluster

Check distributed worker nodes

Creating the database

Creating distributed tables

Insert data into our sharded tables

Testing and query analysis

Related

Related Blog Articles

RECOMMENDED ARTICLES

Trying out the PostgreSQL pg_tde Tech Preview Release

Benchmarking MongoDB Performance on Kubernetes

Why MariaDB Is “Better” Than MySQL

MOST POPULAR ARTICLES

Auditing login attempts in MySQL

Deploy Django on Kubernetes With Percona Operator for PostgreSQL

MySQL “Got an error reading communication packet”

MySQL 5.7 End of Life

Compare Percona to Leading Database Solutions

Software Downloads

Product Documentation

Resource Hub

Financial Services

Driving Database Success

Percona Blog

Percona Community Hub

Percona Events Hub

About Percona

Percona in the News

Our Customers

Our Partners

Careers

Contact Us

Scalable Solutions with Percona Distribution for PostgreSQL (Part 2): Using Citus

Installing Citus for Distributed PostgreSQL

Add Citus as a PostgreSQL library

Create the Citus extension

Set up a multi-node Citus cluster

Check distributed worker nodes

Creating the database

Creating distributed tables

Insert data into our sharded tables

Testing and query analysis

Related

Share This Post!

Want to get weekly updates listing the latest blog posts?

Related Blog Articles

RECOMMENDED ARTICLES

Trying out the PostgreSQL pg_tde Tech Preview Release

Benchmarking MongoDB Performance on Kubernetes

Why MariaDB Is “Better” Than MySQL

MOST POPULAR ARTICLES

Auditing login attempts in MySQL

Deploy Django on Kubernetes With Percona Operator for PostgreSQL

MySQL “Got an error reading communication packet”

MySQL 5.7
End of Life

Software
Downloads

Product
Documentation