Disaster recovery is not optional for businesses operating in the digital age. With the ever-increasing reliance on data, system outages or data loss can be catastrophic, causing significant business disruptions and financial losses.

With multi-cloud or multi-regional PostgreSQL deployments, the complexity of managing disaster recovery only amplifies. This is where the Percona Operators come in, providing a solution to streamline disaster recovery for PostgreSQL clusters running on Kubernetes. With the Percona Operators, businesses can manage multi-cloud or hybrid-cloud PostgreSQL deployments with ease, ensuring that critical data is always available and secure, no matter what happens.

In this article, you will learn how to set up disaster recovery with Percona Operator for PostgreSQL version 2.

Overview of the solution

Operators automate routine tasks and remove toil. For standby clusters, the Operator provides the following options:

  1. pgBackRest repo-based standby
  2. Streaming replication
  3. Combination of (1) and (2)

We will review the repo-based standby, as it is the simplest option:

1. Two Kubernetes clusters in different regions, clouds, or running in hybrid mode (on-prem + cloud). One is Main, and the other is Disaster Recovery (DR).

2. In each cluster, there are the following components:

    1. Percona Operator
    2. PostgreSQL cluster
    3. pgBackRest
    4. pgBouncer

3. pgBackRest on the Main site streams backups and Write-Ahead Logs (WALs) to object storage.

4. pgBackRest on the DR site reads these backups from object storage and streams them to the standby cluster.

Configure main site

Deploy the Operator using your preferred method from our documentation. Once installed, configure the Custom Resource manifest so that pgBackRest starts using the object storage of your choice. Skip this step if you already have it configured.

Configure the backups.pgbackrest.repos section by adding the necessary configuration. The example below is for Google Cloud Storage (GCS):
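A minimal sketch of the relevant part of the Custom Resource; the cluster name main and the bucket name my-dr-bucket are placeholder values, and field names may vary between Operator versions, so verify them against the documentation:

```yaml
apiVersion: pgv2.percona.com/v2
kind: PerconaPGCluster
metadata:
  name: main          # placeholder cluster name
spec:
  backups:
    pgbackrest:
      configuration:
        - secret:
            name: main-pgbackrest-secrets   # holds the GCS credentials
      repos:
        - name: repo1
          gcs:
            bucket: "my-dr-bucket"          # placeholder bucket name
```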

main-pgbackrest-secrets contains the keys for GCS; please read more about the configuration in the backup and restore tutorial.

Once configured, apply the custom resource:
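Assuming the manifest is saved as cr.yaml (a placeholder file name), apply it with kubectl:

```shell
# Apply the Custom Resource for the Main cluster
kubectl apply -f cr.yaml
```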

The backups should appear in the object storage. By default, pgBackRest puts them into the pgbackrest folder.

Configure DR site

The configuration of the Disaster Recovery site is similar to the Main site, with the only difference being the standby settings.

The following manifest has standby.enabled set to true and points to the repoName where the backups are stored (GCS in our case):
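A sketch of the standby manifest; the cluster name, secret name, and bucket are placeholders, and the repo configuration must match the one used on the Main site:

```yaml
apiVersion: pgv2.percona.com/v2
kind: PerconaPGCluster
metadata:
  name: standby        # placeholder cluster name
spec:
  standby:
    enabled: true
    repoName: repo1    # must match the repo the Main site writes to
  backups:
    pgbackrest:
      configuration:
        - secret:
            name: standby-pgbackrest-secrets  # placeholder; holds GCS credentials
      repos:
        - name: repo1
          gcs:
            bucket: "my-dr-bucket"            # same bucket as the Main site
```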

Deploy the standby cluster by applying the manifest:
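Assuming the standby manifest is saved as standby-cr.yaml (a placeholder file name):

```shell
# Apply the Custom Resource for the standby cluster in the DR site
kubectl apply -f standby-cr.yaml
```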

Failover

If the Main site fails, or whenever you need to switch over, you can promote the standby cluster. Promotion makes the cluster writable, which means it starts pushing Write-Ahead Logs (WALs) to the pgBackRest repository. This can create a split-brain situation in which two primary instances write to the same repository. To avoid it, make sure the Main cluster is either deleted or shut down before promoting the standby cluster.

Once the primary is down or inactive, promote the standby by changing the corresponding section:
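A sketch of the change: flipping standby.enabled to false in the standby cluster's manifest promotes it. Field names may differ between versions; check the documentation:

```yaml
# In the standby cluster's Custom Resource
spec:
  standby:
    enabled: false   # was true; setting to false promotes the cluster
```

Re-apply the manifest (for example, with kubectl apply) for the change to take effect.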

Now you can start writing to the cluster.

Split brain

There might be a case where your old primary comes up and starts writing to the repository. To recover from this situation, do the following:

  1. Keep only one primary with the latest data running
  2. Stop the writes on the other one
  3. Take a new full backup from the primary and upload it to the repo
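One way to take the full backup in step 3 is an on-demand backup object. A sketch, assuming the surviving cluster is named standby and using the field names described in the backup and restore tutorial; verify them against your Operator version:

```yaml
apiVersion: pgv2.percona.com/v2
kind: PerconaPGBackup
metadata:
  name: full-backup-after-failover   # placeholder name
spec:
  pgCluster: standby                 # the cluster to back up
  repoName: repo1
  options:
    - --type=full                    # force a full (not incremental) backup
```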

Automating the failover

Automated failover consists of multiple steps and is outside of the Operator's scope. Still, there are a few steps you can take to reduce the Recovery Time Objective (RTO). To detect a failure, we recommend running monitoring from a third site that watches both the Main and DR sites. This way, you can be sure that Main has really failed and that it is not a network split situation.

Another aspect of automation is switching application traffic from Main to the standby after promotion. This can be done through various Kubernetes configurations and heavily depends on how your networking and application are designed. The following options are quite common:

  1. Global Load Balancer – various clouds and vendors provide their own solutions
  2. Multi-cluster Services (MCS) – available on most public clouds
  3. Federation or other multi-cluster solutions

Conclusion

Percona Operator for PostgreSQL provides high availability for database clusters by design, making it a robust and production-ready solution for multi-AZ deployments. At the same time, business continuity protocols require disaster recovery plans in place where your vital processes and applications can survive regional outages. In this blog post, we saw how Kubernetes and Operators can simplify your DR design. Try it out yourself, and let us know your experience at the Community Forum.

For more information, visit the Percona Operator for PostgreSQL v2 documentation page. For commercial support, please visit our contact page.

