To make data count and to keep cloud services running, companies and organizations must have highly available databases. That doesn’t mean downtime is nonexistent, but it must be limited so that significant or irreparable damage (as defined by the business or industry) does not occur.

This guide provides an overview of what high availability means, the components involved, how to measure high availability, and how to achieve it.

Defining high availability

In general terms, high availability refers to the continuous operation of a system with little to no interruption to end users in the event of hardware or software failures, power outages, or other disruptions. A basic high availability database system provides failover (preferably automatic) from a primary database node to redundant nodes within a cluster.

HA is sometimes confused with “fault tolerance.” Although the two are related, the key difference is that an HA system provides quick recovery of all system components to minimize downtime. Some disruption might occur, but it will be minimal. Fault tolerance aims for zero downtime and data loss. As such, fault tolerance is more expensive to implement because it requires dedicated infrastructure that completely mirrors the primary system.

When it comes to database reliability, the stakes are high. Database downtime can hurt or doom any company that depends on the internet to do business. More specifically, a 2022 ITIC survey found that downtime costs more than $300,000 per hour for 91% of small, mid-size, and large enterprises. Among just the mid-size and large respondents, 44% said a single hour of downtime could potentially cost them more than $1 million.

Introducing key parts of high availability

  • Data backup and recovery: A database backup is created and stored in another location so the system can be restored after a natural disaster, a cyber attack, or another disruptive event. (A minimal backup sketch follows this list.)
  • Data replication: Data is continually copied from one database to another to ensure that the system remains operational even if one database fails.
  • Clustering: Multiple nodes work together (and separately if necessary) to provide access to data, ensuring that the system remains operational even if one node fails. (More in the following sub-section.)
  • Load balancing: Requests are evenly distributed across multiple database servers, ensuring the system remains operational even if one server fails.
  • Automated failover: To keep the database operational and minimize downtime, it automatically switches to a backup server if the primary server fails.
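To make the first item a bit more concrete, here is a minimal sketch of a scripted logical backup for PostgreSQL. It assumes the standard pg_dump client tool is installed, that authentication is handled outside the script (for example, via a .pgpass file), and that the host, database, and path names are placeholders rather than recommendations.

```python
# A minimal backup sketch, assuming pg_dump is on the PATH and authentication
# is handled via .pgpass or similar; hosts, names, and paths are illustrative.
import subprocess
from datetime import datetime, timezone

def take_backup(host: str, dbname: str, user: str, backup_dir: str) -> str:
    """Dump one database to a timestamped custom-format archive file."""
    stamp = datetime.now(timezone.utc).strftime("%Y%m%dT%H%M%SZ")
    target = f"{backup_dir}/{dbname}-{stamp}.dump"
    subprocess.run(
        ["pg_dump", "-h", host, "-U", user, "-F", "c", "-f", target, dbname],
        check=True,  # raise if the dump fails so the failure can be alerted on
    )
    return target

if __name__ == "__main__":
    print(take_backup("db-primary.example.com", "appdb", "backup_user", "/backups"))
```

A real backup strategy would also verify the dump, copy it to a separate location or region, and periodically test restores.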

More about high availability clusters

HA clusters — also known as failover clusters — are groups of interconnected servers that work together to ensure that an application or service remains available to users. Redundancy and failover protection ensure that if one server fails or goes offline, another server automatically takes over its workload. Typically, the servers use a primary/replica configuration: one server is designated as the primary and handles all incoming requests, while the others are replicas that monitor the primary and take over its workload if it fails.

Additionally, most database setups these days include a read-only mode or other capabilities that help distribute the load. With PostgreSQL, when using physical (streaming) replication, all the replicas are in read-only mode. Logical replication can allow writes on the subscriber, but writing to both nodes at the same time is not recommended without a strong conflict-resolution strategy. Alternatively, you can make specific tables or an entire database effectively read-only by granting only read privileges or by setting the default_transaction_read_only parameter. With MySQL, you can set the whole instance to read-only using the read_only (and super_read_only) system variables, or you can restrict write operations by granting users read-only privileges on specific databases or tables. With MongoDB, you can assign read-only roles so that users can query data but cannot perform write operations.
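To make the read-only behavior concrete, the sketch below asks each PostgreSQL node whether it is the writable primary or a read-only replica using the built-in pg_is_in_recovery() function. It assumes the psycopg2 driver and placeholder connection details; a MySQL equivalent would check the read_only and super_read_only system variables instead.

```python
# A minimal sketch, assuming the psycopg2 driver and placeholder connection
# details; with PostgreSQL streaming replication, replicas report
# pg_is_in_recovery() = true and reject writes.
import psycopg2

NODES = ["pg-node1.example.com", "pg-node2.example.com", "pg-node3.example.com"]

def node_role(host: str) -> str:
    conn = psycopg2.connect(host=host, dbname="appdb", user="monitor", password="secret")
    try:
        with conn.cursor() as cur:
            cur.execute("SELECT pg_is_in_recovery();")
            (in_recovery,) = cur.fetchone()
            return "read-only replica" if in_recovery else "writable primary"
    finally:
        conn.close()

for host in NODES:
    print(host, "->", node_role(host))
```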

HA clusters are used in web hosting, email hosting, database hosting, and other services that require high availability.

How does high availability work?

High availability works through a combination of the following:

  • No single point of failure (SPOF): You must eliminate any single point of failure in the database environment, including any physical or virtual hardware the database system relies on whose failure would bring the database down.
  • Redundancy: Critical components of a database are duplicated so that if one component fails, the functionality continues by using a redundant component. For example, in a server cluster, multiple servers are used to host the same application so that if one server fails, the application can continue to run on the other servers.
  • Load balancing: Traffic is distributed across multiple servers to prevent any one component from becoming overloaded. Load balancers can detect when a component is not responding and put traffic redirection in motion.
  • Failure detection: Monitoring mechanisms detect failures or issues that could lead to failures. Alert mechanisms report those failures or issues so that they are addressed immediately.
  • Failover: This involves automatically switching to a redundant component when the primary component fails. If a primary server fails, a backup server can take over and continue to serve requests. (A simplified monitoring-and-failover sketch follows this list.)
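The sketch below is a deliberately simplified version of the monitoring-and-failover loop described in the last two items. The health check is a bare TCP probe and the promotion step is a placeholder; in practice this job belongs to proven tooling such as Patroni or Orchestrator, combined with quorum-based decision making.

```python
# A deliberately simplified monitoring-and-failover loop. is_healthy() is a
# crude TCP probe and promote_replica() is a placeholder; production systems
# use dedicated cluster managers and quorum-based decisions instead.
import socket
import time

PRIMARY = ("db-primary.example.com", 5432)
REPLICAS = [("db-replica1.example.com", 5432), ("db-replica2.example.com", 5432)]

def is_healthy(node, timeout=2.0) -> bool:
    """Crude health check: can we open a TCP connection to the database port?"""
    try:
        with socket.create_connection(node, timeout=timeout):
            return True
    except OSError:
        return False

def promote_replica(node) -> None:
    """Placeholder for the real promotion step (e.g., a cluster manager API call)."""
    print(f"Promoting {node[0]} to primary")

failures = 0
while True:
    if is_healthy(PRIMARY):
        failures = 0
    else:
        failures += 1
        # Require several consecutive failures to avoid flapping on a brief blip.
        if failures >= 3:
            healthy = [r for r in REPLICAS if is_healthy(r)]
            if healthy:
                promote_replica(healthy[0])
            break
    time.sleep(5)
```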

Components of high availability infrastructure

Multiple copies of data

Data redundancy helps prevent data loss due to hardware or software failures. By storing multiple copies of data, database administrators can recover lost or corrupted data and maintain data integrity. Redundancy is also critical for disaster recovery. In the event of a natural disaster, cyber attack, or other catastrophic event, redundant systems can help ensure that the database can be quickly restored to a functioning state.

Multiple servers or databases

By copying and maintaining data on multiple servers or databases, database administrators can ensure that there is always a backup copy available. They can use that replicated data to quickly restore the database to a functioning state in case of a system failure, attack, or other potentially damaging event.

Load balancers

Load balancers help distribute incoming traffic across multiple database servers, ensuring that no single server is overloaded with requests. This helps prevent the possibility of a single point of failure, which could cause downtime or slow performance. It also supports the flexibility and scalability of the database infrastructure.
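As a toy illustration of the idea, the sketch below rotates requests across the servers that currently pass a health check. The server names and health map are made up; production environments typically put a dedicated proxy such as HAProxy or ProxySQL in front of the database tier rather than application-side code like this.

```python
# A toy round-robin load balancer sketch with health filtering; server names
# and the health map are illustrative stand-ins for real health probes.
import itertools

SERVERS = ["db1.example.com", "db2.example.com", "db3.example.com"]
HEALTHY = {"db1.example.com": True, "db2.example.com": False, "db3.example.com": True}

def healthy_servers():
    """Stand-in for real health probes (TCP checks, replication-lag checks, ...)."""
    return [s for s in SERVERS if HEALTHY.get(s, False)]

_rr = itertools.cycle(range(len(SERVERS)))

def pick_server() -> str:
    """Return the next healthy server in round-robin order."""
    candidates = healthy_servers()
    if not candidates:
        raise RuntimeError("no healthy database servers available")
    return candidates[next(_rr) % len(candidates)]

for _ in range(4):
    print("routing request to", pick_server())
```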

Failover mechanisms

The primary purpose of failover — the automatic process of switching from a failed or offline database server to a standby or backup server — is to ensure user access and functionality of databases without interruption, even if there is a hardware or software failure. By implementing failover mechanisms, organizations can ensure that their critical applications and services remain available even in the event of a failure, minimizing the impact on their business operations.

How to achieve high availability

Not all companies are the same, nor are their requirements for HA. When planning your database HA architecture, the size of your company is a great place to start to assess your needs. For example, if you’re a small business, paying for a disaster recovery site outside your local data center is probably unnecessary and may cause you to spend more money than the data loss is worth. All companies, regardless of size, should consider how to strike a balance between availability goals and cost.

  • Startups and small businesses: Most startups and small businesses can achieve an effective HA infrastructure within a single data center on a local node. This base architecture keeps the database available for your applications in case the primary node goes down, whether that involves automatic failover in case of a disaster or planned switchover during a maintenance window.
  • Medium to large businesses: If you have enough additional budget, consider adding a disaster recovery site outside your local data center. This architecture spans data centers to add more layers of availability to the database cluster. It keeps your infrastructure available and your data safe and consistent even if a problem occurs in the primary data center. In addition to the disaster recovery site, this design includes an external layer of nodes. If communication between the sites is lost, the external node layer acts as a “source of truth” and decides which replica to promote as the new primary. In doing so, it prevents a split-brain scenario, keeps the cluster healthy, and keeps the infrastructure highly available. (A simple quorum check illustrating this idea is sketched after this list.)
  • Enterprises: An enterprise HA architecture adds another layer of insurance for companies that offer globally distributed services and could face devastating revenue and reputational losses in the event of substantial downtime. This architecture features two disaster recovery sites, adding more layers for the infrastructure to stay highly available and keep applications up and running. Based on tightly coupled database clusters spread across data centers and geographic availability zones, this architecture can offer 99.999% uptime when used with synchronous streaming replication, the same hardware configuration in all nodes, and fast internode connections.
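The “source of truth” layer mentioned above usually comes down to a majority (quorum) vote: a node should hold or accept the primary role only if it can reach more than half of the voting members. Here is a minimal sketch of that rule, with made-up node names.

```python
# A minimal quorum sketch: a node should only accept (or keep) the primary role
# if it can reach a strict majority of voting members. Node names and the
# reachability examples are illustrative.
VOTING_MEMBERS = ["dc1-node", "dc2-node", "witness-node"]

def has_quorum(reachable: set) -> bool:
    """True if the reachable members (including ourselves) form a majority."""
    return len(reachable) > len(VOTING_MEMBERS) // 2

# Example: the primary data center is isolated and can only see itself.
print(has_quorum({"dc1-node"}))                    # False -> demote / refuse writes
print(has_quorum({"dc2-node", "witness-node"}))    # True  -> safe to promote in DC2
```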

How to measure high availability

High availability does not guarantee 100% uptime, but it allows you to get pretty close. Within IT, the gold standard for high availability is 99.999%, or “five-nines” of availability, but the level of HA needed really depends on how much downtime you can bear. Streaming services, for example, run mission-critical systems in which excessive downtime could result in significant financial and reputational losses for the business. But many organizations can tolerate a few minutes of downtime without negatively affecting their end users.

The following shows the maximum downtime for each level of availability, from “two nines” to “five nines”:

  • 99% (“two nines”): up to 3.65 days of downtime per year
  • 99.9% (“three nines”): up to 8.77 hours of downtime per year
  • 99.99% (“four nines”): up to 52.60 minutes of downtime per year
  • 99.999% (“five nines”): up to 5.26 minutes of downtime per year

Here are additional metrics used to determine the reliability of a database, make adjustments that minimize downtime, and set benchmarks for meeting business continuity requirements; a short calculation sketch follows the list. The desired value or range for each depends on the specific needs and risk tolerance of the company or organization that uses it:

  • Mean time between failures (MTBF) — a measure of the average time between two failures of a database or component. The MTBF is calculated by dividing the total operating time of the system by the number of failures that have occurred during that time.
  • Mean downtime (MDT) — the average amount of time that a database system is unavailable due to planned or unplanned downtime. The MDT is calculated by adding up the total amount of time that the system was down over a given period and dividing it by the number of downtime incidents.
  • Recovery time objective (RTO) — the maximum time that a system can be down before it starts to have a significant effect on business operations. Organizations use the RTO to determine how quickly they must recover their systems to meet business continuity requirements.
  • Recovery point objective (RPO) — the point in time by which an organization must recover its data in case of a disaster or system failure. For example, an RPO of 1 hour means that an organization can tolerate a maximum data loss of 1 hour’s worth of data. To ensure that the RPO is met, organizations typically implement backup and recovery procedures that enable them to restore data to a specific point in time.
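The downtime levels and metrics above are straightforward arithmetic. The sketch below, using illustrative numbers only, converts an availability target into a yearly downtime budget and derives MTBF and MDT from a hypothetical incident history.

```python
# Availability arithmetic with illustrative numbers: convert an availability
# target into a yearly downtime budget and derive MTBF/MDT from incident data.
MINUTES_PER_YEAR = 365.25 * 24 * 60  # 525,960 minutes

def downtime_budget_minutes(availability_pct: float) -> float:
    return MINUTES_PER_YEAR * (1 - availability_pct / 100)

print(round(downtime_budget_minutes(99.99), 2))   # ~52.6 minutes per year
print(round(downtime_budget_minutes(99.999), 2))  # ~5.26 minutes per year

# Hypothetical incident history for one year (downtime minutes per incident).
incidents = [12.0, 3.5, 7.0]
total_downtime = sum(incidents)
uptime_minutes = MINUTES_PER_YEAR - total_downtime

mtbf_hours = (uptime_minutes / len(incidents)) / 60   # mean time between failures
mdt_minutes = total_downtime / len(incidents)         # mean downtime per incident
availability = uptime_minutes / MINUTES_PER_YEAR * 100

print(round(mtbf_hours, 1), round(mdt_minutes, 2), round(availability, 4))
```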

High availability architecture for open source databases

No matter where a database infrastructure is located (on-premises, public or private cloud environment, or hybrid environment) or what technology is used — be it MySQL, PostgreSQL, or MongoDB — it’s vital to eliminate single points of failure. To be prepared for disaster recovery and attain high availability, server redundancy must be built into the database architecture.

And for that redundancy to be beneficial, the design of the database infrastructure must incorporate failover mechanisms that duplicate essential database functions and components.

The considerations described above (and in other sections) must be addressed across the following areas if a high availability architecture is to be achieved:

  • Infrastructure: This is the physical or virtual hardware that database systems rely on to run. Without enough infrastructure (physical or virtualized servers, networking, etc.), there cannot be high availability.
  • Topology management: This is the software management related specifically to the database and managing its ability to stay consistent in the event of a failure.
  • Connection management: This is the software management related specifically to the networking and connectivity aspect of the database. Clustering solutions typically bundle a connection manager; however, in asynchronous clusters, deploying a connection manager is mandatory for high availability.
  • Backup and continuous archiving: These are extremely important if replication lags and a replica node cannot keep pace with the primary. The backed-up and archived files can also be used for point-in-time recovery if a disaster occurs.

Which database is best for high availability?

When choosing a database for high availability, some proprietary options offer attractive specialized features, including built-in replication, automatic failover, and advanced clustering. Those ready-to-go capabilities can simplify the implementation of high availability. Dedicated vendor support can also be a lure. Proprietary vendors can provide expertise, guidance, and timely resolutions that ensure the stability and availability of a database.

But with the right expertise — whether it’s from within your organization or through outside assistance — open source databases can provide the same benefits, as well as other advantages, including:

  • Cost-effectiveness: Open source software is typically free to use, which can significantly reduce the overall cost of establishing high availability. Conversely, proprietary databases often require expensive licenses and ongoing maintenance fees.
  • Flexibility: With open source databases, you have access to the source code, allowing you to modify and optimize the database to meet your specific high availability requirements. This flexibility can be crucial in designing a scalable architecture.
  • Community support: Large and active communities of users, developers, and DBAs can be invaluable when it comes to troubleshooting issues, finding solutions, and staying up-to-date with the latest developments in high availability techniques. The transparency of open code enhances security as potential vulnerabilities can be identified and patched more quickly by the community.
  • Freedom from vendor lock-in: With open source software, you are not tied to a specific vendor or locked into a particular technology stack. You have the freedom to choose and integrate various tools and technologies that best suit your high availability strategy.

HA is an important goal, but switching to a proprietary database solely for HA limits the other benefits of open source database software. In addition to enabling a strong database HA architecture, open source avoids costly licensing fees, offers data portability, gives you the freedom to deploy anywhere, anytime you want, and delivers great software designed by a community of contributors who prioritize innovation and quality.

Getting started with open source database high availability

There are numerous open source options available, but covering all of them here would not be practical. (For a more detailed examination of open source options, read The Ultimate Guide to Open Source Databases.)

Here, for this examination of high availability, we are focusing on three of the most popular open source options:

  • MySQL: This relational (table-based) database software supports asynchronous replication, which provides redundancy and allows for failover in case the primary database becomes unavailable. MySQL supports the creation of read replicas, which can distribute the read workload, improve performance, and provide high availability for read-intensive applications. Another feature (among many) of note is MySQL Cluster, a distributed, shared-nothing database clustering solution. It combines synchronous replication, automatic data partitioning, and node-level failover to provide high availability and scalability.
  • PostgreSQL: This relational database software supports numerous high availability options, including asynchronous streaming replication, logical replication, and tools such as repmgr and Patroni for automatic failover and cluster management. PostgreSQL also supports synchronous replication, in which the primary server waits for confirmation from at least one standby server before acknowledging a transaction. PostgreSQL point-in-time recovery uses continuous archiving to create a backup that allows you to recover the database to a specific state. (A simple replication-lag check is sketched after this list.)
  • MongoDB: While not open source (see our blog Is MongoDB Open Source? Is Planet Earth Flat? for more), we’ll categorize it here for easy reading. This non-relational (document-based) database uses replica sets to provide high availability. MongoDB’s replica sets maintain multiple copies of data across different nodes. The primary node receives write operations and replicates them to the secondary nodes asynchronously; write concern settings control how many nodes must acknowledge a write before it is confirmed. As a result, data is redundantly stored, reducing the risk of data loss.
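As one concrete way to watch the PostgreSQL streaming replication mentioned above, the primary exposes the pg_stat_replication view. The sketch below, assuming the psycopg2 driver, placeholder credentials, and PostgreSQL 10 or later, reports the state and replay lag of each connected replica.

```python
# A minimal sketch, assuming psycopg2, placeholder credentials, and
# PostgreSQL 10+ (which exposes replay_lag in pg_stat_replication).
# Run it against the primary node.
import psycopg2

conn = psycopg2.connect(host="db-primary.example.com", dbname="appdb",
                        user="monitor", password="secret")
with conn, conn.cursor() as cur:
    cur.execute("""
        SELECT application_name, client_addr, state, sync_state, replay_lag
        FROM pg_stat_replication;
    """)
    for name, addr, state, sync_state, replay_lag in cur.fetchall():
        print(f"{name} ({addr}): state={state} sync={sync_state} lag={replay_lag}")
conn.close()
```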

Some open source databases include great options for HA but generally don’t include a built-in HA solution. That means you’ll need to review the various extensions and tools available carefully. Those extensions and tools can be excellent for enriching their respective databases, but as your environment scales, the database might not keep pace with evolving, more complex requirements.

Greater complexity means that you and your team will need certain skills and knowledge. For example, it might be necessary to write application connectors, integrate multiple systems, and align business requirements with your HA solution. You should also understand open source databases, applications, and infrastructure. Consider the possibility of needing outside expertise to help you manage your HA architecture.

Comparing high availability to similar systems

High Availability vs. Disaster Recovery

High availability is focused on preventing downtime and ensuring that the database remains available, while disaster recovery is focused on recovering from a catastrophic event and minimizing negative effects on the business. High availability typically involves redundant hardware, software, application, and network components that can quickly take over if the primary component fails, while disaster recovery typically involves regular backups, replication to a secondary site, and a well-defined recovery plan that outlines the steps to be taken in the event of a disaster. Companies certainly can and should incorporate high availability and disaster recovery into database architecture.

High Availability vs. Fault Tolerance

High availability ensures that the database remains up and running almost all the time, while fault tolerance is focused on preventing data loss or corruption in the event of a fault or error. Redundancy and failover mechanisms figure prominently in high availability architecture, a design that gives users continued access even if there is a hardware or software failure. A fault-tolerant system is designed to detect and correct errors as they occur so that normal operation is not interrupted. This can mean that instead of failing completely, a system continues to operate, but possibly at a reduced level of efficiency.

High Availability vs. Redundancy

Redundancy is a key element of high availability, so it’s not an either-or situation. As stated, high availability refers to a database that’s functional almost all the time (i.e., “five nines” or some other standard depending on company size and requirements). In other words, there’s little or no downtime. Redundancy, a specific part of high availability, refers to the duplication of critical components or data to provide a backup in case the primary component or data is lost or corrupted. To establish redundancy, the same data must be accessible from at least two places. That means having a primary system and a secondary system. The secondary system includes failover monitoring, so when a failure occurs, data traffic is redirected to that secondary system. In addition to redundancy, high availability includes load balancing, failure detection, and failover, as well as ensuring there is no single point of failure (SPOF).

Best practices for achieving high availability

For medium, large, and enterprise-size companies (as well as startups and small companies, depending on resources), the following general practices can help when planning, implementing, maintaining, and testing high availability architectures. Please note that these are presented as general best practices. (For more detailed information, check out the Percona architecture and solution links in subsequent sections.) Best practices include:

Conduct research and identify objectives

Early on, determine what you need and what you can afford. Here are the early steps:

  • Perform analysis: A good place to start might be to conduct a Failure Mode Analysis (FMA). An FMA or similar analysis can identify possible failure points in the database infrastructure. This will help in determining the redundancy requirements of the different components.
  • Determine requirements: The level of HA that is needed depends on how much downtime a company can bear. For example, some companies that offer streaming services will suffer significant financial and reputational losses in just a couple of minutes of downtime. Other organizations can tolerate a few minutes of downtime without negatively affecting their end users.
  • Consider costs: Multiple factors will, of course, affect how much you should budget. Depending on your setup, costs can include:
    • Hardware devices (servers, storage devices, network switches, etc.)
    • Networking equipment (switches, routers, etc.)
    • Software for clustering, replication, load balancing, failover, etc.; with proprietary licensing, costs can increase significantly as you scale
    • Cloud egress fees that might be charged when moving data out of a vendor’s network
    • Staffing and training (an HA system requires specialized expertise)
    • Testing to ensure the system will function during likely and unlikely conditions
    • Scalability to allow for maximum growth

Focus on redundancy and failover

Whether a company’s database infrastructure is on-premises, in a public or private cloud environment, or is run in a hybrid environment, it is mission-critical that there are no single points of failure. To avoid them, achieve high availability, and be poised for disaster recovery, there must be server redundancy. And for redundancy to matter, the database infrastructure design must include failover. In other words, there must be mechanisms for putting the duplication of essential database functions and components into action.

To establish redundancy, the same data must be accessible from at least two places. That means having a primary system and a secondary system. The secondary system includes failover monitoring, so when a failure occurs, data traffic is redirected to that secondary system.

Use shared-nothing (or shared-none) clusters

High availability can be achieved using the concepts of “shared-all” and “shared-nothing” (also called “shared-none”) clusters. Open source technologies often favor shared-nothing clusters because they are not only simpler to implement and deploy but also easier to scale. With that setup, each node within a cluster stores, manages, and processes its own data, which minimizes the problems that can occur when nodes share access to, and the ability to modify, the same memory or storage.

The most common shared-nothing architecture has a primary node responsible for receiving all data updates and propagating the updates to the replica nodes with little to no interference or conflict with other nodes. (Each node has its own cache buffer.) If one instance fails, another instance in another node can fill the role entirely, and high availability remains.

In addition to preventing single points of failure and system-wide shutdowns, a big advantage of a shared-nothing architecture is scalability. You can easily scale a shared-nothing system by adding nodes. Since those nodes don’t share data, there won’t be dueling processes that could slow or halt operations.
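One way to picture shared-nothing ownership is simple hash partitioning: each node exclusively owns the keys that hash to it, so nodes never contend for the same data, and adding a node adds capacity. The sketch below is purely conceptual and is not the partitioning scheme of any particular database.

```python
# A toy illustration of shared-nothing data ownership via hash partitioning.
# Each node exclusively owns the keys that hash to it, so nothing is shared.
# Conceptual only; not how any specific database implements partitioning.
import hashlib

NODES = ["node-a", "node-b", "node-c"]

def owner(key: str, nodes=NODES) -> str:
    digest = int(hashlib.sha256(key.encode()).hexdigest(), 16)
    return nodes[digest % len(nodes)]

for key in ["customer:42", "order:1001", "order:1002"]:
    print(key, "is owned by", owner(key))

# Scaling out: with a fourth node, the same function spreads keys over 4 owners.
print(owner("order:1002", NODES + ["node-d"]))
```

Real systems favor consistent hashing or managed rebalancing so that adding a node moves only a fraction of the data, rather than remapping most keys as the naive modulo approach above would.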

Apply load balancing

In addition to incursions from the outside and errors generated in-house, unusually high customer use can overwhelm a company’s database operations. That’s why it’s essential to include load balancing in your design. In times of high use, load balancing redirects workloads to components of the infrastructure best suited to pick up the extra work.

To keep services — and the databases that deliver them — running smoothly, a tiered architecture of server clusters helps. In the event that a server fails, a replica server in a different cluster can assume the role of the failed server. The load balancer is the key; it serves as a traffic cop of sorts, directing and redirecting flow from end users/applications to servers where data will flow as intended. If a server fails, the load balancer redirects traffic to a replicated server in another cluster.

Use a container-orchestration platform

Containers can provide multi-cloud compatibility and cost-efficiency, and they are a great tool for isolating applications. But containers are not inherently faster than other approaches given the same resources, and they can be considerably more complex to troubleshoot than processes running on virtual machines (VMs) or bare metal. Still, companies find themselves adding more and more containers across multiple hosts.

To get real benefits from those containers, it’s advisable to use a system for orchestrating and managing containerized applications. Kubernetes is a platform that provides that system, enabling developers and DBAs to manage large numbers of containers across multiple hosts. Using Kubernetes can yield many benefits, including greater automation than traditional VMs, enhanced fault tolerance, and increased flexibility and scalability.
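As a small example of the kind of visibility an orchestration platform provides, the sketch below uses the official kubernetes Python client to report which database pods are ready. The namespace and label are hypothetical; in practice, Kubernetes probes and operators act on this readiness signal automatically.

```python
# A minimal sketch, assuming the official `kubernetes` Python client, a local
# kubeconfig, and a hypothetical namespace/label for the database pods.
from kubernetes import client, config

config.load_kube_config()  # or config.load_incluster_config() inside a pod
v1 = client.CoreV1Api()

pods = v1.list_namespaced_pod(namespace="databases", label_selector="app=appdb")
for pod in pods.items:
    statuses = pod.status.container_statuses or []
    ready = bool(statuses) and all(cs.ready for cs in statuses)
    print(f"{pod.metadata.name}: {'ready' if ready else 'not ready'}")
```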

Deploy in multiple regions

Sometimes there’s a misperception that “cloud” means safe. Just because you’ve entrusted your database with a cloud provider instead of hosting on-site doesn’t mean you won’t lose service. Cloud provider data centers go down, and so do entire regions of data centers. So whether you host your own servers, you use a cloud provider, or you use a combination of both, it’s wise to replicate databases across regions, not in the same region.

With replication across different geographic regions, you can fail over quickly in the event of an entire region going down.

Maintain deployment consistency

Using proven and secure code and processes is important in maintaining data integrity and reliability. Deployment consistency is essential in preventing the introduction of bugs and other intrusions that could lead to data corruption or data loss and could slow or interrupt applications and even entire systems. By maintaining database deployment consistency, organizations can ensure that their data is always up-to-date and accurate and that their applications perform reliably. Data consistency is especially important in financial and healthcare systems, where incorrect or outdated data can have serious and even fatal consequences.

Conduct end-to-end (E2E) testing

This practice tests the entire system to ensure that all database components and software applications (and any subsystems) work as intended — individually and in unison. E2E testing verifies that components meet performance, security, and usability requirements. It is often used to identify issues that arise from the interaction between different components of the system, such as when data is passed from one component to another. This type of testing requires a solid understanding of user behaviors and goals, which can be varied. E2E testing, therefore, can be time-consuming and potentially more expensive than some limited budgets allow.

Incorporate chaos engineering

Chaos engineering is testing that involves intentionally introducing failures into a software system to identify potential issues and weaknesses. By deliberately injecting problems throughout the database infrastructure, engineers can identify failure points and create solutions — avoiding costly downtime and the potential loss of frustrated customers.

In well-orchestrated chaos engineering, there’s an experimental version of the system and a control version. A barrage of problems and outright attacks are injected into the system. The technical team then reviews the logging data to determine how the incursions affected the system.
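Here is a toy version of that idea: wrap calls to a dependency so that a small fraction of them fail or slow down, then observe how retries, timeouts, and failover logic respond. The wrapper and the fake database call below are illustrative only; real chaos tooling adds safeguards, blast-radius limits, and detailed measurement.

```python
# A toy chaos-injection wrapper: a configurable fraction of calls is delayed or
# fails outright so you can observe how callers and failover logic behave.
import random
import time

def chaotic(func, failure_rate=0.1, max_delay_s=2.0):
    def wrapper(*args, **kwargs):
        roll = random.random()
        if roll < failure_rate:
            raise ConnectionError("chaos: injected failure")
        if roll < failure_rate * 2:
            time.sleep(random.uniform(0, max_delay_s))  # injected latency
        return func(*args, **kwargs)
    return wrapper

@chaotic
def query_database(sql: str) -> str:
    return f"result of {sql!r}"  # stand-in for a real database call

successes = errors = 0
for _ in range(20):
    try:
        query_database("SELECT 1")
        successes += 1
    except ConnectionError:
        errors += 1
print(f"{successes} succeeded, {errors} failed under injected chaos")
```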

Chaos engineering is designed to uncover unexpected, even shocking, surprises. Unlike targeted testing of traditional DevOps procedures, chaos engineering is widespread throughout the entire system. Since it’s performed on live systems, there is a chance that users can be briefly affected, but chaos engineering is a great motivator for following resilient design and maintenance practices. Most of all, chaos engineering is a way to fend off disaster.

Consider outside help

Consider the possibility that you might need outside expertise to help you plan, create, and manage an HA architecture. If an open source solution is right for you, but you need help implementing it, Percona recommends that you research to ensure the outside company is committed to the global open source community. Also, make sure you’ll have the freedom to innovate and to scale — up or down — as your company sees fit. Avoid unnecessary vendor lock-in. If you don’t use Percona, find another company that gives you those freedoms while helping you build the high availability infrastructure you need.

Percona architectures for highly available MySQL and PostgreSQL

Percona uses open source software to design architectures for high availability databases. Businesses of all sizes — small, medium, large, and enterprise — can use our designs to create their own architectures and ensure that applications and data stay available and in sync. If you need help, call on us for on-demand support as you implement an architecture. Or we can do it for you completely.

Here are resources (with viewable architectures) for help getting started:

High availability MySQL architectures

Check out Percona architecture and deployment recommendations, along with a technical overview, for a MySQL solution that provides a high level of availability and assumes heavy read/write application workloads.

Percona Distribution for MySQL: High Availability with Group Replication


If you need even more high availability in your MySQL database, check out Percona XtraDB Cluster (PXC).

High availability PostgreSQL architectures

Access a PostgreSQL architecture description and deployment recommendations, along with a technical overview, of a solution that provides high availability for mixed-workload applications.


Percona Distribution for PostgreSQL: High Availability with Streaming Replication


High availability support you can count on

Percona designs, configures, and deploys high availability databases so that applications always have access to essential data. Then, we apply the same level of expertise and commitment to ensuring your databases continue to run smoothly and continuously. We’ll help you ensure that applications stay up, databases stay bug-free, and your entire system runs at optimal performance levels.


Percona High Availability Database Support


FAQ

The following are commonly asked questions and short answers about high availability in databases. More detailed answers are presented in the sections above.

What does high availability mean?

High availability refers to a database’s ability to remain operational and accessible to users without significant downtime or interruptions. To achieve high availability, a database system must be designed with redundancy and failover mechanisms and have no single points of failure. High availability is particularly important in the e-commerce, financial, and healthcare industries.

What is an example of high availability?

Database replication is an example of high availability. Multiple copies of the same database are maintained on different servers so that if one server fails, the database can still be accessed from another server. Different techniques are used to achieve database high availability, including primary-replica replication or multi-primary replication.

How does high availability work?

High availability of a database infrastructure works through a combination of redundancy, load balancing, failure detection, failover, and having no single point of failure (SPOF). Those elements combine to ensure that data is always accessible and that there is minimal downtime or interruption in service, even in the face of hardware or software failures.

What does 99.99% availability mean?

99.99% availability (or “four nines”) means that a database is down no more than 52.60 minutes per year, 4.38 minutes per month, 1.01 minutes per week, and 8.64 seconds per day.

What does 99.999% availability mean?

99.999% availability (or “five nines”) means that a database is down no more than 5.26 minutes per year, 26.30 seconds per month, 6.05 seconds per week, and 864.00 milliseconds per day.
