Estimates vary, but most reports put the average cost of unplanned database downtime at approximately $300,000 to $500,000 per hour, or $5,000 to $8,000 per minute. With so much at stake, database high availability and fault tolerance have become must-haves, but many companies aren’t certain which one they actually need.

This blog article will examine shared attributes of high availability (HA) and fault tolerance (FT). We’ll also look at the differences, as it’s important to know what architecture(s) will help you best meet your unique requirements for maximizing data assets and achieving continuous uptime.

We’ll wrap it up by suggesting high availability open source solutions, and we’ll introduce you to support options for ensuring continuous high performance from your systems.

What does high availability mean?

High availability refers to the continuous operation of a database system with little to no interruption to end users in the event of system failures, power outages, or other disruptions. A basic high availability database system provides failover (preferably automatic) from a primary database node to redundant nodes within a cluster.

High availability does not guarantee 100% uptime, but an HA system enables you to minimize downtime to the point you’re almost there. Within IT, the gold standard for high availability is 99.999%, or “five-nines” of availability, but the level of HA needed really depends on how much system downtime you can bear. Streaming services, for example, run mission-critical systems in which excessive downtime could result in significant financial and reputational losses for the business. But many organizations can tolerate a few minutes of downtime without negatively affecting their end users.

The following table shows the amount of downtime for each level of availability.

Availability level          Downtime per year
99% ("two nines")           3.65 days
99.9% ("three nines")       8.77 hours
99.99% ("four nines")       52.6 minutes
99.999% ("five nines")      5.26 minutes
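
To see where those percentages come from, the arithmetic is simple: allowed downtime is (1 - availability) multiplied by the minutes in a year. Here is a quick sketch in Python; the availability levels shown are the standard industry "nines," not figures tied to any particular product.

    # Allowed downtime per year for common availability targets ("nines").
    # 365.25 days per year accounts for leap years; results are approximate.
    MINUTES_PER_YEAR = 365.25 * 24 * 60

    for availability in (0.99, 0.999, 0.9999, 0.99999):
        downtime_minutes = MINUTES_PER_YEAR * (1 - availability)
        print(f"{availability:.3%} availability -> "
              f"{downtime_minutes:,.2f} minutes of downtime per year")

Running it reproduces the figures above, including the 5.26 minutes per year allowed by "five nines."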

Implementing high availability systems: How does it work?

High availability works through a combination of key elements. Some of the most important elements include:

  • No single point of failure (SPOF): You must eliminate any SPOF in the database environment, including any potential for an SPOF in physical or virtual hardware.
  • Redundancy: Critical components of a database are duplicated so that if one component fails, the functionality continues by using a redundant component. For example, in a server cluster, multiple servers are used to host the same application so that if one server fails, the application can continue to run on the other servers.
  • Load balancing: Traffic is distributed across multiple servers to prevent any one component from becoming overloaded. Load balancers can detect when a component stops responding and redirect traffic to healthy components.
  • Failure detection: Monitoring mechanisms detect failures or issues that could lead to failures. Alerts report failures or issues so that they are addressed immediately.
  • Failover: This involves automatically switching to a redundant component when the primary component fails. If a primary server fails, a backup server can take over (a minimal sketch of this process follows the list).
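
To make failure detection and failover concrete, here is a deliberately minimal Python sketch of a monitor that polls a primary and promotes the first healthy replica after repeated check failures. The host names, port, and the promote_replica() helper are hypothetical placeholders, and the TCP probe stands in for a real health check; production systems should rely on proven tooling (Patroni, Orchestrator, and the like) rather than a hand-rolled loop.

    import socket
    import time

    # Hypothetical topology: one primary and an ordered list of replicas.
    PRIMARY = ("db-primary.example.internal", 5432)
    REPLICAS = [("db-replica-1.example.internal", 5432),
                ("db-replica-2.example.internal", 5432)]

    def is_healthy(host, port, timeout=2.0):
        """Crude health check: can we open a TCP connection to the database port?"""
        try:
            with socket.create_connection((host, port), timeout=timeout):
                return True
        except OSError:
            return False

    def promote_replica(host, port):
        """Placeholder for the real promotion step (e.g., pg_promote() or a cluster manager API)."""
        print(f"Promoting {host}:{port} to primary")

    def monitor(poll_interval=5, failures_before_failover=3):
        consecutive_failures = 0
        while True:
            if is_healthy(*PRIMARY):
                consecutive_failures = 0
            else:
                consecutive_failures += 1
                print(f"Primary health check failed ({consecutive_failures})")
                if consecutive_failures >= failures_before_failover:
                    # Fail over to the first healthy replica; a real tool would
                    # also fence the old primary to prevent split-brain.
                    for replica in REPLICAS:
                        if is_healthy(*replica):
                            promote_replica(*replica)
                            return
            time.sleep(poll_interval)

Note that the monitor itself must be made redundant in a real deployment; otherwise it simply becomes a new single point of failure.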

High availability architecture

Two previously mentioned absolutes — eliminating any SPOF and providing reliable, automatic failover — must apply across the following areas if an HA architecture is to be achieved:

  • Infrastructure — This is the hardware that database systems rely on. Without enough infrastructure (physical or virtualized servers, networking, etc.), there cannot be high availability.
  • Topology management — This is the software management related specifically to the database and managing its ability to stay consistent in the event of a failure.
  • Connection management — This is the software management related specifically to the networking and connectivity aspects of the database. Clustering solutions typically bundle a connection manager; in asynchronous clusters, however, deploying one is mandatory for high availability (a minimal client-side sketch follows this list).
  • Backup and continuous archiving — This becomes critically important when replication lag leaves a replica unable to keep pace with the primary. The backed-up and archived files can also be used for point-in-time recovery.
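
As a rough illustration of the connection management point above, the sketch below walks an ordered list of endpoints and returns a connection to the first node that is reachable and currently accepting writes. It assumes a PostgreSQL stack with the psycopg2 driver, and the host names, database, and user are hypothetical; in practice this job is usually delegated to a dedicated connection manager or proxy (HAProxy, ProxySQL, and similar) rather than application code.

    import psycopg2  # assumes a PostgreSQL client stack; substitute your own driver

    # Hypothetical endpoints; any one of them may be the current primary.
    ENDPOINTS = ["db-node-1.example.internal",
                 "db-node-2.example.internal",
                 "db-node-3.example.internal"]

    class NoWritablePrimary(Exception):
        pass

    def get_primary_connection(dbname="app", user="app", connect_timeout=2):
        """Return a connection to the first reachable node that accepts writes."""
        for host in ENDPOINTS:
            try:
                conn = psycopg2.connect(host=host, dbname=dbname, user=user,
                                        connect_timeout=connect_timeout)
                with conn.cursor() as cur:
                    cur.execute("SELECT pg_is_in_recovery()")
                    in_recovery = cur.fetchone()[0]
                if not in_recovery:
                    return conn      # this node is the writable primary
                conn.close()         # replica: keep looking
            except psycopg2.OperationalError:
                continue             # node down or unreachable: try the next one
        raise NoWritablePrimary("no endpoint is currently accepting writes")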

What is fault tolerance?

Fault tolerance refers to the ability of a database system to continue functioning in full, with no downtime, amid hardware or software failures. When a failure event occurs — such as a server failure, power outage, or network disruption — a fault-tolerant system will ensure that data integrity is preserved and that the system remains operational.

Implementing fault-tolerant systems: How does it work?

Here are some of the key components and characteristics of a fault-tolerant database environment:

  • Replication: Data is replicated across multiple nodes or servers, so if one node fails, the data remains accessible from replicas. Replication can be synchronous (ensuring immediate consistency) or asynchronous (allowing some delay).
  • Redundancy: The system stores multiple copies of data on separate devices. Redundancy provides backups and safeguards against data loss in case of hardware failures.
  • Error detection and correction: Techniques such as checksums, parity bits, and error-correcting codes are used to detect and correct errors that might occur during data transmission or storage.
  • Data integrity checks: Mechanisms such as checksums or hashing algorithms verify the integrity of stored data, ensuring that it remains consistent and accurate even in the presence of errors (see the checksum sketch after this list).
  • Specialized hardware: Purpose-built components detect a hardware fault and initiate an immediate switch to a redundant hardware component.
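
To illustrate the checksum idea behind the error detection and data integrity bullets, here is a minimal Python sketch that stores a SHA-256 digest alongside a payload and re-verifies it on read. Real databases push these checks much lower in the stack (storage engine page checksums, network-level checksums, error-correcting memory), so treat this purely as a conceptual example.

    import hashlib

    def store(payload: bytes) -> dict:
        """Store a payload together with its SHA-256 checksum."""
        return {"payload": payload,
                "checksum": hashlib.sha256(payload).hexdigest()}

    def verify(record: dict) -> bool:
        """Recompute the checksum on read and compare it with the stored value."""
        return hashlib.sha256(record["payload"]).hexdigest() == record["checksum"]

    record = store(b"account balance: 1042.17")
    assert verify(record)                              # intact data passes the check

    record["payload"] = b"account balance: 9042.17"    # simulate silent corruption
    assert not verify(record)                          # corruption is detected on read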

Architecture for fault-tolerant systems

Fault-tolerant information systems are designed to offer 100% availability. Some of the key elements of such designs include:

  • Backup hardware systems — Identical or functionally equivalent hardware runs in separate data centers and does not depend on any of the same physical power sources or infrastructure.
  • Backup software systems — As with hardware, mirrored copies of the software perform the same functions while residing elsewhere and drawing on a different power source, so no functionality is lost.
  • Containerization platforms — Today, companies are increasingly using containerization platforms such as Kubernetes to run multiple instances of the same software. By doing so, if an error or other problem forces the software to go offline, traffic can be routed to other instances, and application functionality continues. 
  • Alternate power sources — To withstand power outages, businesses have alternate sources on standby. Some businesses, for example, have powerful generators ready to roll if electricity is lost.
  • Alternate environments — Fault-tolerant information systems can include separate cloud and on-premises databases that perform the same functions. For companies without their own physical servers, replicated cloud databases are maintained in different regions to provide contingencies for power outages.

High availability vs. Fault tolerance: Understanding the differences

High availability and fault tolerance obviously have a lot in common in terms of setup, functionality, and purpose. So with all the similarities (replication, load balancing, redundant components, and more), what are the differences?

In general terms, the purpose of a high availability solution is to minimize downtime and provide continuous access to the database, while the purpose of fault tolerance is to maintain system functionality and data integrity at all times, including during failure events or faults. 

In financial terms, though, achieving fault tolerance is generally more costly, and the payoff often does not justify the expense. For example, it takes a lot of time, money, and expertise to completely mirror a system so that if one fails, the other takes over without any downtime. It can be considerably cheaper to establish a high availability database system in which there’s not total redundancy, but there is load balancing that results in minimal downtime (mere minutes a year).

Basically, it comes down to a company’s needs, what’s at stake, and what fits its budget. Here are key questions to consider when deciding between high availability and fault tolerance:

How much downtime can your company endure? 

With high availability, you can achieve the gold standard previously mentioned — and your database system will be available 99.999% of the time. 

With a fault-tolerant system, you can spend a lot more and perhaps do better than the 5.26 minutes of downtime (in an entire year) that come with the “five nines” described immediately above. But is it mission-critical to do better than 99.999% availability?

How much complexity and how many redundant components are you willing to take on? 

High availability systems typically have redundant servers or clusters to ensure that if one component fails, another can take over seamlessly. 

Fault tolerance incorporates fully redundant copies of hardware and software, along with backup power supplies. That hardware and software can detect failures and instantly switch to redundant components, but it also adds complexity, parts, and cost.

Which option fits your budget?

A high availability database system requires redundancy, load balancing, and failover. It also must ensure that there is no single point of failure. But depending on your needs, there are varying levels of high availability, and an architecture can be fairly simple. 

Fault-tolerant systems are designed with more complex architectures, requiring sophisticated hardware and software components, along with specialized expertise to design, configure, and maintain. The additional complexity adds to the cost of the system.

Percona HA expertise and architectures

At Percona, we advise that, in most cases, the attributes and cost-effectiveness of high availability are the way to go. It’s plenty available and plenty scalable. We’re also big advocates of using open source software that’s free of vendor lock-in and is backed by the innovation and expertise of the global open source community to achieve high availability. 

At the same time, we’re aware that achieving high availability using open source software takes extra effort. HA doesn’t come with the complexity and price tag of a fault-tolerant system, but it still requires considerable time and expertise. 

So instead of you having to select, configure, and test architecture for building an HA database environment, why not use ours? You can use Percona architectures on your own, call on us as needed, or have us do it all for you.

Check out these helpful whitepapers, which include ready-to-use architectures:

Percona Distribution for MySQL: High Availability With Group Replication

Percona Distribution for PostgreSQL: High Availability With Streaming Replication

High availability support you can count on

Percona designs, configures, and deploys high availability databases so that applications always have access to essential data. We’ll help you ensure that applications stay up, databases stay bug-free, and your entire system runs at optimal performance levels.


Percona High Availability Database Support

FAQs

What is the difference between DR vs. HA and FT?

Disaster recovery focuses on recovering from major disruptive events and restoring operations. Fault tolerance and high availability focus on building systems that can withstand failures and continue operating with minimal interruption (HA) and without interruption (FT).

What are the goals of high availability and fault tolerance?

The goal of employing fault tolerance is to maintain system functionality and data integrity at all times, including during failures or faults. The goal of employing a high availability solution is to minimize downtime and provide continuous access to the database. 

What is the difference between HA and DR?

High availability is used to prevent downtime due to individual component failures within a live environment, while disaster recovery focuses on recovering from major events that make the database inaccessible or non-functional.

1 Comment
Paul

In this article you summarize the differences between HA and FT but neglect to explain why a system designer would choose FT. The tradeoffs described don’t explain clearly to one unfamiliar with the nomenclature that an HA system can only operate in a degraded state during faults unless consistency can be safely repaired afterward. An HA system desiring consistency must reliably detect the cluster state and reject writes during any fault that might involve a network partition if it is to maintain consistency.

The biggest challenge in the FT vs HA tradeoff is determining where consistency is required and how strictly it must be maintained. Systems handling financial data usually can’t afford to be inconsistent, even temporarily, which is a large part of why they have been so slow to adopt distributed architectures, instead preferring DR where there is only one possible history of transactions.