This blog post focuses on failover and recovery scenarios inside the InnoDB Cluster and ClusterSet environment. To learn more about deploying these topologies, you can refer to the manuals – InnoDB Cluster and InnoDB ClusterSet setup.

In the topology below, we have two clusters (cluster1 and cluster2) connected via an asynchronous replication channel; combined, they are known as a ClusterSet topology. We are going to use this topology in all of our cases.
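For reference, fetching the ClusterSet status in MySQL Shell (JS mode) looks roughly like the sketch below; the root account and the sandbox ports are assumptions based on the topology used here:

```
\connect root@127.0.0.1:3308
cs = dba.getClusterSet()
cs.status()
// cluster1 is reported with clusterRole: "PRIMARY", cluster2 with clusterRole: "REPLICA"
```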

How failover happens inside a single InnoDB Cluster

  • Connect to any node of the first cluster (“cluster1”) via MySQL Shell and fetch the details, as shown below.
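A minimal sketch of this step (the root account is an assumption):

```
\connect root@127.0.0.1:3308
cluster1 = dba.getCluster()
cluster1.status()
// "127.0.0.1:3308" is currently reported as the primary (R/W) member
```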

  • Now perform a primary switchover from the instance (“127.0.0.1:3308”) to (“127.0.0.1:3309”).
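The switchover is done with the setPrimaryInstance() AdminAPI call:

```
cluster1.setPrimaryInstance("127.0.0.1:3309")
```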


  • Finally, the instance (“127.0.0.1:3309”) shows the “PRIMARY” status.

How to rejoin a lost instance

If, for some reason, an instance leaves the cluster or loses its connection and can’t automatically rejoin, we may need to rejoin it to the cluster manually by issuing the “Cluster.rejoinInstance(instance)” command. Below is a small example that demonstrates the usage of this command.

  • Create a blocker by stopping Group Replication on the instance (“127.0.0.1:3310”), as shown below.
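For example, from a session on that instance (the root account is an assumption):

```
\connect root@127.0.0.1:3310
\sql STOP GROUP_REPLICATION;
```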

  • Checking the cluster status again, we can see the instance (“127.0.0.1:3310”) is now showing “(MISSING)” status.
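From a session on any healthy member:

```
cluster1 = dba.getCluster()
cluster1.status()
// "127.0.0.1:3310" now reports "status": "(MISSING)"
```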

  • Now, add the instance (“127.0.0.1:3310”) again with the rejoinInstance() command.
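A sketch of the rejoin:

```
cluster1.rejoinInstance("127.0.0.1:3310")
```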

After the above operation, the instance shows “ONLINE” again.

How to recover a cluster from a quorum or vote loss

There are situations when running instances fail and a cluster loses its quorum (the ability to vote) due to insufficient members. This happens when enough of the instances that make up the cluster fail that the remaining members no longer form a majority able to vote on Group Replication operations. Once a cluster loses quorum, it can no longer process write transactions or change the cluster’s topology.

However, if any instance is online that contains the latest InnoDB Cluster metadata, it is possible to restore the cluster’s quorum from it.

Let’s see how we can use the “forceQuorumUsingPartitionOf” feature to restore the quorum from the remaining members.

  • First, we will fail the majority of the nodes with a simple “KILL” operation, as shown below.
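Assuming the sandbox deployment implied by the ports above, one way to do this is the sketch below; on real hosts, you would kill the mysqld processes directly:

```
dba.killSandboxInstance(3308)
dba.killSandboxInstance(3310)
```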

  • Now check the cluster1 status again. The cluster has lost quorum, and no write activity is allowed.
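For example:

```
cluster1.status()
// the cluster now reports "status": "NO_QUORUM"
// and the killed members show as "UNREACHABLE"
```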

  • To fix this situation, we can connect to the surviving instance (“127.0.0.1:3309”) and reset the quorum with the command below.
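A sketch of the quorum reset, run after connecting to the surviving member:

```
\connect root@127.0.0.1:3309
cluster1 = dba.getCluster()
cluster1.forceQuorumUsingPartitionOf("127.0.0.1:3309")
```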

  • The instances on ports 3308 and 3310 are still down/failed. However, we have recovered the cluster quorum with a single primary member (“127.0.0.1:3309”), and we can now perform write activity on the cluster without any issues.

Later on, we can fix the failed instances, and they will join cluster1 again. Sometimes, we might need to perform the “rejoinInstance()” operation to add the nodes back, as in the sketch below.
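With the sandbox setup assumed here, that recovery could look like this:

```
dba.startSandboxInstance(3308)
dba.startSandboxInstance(3310)
// only needed if the members do not rejoin automatically:
cluster1.rejoinInstance("127.0.0.1:3308")
cluster1.rejoinInstance("127.0.0.1:3310")
```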

How to recover a complete cluster from a major outage

Sometimes, even when all the nodes are up, internal issues such as stuck Group Replication or a networking problem can cause a complete outage, leaving you unable to perform any writes or activity on the cluster.

In such circumstances, you can recover the cluster from the metadata of a single node. You need to connect to the most up-to-date instance; otherwise, you may lose data or end up with inconsistencies across the cluster nodes.

Let’s see how we can introduce such a situation manually and then try to fix that.

  • First, stop Group Replication on all three instances, as shown below.
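For example (repeat for each member):

```
\connect root@127.0.0.1:3308
\sql STOP GROUP_REPLICATION;
// repeat on 127.0.0.1:3309 and 127.0.0.1:3310
```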

Now we are completely stuck; even if we try to perform any writes on the last primary member, we see the error below, since super_read_only is enabled in order to protect the transactions on the database.
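An illustrative write attempt (the statement itself is arbitrary; the 1290 error is what MySQL returns):

```
mysql> CREATE DATABASE test1;
ERROR 1290 (HY000): The MySQL server is running with the --super-read-only option so it cannot execute this statement
```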

  • Now, we will fix the issue with the below set of steps.

1. Connect to the instance with the most executed GTIDs. We can confirm this by connecting to each instance and comparing the GTID sets with the command below.
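For example:

```
\sql SELECT @@GLOBAL.GTID_EXECUTED;
```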

2. In our case, the GTID set is the same on all three instances, so we can choose any of the nodes. Let’s choose the last primary (“127.0.0.1:3309”).

3. Finally, connect to the instance (“127.0.0.1:3309”) and execute the command below to recover from the outage.
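A sketch of the recovery:

```
\connect root@127.0.0.1:3309
cluster1 = dba.rebootClusterFromCompleteOutage()
// recent Shell versions also accept a {force: true} option that bypasses
// GTID and reachability checks; see the note below
```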

Note: Be careful with the force option, as it can bypass important checks like the GTID set or instance reachability.

Now, if we check the status again, we see all three nodes are up. The recovery command starts Group Replication again and rejoins all the instances.

How to perform switchover/failover from one cluster to another in a ClusterSet

Sometimes, we need to perform maintenance and update activities on the database instances. To avoid long downtime or a major impact on production, we perform a controlled switchover so the affected nodes can be taken offline without impacting the running workload.

Inside a ClusterSet, we can perform such an activity by changing the clusterRole from “REPLICA” to “PRIMARY” among the running clusters.

Let’s see the exact scenario below.

  • Fetch the ClusterSet information.
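For example (the root account is an assumption):

```
\connect root@127.0.0.1:3308
cs = dba.getClusterSet()
cs.status({extended: 1})
```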

  • Performing a switchover from “cluster1” to “cluster2”.
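The controlled switchover is a single AdminAPI call:

```
cs.setPrimaryCluster("cluster2")
```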

  • Now, if we check the status again, we can observe the change in the clusterRole.

In some unlucky situations, when one of the clusters (“cluster1”) is completely down and we don’t have any other choice, we might need to perform an emergency failover. The command below can be handy in those cases.
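A sketch of the emergency failover, run from a member of the surviving cluster:

```
cs = dba.getClusterSet()
cs.forcePrimaryCluster("cluster2")
```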

If you are using the MySQL Router interface for routing traffic in the ClusterSet, then you also need to change the router option with (“setRoutingOption”) so the traffic can be diverted to the new primary cluster (“cluster2”); otherwise, the application will fail and not be able to communicate.

In the next scenario, we will see the usage of “setRoutingOption” in more detail.

How to change traffic routes with MySQL Router

In the case of an InnoDB Cluster:

Within a single cluster such as “cluster1,” writes are routed to the primary instance, and reads are distributed across the secondaries in a balanced order. The router automatically detects failovers and follows the new primary.

In the case of a ClusterSet:

We can switch the primary target for routing traffic from “cluster1” to “cluster2” with the help of the steps below.

  • Verifying the current configurations.
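For example:

```
cs.routingOptions()
// lists the global policy and each registered router, including target_cluster
```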

  • Now, switching the target cluster from “cluster1” to “cluster2”.
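Setting the option globally (a router name can be passed as a first argument to target a single router instead):

```
cs.setRoutingOption("target_cluster", "cluster2")
```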

  • Checking the status again, we can see the target cluster is now changed to “cluster2”.

How to rejoin a lost cluster in the ClusterSet

There are situations when a cluster has been marked as invalidated, or there might be some issue with its replication channel. To fix that, we might need to perform the “rejoinCluster()” operation to add the cluster back to the ClusterSet.

We can simulate the same by using the below steps.

  • Stopping the async replication channel on the primary instance (“127.0.0.1:3311”) of cluster2, which is syncing from the instance (“127.0.0.1:3309”) of cluster1.
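A sketch of this step; 'clusterset_replication' is the channel name ClusterSet uses for its managed async channel, and the root account is an assumption:

```
\connect root@127.0.0.1:3311
\sql STOP REPLICA FOR CHANNEL 'clusterset_replication';
```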

  • If we check the status again, we see the replication channel is now stopped on the cluster2 instance.

  • Now, we will fix the same by running the below rejoin command.
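For example:

```
cs.rejoinCluster("cluster2")
```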

The stopped replication channel starts again automatically and syncs with the main cluster.

Summary

Here, we have discussed a few scenarios and methodologies that can be very useful for recovering cluster nodes and performing manual failovers in the topology. The options discussed above apply to both InnoDB Cluster and InnoDB ClusterSet environments.

Percona Distribution for MySQL is the most complete, stable, scalable, and secure open source MySQL solution available, delivering enterprise-grade database environments for your most critical business applications… and it’s free to use!


Try Percona Distribution for MySQL today!
