One of the most overlooked parameters when configuring MySQL is server_id, an integer that uniquely identifies a server within a replication topology. Note that no two servers within the same replication setup can have the same server_id value.

It is generally set to a “random” number, just different from the ones configured on the other replicas, and once set, it is rarely reviewed or changed later. That is usually fine, but if overlooked, it can lead to an unwanted skip of transactions in a recovery scenario like the one described below.

Suppose we have the following topology:
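(The original topology diagram is not reproduced here. Summarizing from the rest of the post: db1, with server_id=81, is the primary replicating to db2, with server_id=82; a planned failover promotes db2 to primary so that db1 can be taken down for maintenance.)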

Once the traffic is moved away from the previous primary (db1), we can start the maintenance on it. During the maintenance, a disk issue left the data in MySQL inconsistent.

Don’t worry, we have the backup taken from db2 earlier that same day, let’s use that!!

  • We copy the backup from db2 to db1 using rsync.
  • We follow the steps on how to restore a compressed backup from here.
  • We start MySQL with the same configuration that db1 had before the crash (server_id=81).
  • When setting up replication, we use the replication positions from xtrabackup_binlog_info, as we need the coordinates from db2 at the time the consistent backup was taken (see the sketch after this list).
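As a rough sketch of that last step (the backup path and the coordinates shown are made up for illustration), the xtrabackup_binlog_info file in the backup directory records the binary log file and position of the source at the moment the consistent backup was taken:

  # Coordinates recorded by Percona XtraBackup at backup time (values are illustrative)
  $ cat /backups/db2/xtrabackup_binlog_info
  mysql-bin.000213	4017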

Between the time the backup was taken and the time of the restore, many binlogs were generated that needed to be applied, so we expected replication to take some time to fully catch up with the primary.

Once the replication was configured, we noticed that the binlogs were advancing too fast.

Suppose we configured replication using the following positions:
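The exact positions from the original incident are not shown here; a hypothetical CHANGE MASTER TO built from the xtrabackup_binlog_info coordinates above would look roughly like this (host, credentials, file, and position are placeholders):

  mysql> CHANGE MASTER TO
      ->   MASTER_HOST='db2',
      ->   MASTER_USER='repl',
      ->   MASTER_PASSWORD='...',
      ->   MASTER_LOG_FILE='mysql-bin.000213',
      ->   MASTER_LOG_POS=4017;
  mysql> START SLAVE;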

We review the replication status:
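The original output is not reproduced here; a trimmed, illustrative SHOW SLAVE STATUS (all values are made up, keeping only the fields relevant to the story) would look something like this:

  mysql> SHOW SLAVE STATUS\G
              Master_Log_File: mysql-bin.000256
          Read_Master_Log_Pos: 1098
        Relay_Master_Log_File: mysql-bin.000215
          Exec_Master_Log_Pos: 50934
             Slave_IO_Running: Yes
            Slave_SQL_Running: Yes
        Seconds_Behind_Master: 0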

Notice something odd? Yes, Seconds_behind_master: 0, even though there are around 40+ binlogs left to apply.

It kept running at a “fast pace”; it advanced 40 binlogs in a matter of minutes, until it failed:

Why? Was our backup broken?

Upon checking the binlogs, we found that the binlog events that were processed “too fast” were generated by server_id=81, which is the same as the current replica’s (those records were generated on db1, which was the primary when the backup was taken). When a replica tries to apply events that carry its own server_id, it skips them by default.
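One way to confirm this is to inspect the server id recorded in each binlog event header with mysqlbinlog (the file name and the output below are illustrative):

  $ mysqlbinlog mysql-bin.000214 | grep 'server id' | head -3
  #230407 10:12:01 server id 81  end_log_pos 126 ...
  #230407 10:12:01 server id 81  end_log_pos 157 ...
  #230407 10:12:03 server id 81  end_log_pos 431 ...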

Then, once the failover was done, the binlog events started being generated with server_id=82 (db2). This is when the replica actually started applying events, but since it had skipped the many events carrying its own server_id, it missed many updates and inserts to the DB, leaving it in an inconsistent state.

In this case, there’s no other option but to retry the restoration, this time changing the server_id value on the restored server before starting replication.
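As a sketch, that simply means giving the restored server a server_id that is unique in the topology before configuring replication (the value 83 is just an example):

  # /etc/my.cnf on the restored server -- any value not used elsewhere in the topology
  [mysqld]
  server_id = 83

  # server_id is also a dynamic variable, so it can be changed without a restart:
  mysql> SET GLOBAL server_id = 83;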

Another approach that could have been applied, other than changing the server_id, was to enable the option replicate-same-server-id.

As per the documentation:

“This option is for use on replicas. The default is 0 (FALSE). With this option set to 1 (TRUE), the replica does not skip events that have its own server ID. This setting is normally useful only in rare configurations.”
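If that route were taken, it is just a replica-side option (a minimal sketch; as the documentation warns, it is rarely appropriate):

  # /etc/my.cnf on the replica -- only when you fully understand the implications
  [mysqld]
  replicate-same-server-id = 1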

Conclusion

The scenario described in this blog may not apply to all restorations, but it is a good reminder to review the server_id when performing restores in a replication environment. It can avoid headaches!

As was also briefly mentioned, the parameter replicate-same-server-id is not recommended unless you are sure of what you are doing.

Percona Distribution for MySQL is the most complete, stable, scalable, and secure open source MySQL solution available, delivering enterprise-grade database environments for your most critical business applications… and it’s free to use!

 

Try Percona Distribution for MySQL today!
