Schema changes are required to add new features or to fix bugs in an application, but there is no single standard procedure for making them quickly and safely. If the changes are made without the necessary precautions, you may face unwanted database outages that can cause serious problems for your business. In this blog post, I will delve into the most important things to consider when preparing a schema change.

Table size and concurrency

When assessing a schema change, one of the most important things to consider is the table size and its concurrency. For small tables, the ALTER operation usually takes from a few milliseconds up to a few seconds. This is where concurrency plays its part: if the table has periods of low concurrency during the day and the application can tolerate having it locked for a few seconds or minutes, then running a direct ALTER during that window is also a good option.

How can you define whether a table is small, medium, or large? Well, there is no strict rule for classifying the size of a table, but we usually consider a table small if its total size is less than 1GB, medium-sized if it is between 1GB and 100GB, and large if it is bigger than 100GB. However, size is not the only factor: if you have a 100MB table that is highly concurrent, it is better to wait for a low-concurrency period to execute the direct ALTER, or else use pt-online-schema-change to avoid locking the table for a long time. You can get an estimate of the table size by executing the following query:
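For illustration, a query along these lines works (the schema and table names are placeholders; InnoDB sizes reported by information_schema are estimates):

```sql
SELECT table_schema, table_name,
       ROUND((data_length + index_length) / 1024 / 1024 / 1024, 2) AS total_size_gb
FROM information_schema.tables
WHERE table_schema = 'mydb'      -- placeholder schema name
  AND table_name   = 'mytable';  -- placeholder table name
```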

Metadata locks

There are several blog posts about this locking mechanism and the common problems you may encounter when executing schema changes. These are some of the blog posts that my colleagues have published about these problems:

The most important takeaway from all these posts is that metadata locks are required for database consistency and cannot be avoided, not even with tools like pt-online-schema-change or gh-ost, as those also require a brief lock on the table to function. Another important fact to consider is that a metadata lock on a highly concurrent table can cause major locking issues in the database and can even lead to an outage.

If the metadata locks cannot be avoided, what can we do to minimize the impact? Well, one of the first things you should do is check if there are long-running transactions in the database that can block the acquisition of the metadata lock. Once you identify the long-running transactions, evaluate if those can be killed. To check if you have a long-running transaction, you can execute the following command; it will return the top five long-running transactions:
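As a sketch, the innodb_trx table can be queried for this (columns and ordering can be adapted as needed):

```sql
SELECT trx_mysql_thread_id AS processlist_id,
       trx_started,
       TIMESTAMPDIFF(SECOND, trx_started, NOW()) AS active_seconds,
       trx_query
FROM information_schema.innodb_trx
ORDER BY trx_started  -- oldest (longest-running) transactions first
LIMIT 5;
```

If one of these transactions can safely be terminated, KILL followed by its processlist_id will release the locks it holds.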

Once you have checked for long-running transactions, you can proceed to execute the ALTER statement directly or use another tool like pt-online-schema-change. When using a direct ALTER command, make sure you first set the session variable “lock_wait_timeout” to a low value before running the ALTER statement. In the following example, this variable is set to five seconds, and the session will abort the ALTER command if it cannot acquire the metadata lock in five seconds:
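A minimal sketch, assuming an illustrative table and column:

```sql
SET SESSION lock_wait_timeout = 5;  -- give up after 5 seconds waiting for the metadata lock
ALTER TABLE mydb.mytable ADD COLUMN notes VARCHAR(100);  -- placeholder change
```

If the lock cannot be acquired in time, the ALTER fails with a lock wait timeout error and can simply be retried later.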

When using pt-online-schema-change, you can also set this variable to a low value to avoid major blocking issues when the tool creates the triggers it requires. Notice the “--set-vars” option in the following example; it sets the variable “lock_wait_timeout” to one second.
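A hedged example (the database, table, and the ALTER itself are placeholders):

```shell
pt-online-schema-change \
  --set-vars lock_wait_timeout=1 \
  --alter "ADD COLUMN notes VARCHAR(100)" \
  D=mydb,t=mytable \
  --execute
```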

Type of topology

The topology is important because executing the schema change on a stand-alone server is not the same as executing it on a classic replication topology or on a Galera-based cluster such as Percona XtraDB Cluster (PXC). When the change is executed on a stand-alone server, there are only a few ways to run it, but you still need to consider the points mentioned above.

When executing the change on a replication topology, it is important to check all nodes for long-running transactions, as those might prevent the change from executing on one of the replicas and cause replication lag. If you are using pt-online-schema-change and low replication lag is crucial for your application, consider adding the “--recursion-method” and “--max-lag” options to your command. With those settings in place, the tool checks the replication lag and throttles execution to avoid increasing it. You can find more information about these settings in the documentation, but here is a simple example of their usage:
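For example (database and table names, the lag threshold, and the change itself are illustrative):

```shell
pt-online-schema-change \
  --recursion-method=hosts \
  --max-lag=5 \
  --alter "ADD COLUMN notes VARCHAR(100)" \
  D=mydb,t=mytable \
  --execute
```

With --max-lag=5, the tool pauses copying rows whenever a replica falls more than five seconds behind, and resumes once the lag recovers.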

If you decide to execute a direct ALTER rather than using pt-online-schema-change, and the table is not small or is highly concurrent, you can apply the change as a rolling upgrade, executing it on each replica node first with the following high-level steps:

  • Stop the application traffic to the node
  • Stop replication if needed
  • Set variable “sql_log_bin” to OFF to avoid recording the change in the binlog
  • Execute the ALTER statement
  • Set variable “sql_log_bin” to ON
  • Start replication if it was stopped
  • Resume application traffic
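The steps above can be sketched as the following session on each replica (the table and column are illustrative; use STOP SLAVE/START SLAVE on older MySQL versions):

```sql
STOP REPLICA;                    -- optional: pause replication during the change
SET SESSION sql_log_bin = OFF;   -- keep the ALTER out of the binary log
ALTER TABLE mydb.mytable ADD COLUMN notes VARCHAR(100);
SET SESSION sql_log_bin = ON;
START REPLICA;
```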

Once all replica nodes are updated, you can execute a failover to promote one of the replica nodes as the new primary node and execute the above steps on the former primary node.

Finally, when executing the schema change on a Galera or PXC cluster with pt-online-schema-change, you will want to use the “--max-flow-ctl” option. It monitors flow control events and throttles the execution to reduce them and allow the cluster to catch up. Here is a simple example of its usage:
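A sketch of the command (the threshold and the change itself are illustrative):

```shell
pt-online-schema-change \
  --max-flow-ctl=0 \
  --alter "ADD COLUMN notes VARCHAR(100)" \
  D=mydb,t=mytable \
  --execute
```

The option takes a percentage: the tool pauses when the average time the cluster spends paused for flow control exceeds that value, so 0 makes it throttle on any flow control activity.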

Another thing to consider when altering tables on a Galera or PXC cluster is that a direct ALTER tends to stall the whole cluster while the change is applied on all nodes. Consider following the “Manual RSU” (Rolling Schema Upgrade) approach to perform schema modifications on this topology type. The following article is a great resource for implementing the “Manual RSU”: How to Perform Compatible Schema Changes in Percona XtraDB Cluster (Advanced Alternative)?

Conclusion

There is no free ticket when applying schema changes; you need to analyze and prepare for all the possible drawbacks you might encounter during the implementation. I hope you find this article helpful and that it keeps you from finding yourself in the middle of an outage caused by a simple ALTER on the database.

If you still have doubts about implementing the changes on your production databases, consider asking for help. Here at Percona, we have several services to help you with your database-specific needs. For example, with Managed Services, we can take care of the difficult changes on your databases while you focus on the most important thing: growing your business. If you have your own DBA team but you need help with more challenging tasks, then Support would be a really good fit for your needs, as you’ll have access to top-class professionals willing to help you troubleshoot the most difficult problems. Finally, if you have a one-off complex task or project that you need help with, contact our Consulting services. Our group of experts will guide you through the whole process to make your project thrive.

