Schema changes are required to add new features or to fix bugs in an application, but there is no single standard procedure for making them quickly and safely. If the changes are made without the necessary precautions, you may face unwanted database outages that can cause serious problems for your business. In this blog post, I will delve into the most important things to consider when preparing a schema change.

Table size and concurrency

When assessing a schema change, one of the most important things to consider is the table size and its concurrency. For small tables, the ALTER operation usually takes from a few milliseconds up to a few seconds. This is where concurrency plays its part: if the table has periods of low concurrency during the day and the application can tolerate having it locked for a few seconds or minutes, then running a direct ALTER during that window is also a good option.

How can you define whether a table is small, medium, or large? Well, there is no strict rule for classifying the size of a table, but we usually consider a table small if its total size is less than 1GB, medium-sized if it is between 1GB and 100GB, and large if it is bigger than 100GB. However, size is not the only factor: if you have a 100MB table that is highly concurrent, it is better to wait for a low-concurrency period to execute the direct ALTER, or else use pt-online-schema-change to avoid locking the table for a long time. You can get an estimate of the table size by executing the following query:
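For illustration, a query along these lines works (the schema and table names are placeholders; InnoDB sizes reported by information_schema are estimates):

```sql
SELECT table_schema, table_name,
       ROUND((data_length + index_length) / 1024 / 1024 / 1024, 2) AS total_size_gb
FROM information_schema.tables
WHERE table_schema = 'mydb'      -- placeholder schema name
  AND table_name   = 'mytable';  -- placeholder table name
```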

Metadata locks

There are several blog posts about this locking mechanism and the common problems you may encounter when executing schema changes. These are some of the blog posts that my colleagues have published about these problems:

The most important takeaway from all these posts is that metadata locks are required for database consistency and cannot be avoided, not even with tools like pt-online-schema-change or gh-ost, as those also require a brief lock on the table to function. Another important fact to consider is that a metadata lock on a highly concurrent table can cause major locking issues in the database and can even lead to an outage.

If the metadata locks cannot be avoided, what can we do to minimize the impact? Well, one of the first things you should do is check if there are long-running transactions in the database that can block the acquisition of the metadata lock. Once you identify the long-running transactions, evaluate if those can be killed. To check if you have a long-running transaction, you can execute the following command; it will return the top five long-running transactions:
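As a sketch, the innodb_trx table can be queried for this (columns and ordering can be adapted as needed):

```sql
SELECT trx_mysql_thread_id AS processlist_id,
       trx_started,
       TIMESTAMPDIFF(SECOND, trx_started, NOW()) AS active_seconds,
       trx_query
FROM information_schema.innodb_trx
ORDER BY trx_started  -- oldest (longest-running) transactions first
LIMIT 5;
```

If one of these transactions can safely be terminated, KILL followed by its processlist_id will release the locks it holds.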

Once you have checked for long-running transactions, you can proceed to execute the ALTER statement directly or use another tool like pt-online-schema-change. When using a direct ALTER command, make sure you first set the session variable “lock_wait_timeout” to a low value before running the ALTER statement. In the following example, this variable is set to five seconds, and the session will abort the ALTER command if it cannot acquire the metadata lock in five seconds:
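A minimal sketch, assuming an illustrative table and column:

```sql
SET SESSION lock_wait_timeout = 5;  -- give up after 5 seconds waiting for the metadata lock
ALTER TABLE mydb.mytable ADD COLUMN notes VARCHAR(100);  -- placeholder change
```

If the lock cannot be acquired in time, the ALTER fails with a lock wait timeout error and can simply be retried later.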

When using pt-online-schema-change, you can also set this variable to a low value to avoid major blocking issues when the tool creates the triggers it requires. Notice the “--set-vars” option in the following example; it sets the variable “lock_wait_timeout” to one second.
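A hedged example (the database, table, and the ALTER itself are placeholders):

```shell
pt-online-schema-change \
  --set-vars lock_wait_timeout=1 \
  --alter "ADD COLUMN notes VARCHAR(100)" \
  D=mydb,t=mytable \
  --execute
```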

Type of topology

The topology is important because executing the schema change on a stand-alone server is not the same as executing it on a classic replication topology or on a Galera-based cluster such as Percona XtraDB Cluster (PXC). When the change is executed on a stand-alone server, there are only a few ways to run it, but you still need to consider the points mentioned above.

When executing the change on a replication topology, it is important to check all nodes for long-running transactions, as those might prevent the change from executing on one of the replicas and cause replication lag. If you are using pt-online-schema-change and low replication lag is crucial for your application, consider adding the “--recursion-method” and “--max-lag” options to your command. With those settings in place, the tool checks the replication lag and throttles execution to avoid increasing it. You can find more information about these settings in the documentation, but here is a simple example of their usage:
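For example (database and table names, the lag threshold, and the change itself are illustrative):

```shell
pt-online-schema-change \
  --recursion-method=hosts \
  --max-lag=5 \
  --alter "ADD COLUMN notes VARCHAR(100)" \
  D=mydb,t=mytable \
  --execute
```

With --max-lag=5, the tool pauses copying rows whenever a replica falls more than five seconds behind, and resumes once the lag recovers.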

If you decide to execute a direct ALTER rather than using pt-online-schema-change, and the table is not small or is highly concurrent, you can apply the change as a rolling upgrade, executing it on each replica node first with the following high-level steps:

  • Stop the application traffic to the node
  • Stop replication if needed
  • Set variable “sql_log_bin” to OFF to avoid recording the change in the binlog
  • Execute the ALTER statement
  • Set variable “sql_log_bin” to ON
  • Start replication if it was stopped
  • Resume application traffic
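The steps above can be sketched as the following session on each replica (the table and column are illustrative; use STOP SLAVE/START SLAVE on older MySQL versions):

```sql
STOP REPLICA;                    -- optional: pause replication during the change
SET SESSION sql_log_bin = OFF;   -- keep the ALTER out of the binary log
ALTER TABLE mydb.mytable ADD COLUMN notes VARCHAR(100);
SET SESSION sql_log_bin = ON;
START REPLICA;
```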

Once all replica nodes are updated, you can execute a failover to promote one of the replica nodes as the new primary node and execute the above steps on the former primary node.

Finally, when executing the schema change on a Galera or PXC cluster with pt-online-schema-change, you will want to use the “--max-flow-ctl” option. It monitors flow control events and throttles the execution to reduce them and allow the cluster to catch up. Here is a simple example of its usage:
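A sketch of the command (the threshold and the change itself are illustrative):

```shell
pt-online-schema-change \
  --max-flow-ctl=0 \
  --alter "ADD COLUMN notes VARCHAR(100)" \
  D=mydb,t=mytable \
  --execute
```

The option takes a percentage: the tool pauses when the average time the cluster spends paused for flow control exceeds that value, so 0 makes it throttle on any flow control activity.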

Another thing to consider when altering tables on a Galera or PXC cluster is that a direct ALTER tends to stall the whole cluster while the change is applied on all nodes. Consider following the “Manual RSU” (Rolling Schema Upgrade) approach to perform schema modifications on this topology type. The following article is a great resource for implementing the “Manual RSU”: How to Perform Compatible Schema Changes in Percona XtraDB Cluster (Advanced Alternative)?

Conclusion

There is no free ticket when applying schema changes; you need to analyze and prepare for all the possible drawbacks you might encounter during the implementation. I hope you find this article helpful and that it keeps you from finding yourself in the middle of an outage caused by a simple ALTER on the database.

If you still have doubts about implementing the changes on your production databases, consider asking for help. Here at Percona, we have several services to help you with your database-specific needs. For example, with Managed Services, we can take care of the difficult changes on your databases while you focus on the most important thing: growing your business. If you have your own DBA team but you need help with more challenging tasks, then Support would be a really good fit for your needs, as you’ll have access to top-class professionals willing to help you troubleshoot the most difficult problems. Finally, if you have a one-off complex task or project that you need help with, contact our Consulting services. Our group of experts will guide you through the whole process to make your project thrive.

