It is a well-known good practice to keep only the indexes you actually need, to limit write-performance and disk-space overhead. This simple rule is mentioned briefly in the official MySQL documentation:

https://dev.mysql.com/doc/refman/8.0/en/optimization-indexes.html

However, in some cases, the overhead from adding a new index can be far beyond expectations! I recently analyzed a customer case like that, and it encouraged me to share this kind of example, as it may well surprise many developers and even DBAs.

Let’s take this rather specific example table, which has only three columns, with the primary key created on two of them:
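The original schema is not reproduced here, but based on the description (a composite primary key on two large columns plus one small INT column), it could look like this sketch; the table name, column names, and varchar lengths are all assumptions:

```sql
-- Hypothetical schema matching the description: composite PK on
-- two large varchar columns, plus one small INT column (b)
CREATE TABLE t1 (
  a VARCHAR(200) NOT NULL,
  b INT NOT NULL,
  c VARCHAR(200) NOT NULL,
  PRIMARY KEY (a, c)
) ENGINE=InnoDB;
```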

With this schema, the table holding 5M random rows has these statistics (note the Index_length):
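Statistics like these typically come from a query along these lines (table name assumed, matching the sketch above):

```sql
SHOW TABLE STATUS LIKE 't1'\G
-- Data_length  = size of the clustered (primary key) index
-- Index_length = combined size of all secondary indexes
```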

The size on disk (table was optimized to rule out fragmentation):
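One way to check the on-disk size is to look at the table's .ibd file directly; the datadir path and schema/table names below are assumptions:

```shell
# Adjust the path to your datadir and schema; names are assumed
ls -lh /var/lib/mysql/test/t1.ibd
```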

When our queries use column b in the WHERE clause, it is natural to optimize them by adding an index on that column, to avoid a bad execution plan (i.e., a full table scan):
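Such a plan can be confirmed with EXPLAIN; the query shape below is an assumption based on the description:

```sql
EXPLAIN SELECT * FROM t1 WHERE b = 1000\G
-- Without an index on b, this shows type: ALL (full table scan)
-- and an estimated rows count close to the full table size.
```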

So, we add the index:
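Something along these lines (index name assumed):

```sql
-- Add a secondary index on the small INT column
ALTER TABLE t1 ADD INDEX b (b);
```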

And the query is very fast now. However, the table size increased by 400 MB!

You may ask: how is it possible that adding an index on one small INT column caused such huge growth? The table size increased by 79%, which is completely unexpected given we only indexed the smallest column in the table!

Would you be surprised if I told you that I actually expected it to grow even more? The reason is that a secondary index has the primary key columns appended to its records. Let me quote the documentation here:

 In InnoDB, each record in a secondary index contains the primary key columns for the row, as well as the columns specified for the secondary index. InnoDB uses this primary key value to search for the row in the clustered index.

If the primary key is long, the secondary indexes use more space, so it is advantageous to have a short primary key.

Therefore, in this table's case, the new index contains all three columns, effectively duplicating all the table's values! Before the change, there was only the clustered (primary key) index, which holds the whole row data (both the primary key columns and the remaining column). The new index on column b includes that column plus the columns defined as the primary key, so again all three columns in this specific case. Therefore, I expected the table to double in size. Let's investigate why this did not happen and why the tablespace file did not grow to ~1 GB.

With the help of the innodb_ruby tool, I checked the index statistics before the ALTER:
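With innodb_ruby's innodb_space command, the per-index page allocation can be listed; the paths and table name below are assumptions:

```shell
# System tablespace plus the per-table .ibd; adjust paths to your setup
innodb_space -s /var/lib/mysql/ibdata1 -T test/t1 space-indexes
```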

To find out which index has id=314, we can use this query:
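In MySQL 8.0, index ids can be resolved through the InnoDB data dictionary views, for example:

```sql
-- Map an InnoDB index id back to its index and table name
SELECT i.index_id, i.name AS index_name, t.name AS table_name
  FROM information_schema.INNODB_INDEXES i
  JOIN information_schema.INNODB_TABLES t USING (table_id)
 WHERE i.index_id = 314;
```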

So there are ~32k allocated pages in the table. The tool lets you dig into many details of the InnoDB tablespace, for example, the per-page usage:
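For instance, the space-index-pages-summary mode prints, for every page, which index it belongs to and how the space on it is used (paths and names assumed as before):

```shell
# Per-page usage summary; adjust paths and names to your setup
innodb_space -s /var/lib/mysql/ibdata1 -T test/t1 space-index-pages-summary | head
```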

After adding the secondary index, we can see more details about how the new index compares with the primary key:

Index 315 is the secondary index on column b:

And we can see it has fewer pages allocated in total: around 25k vs. the 32k the PK has. So why does the clustered index need more pages to hold the same data values? The index summary shows the difference between the two:

The innodb_ruby tool shows that the secondary index (id 315) is able to store more records per page than the clustered index (id 314); the latter leaves more free space per page. This explains why duplicating the values did not exactly double the tablespace size. The tool also lets us illustrate this nicely, using GNUplot:

One important side effect of copying the primary key values is that the index on column b is a covering index in our table's case!

This is why we can see Using index in the Extra column, even though the index is on one column only:
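A hedged sketch of such a plan, assuming the hypothetical schema and query shape used above:

```sql
EXPLAIN SELECT a, c FROM t1 WHERE b = 1000\G
-- key: b, Extra: Using index
-- The secondary index alone provides b plus the PK columns (a, c),
-- so no lookup into the clustered index is needed.
```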

A conclusion from this is that we want primary keys to be as small as possible. Let's modify the table accordingly:
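One common way to shrink the PK is to introduce a short surrogate key and keep the old composite key as a UNIQUE constraint; this sketch again assumes the hypothetical schema above:

```sql
-- Replace the large composite PK with a small surrogate key,
-- preserving uniqueness of (a, c) via a UNIQUE index
ALTER TABLE t1
  DROP PRIMARY KEY,
  ADD COLUMN id INT UNSIGNED NOT NULL AUTO_INCREMENT FIRST,
  ADD PRIMARY KEY (id),
  ADD UNIQUE KEY (a, c);
```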

This reduces the table size from 908M to:

(The disk space was automatically reclaimed because changing the primary key effectively re-creates the table.)

But it does not make much sense if other queries filter on columns a and c, as we'd need to add another index to satisfy them, which would amplify the overall size even more due to the large size of those columns:
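For example, under the assumed schema, such queries would need a secondary index that stores the large varchar values themselves (plus the PK):

```sql
-- Hypothetical: a secondary index covering lookups by (a, c)
-- would carry the large varchar values in every index record
ALTER TABLE t1 ADD INDEX a_c (a, c);
```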

On the other hand, this schema change will benefit other operations, like row-based replication (a simple INT PK lookup vs. a compound varchar lookup), checksumming, etc. The modified table definition looks like this:
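The original definition is not reproduced here; under the assumptions used in the sketches above, it might look like:

```sql
-- Hypothetical result of the PK change; all names are assumptions
CREATE TABLE t1 (
  id INT UNSIGNED NOT NULL AUTO_INCREMENT,
  a VARCHAR(200) NOT NULL,
  b INT NOT NULL,
  c VARCHAR(200) NOT NULL,
  PRIMARY KEY (id),
  UNIQUE KEY a (a, c),
  KEY b (b)
) ENGINE=InnoDB;
```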

The new index statistics reflect how the optimized primary key makes the same secondary index (id=323) much smaller: 4.7k pages instead of 25k:

The above example is a bit atypical, but my goal was to remind you of an often forgotten behavior of the InnoDB engine with respect to indexes. The overhead of adding a secondary index grows proportionally with the size of the primary key! The differences in disk usage can be huge, so keep this in mind while designing your tables.

For the same reason, the fairly common practice of using UUIDs for primary keys hurts performance and increases the overall cost of storing the data. If there is a real need to use UUIDs, consider the nice workarounds described here:

Storing UUID Values in MySQL

UUIDs are Popular, but Bad for Performance — Let’s Discuss

Storing UUID and Generated Columns

And cut the UUID overhead significantly!

I encourage you to check more details on good practices for choosing the primary key in these blog posts:

Tuning InnoDB Primary Keys

Illustrating Primary Key models in InnoDB and their impact on disk usage

Queries for Finding Poorly-Designed MySQL Schemas and How to Fix Them
