MySQL Table Size Is Way Bigger After Adding a Simple Index; Why?

It is a known good practice to keep only necessary indexes to reduce the write performance and disk space overhead. This simple rule is mentioned briefly in the official MySQL Documentation:

https://dev.mysql.com/doc/refman/8.0/en/optimization-indexes.html

However, in some cases, the overhead from adding a new index can be way above the expectations! Recently, I’ve been analyzing a customer case like that, and it encouraged me to share this kind of example as it surely may be surprising for many developers and even DBAs.

Let’s take this pretty specific table example, which has only three columns. The primary key is created on two of them:

mysql > show create table t1\G

*************************** 1. row ***************************

Table: t1

Create Table: CREATE TABLE `t1` (

`a` varchar(32) NOT NULL,

`b` int unsigned NOT NULL,

`c` varchar(32) NOT NULL,

PRIMARY KEY (`a`,`c`)

) ENGINE=InnoDB DEFAULT CHARSET=utf8mb4 COLLATE=utf8mb4_0900_ai_ci

1 row in set (0.00 sec)

With this schema, the table having 5M random rows has these statistics (note the Index_lenght):

mysql > show table status like 't1'\G

*************************** 1. row ***************************

Name: t1

Engine: InnoDB

Version: 10

Row_format: Dynamic

Rows: 4969944

Avg_row_length: 104

Data_length: 521125888

Max_data_length: 0

Index_length: 0

Data_free: 2097152

Auto_increment: NULL

Create_time: 2024-01-22 22:39:52

Update_time: NULL

Check_time: NULL

Collation: utf8mb4_0900_ai_ci

Checksum: NULL

Create_options:

Comment:

1 row in set (0.01 sec)

The size on disk (table was optimized to rule out fragmentation):

1 2	$ ls -lh db1/t1.ibd -rw-r----- 1 przemek przemek 508M Jan 22 22:40 db1/t1.ibd

When our queries use column b in the WHERE clause, it is natural that we have to optimize such query by adding an index to the column to avoid this kind of bad execution (i.e., full table scan):

mysql > EXPLAIN select * from t1 where b=10\G

*************************** 1. row ***************************

id: 1

select_type: SIMPLE

table: t1

partitions: NULL

type: ALL

possible_keys: NULL

key: NULL

key_len: NULL

ref: NULL

rows: 4976864

filtered: 0.00

Extra: Using where

1 row in set, 1 warning (0.00 sec)

So, we add the index:

mysql > alter table t1 add key(b);

Query OK, 0 rows affected (38.68 sec)

Records: 0 Duplicates: 0 Warnings: 0

And the query is very fast now. However, the table size increased by 400 MB!

1 2	$ ls -lh db1/t1.ibd -rw-r----- 1 przemek przemek 908M Jan 22 23:11 db1/t1.ibd

You’d ask — how is it possible that adding an index on one small INT column made that huge growth? The table size increased by +79%, totally not expected given we only indexed the smallest column in the table!

Would you be surprised if I told you that I actually expected it to grow even more? The reason is that a secondary index has the primary key columns appended to its records. Let me quote the documentation here:

In InnoDB, each record in a secondary index contains the primary key columns for the row, as well as the columns specified for the secondary index. InnoDB uses this primary key value to search for the row in the clustered index.
If the primary key is long, the secondary indexes use more space, so it is advantageous to have a short primary key.

Therefore, in this table case, a new index will contain all three columns, effectively duplicating all table values! Before the change, there was only the clustered (primary) index, which holds the whole row data (both its defined columns and the other column(s). The new index on column b includes this column plus columns defined as primary key, so again, all three columns in this specific table case. Therefore, I expected the table to double its size. Let’s investigate why this did not happen and why the table space file did not grow to ~1GB.

With the help of the innodb_ruby tool, I checked the index statistics before the ALTER:

$ innodb_space -f db1/t1.ibd space-indexes

id name root fseg fseg_id used allocated fill_factor

314 4 internal 3 132 159 83.02%

314 4 leaf 4 27680 31648 87.46%

To find out what is the index with id=314, we can use this query:

mysql > select SPACE,INDEX_ID,i.NAME as index_name, t.NAME as table_name,FILE_SIZE from information_schema.INNODB_INDEXES i JOIN information_schema.INNODB_TABLESPACES t USING(space) WHERE t.NAME='db1/t1'\G

*************************** 1. row ***************************

SPACE: 15

INDEX_ID: 314

index_name: PRIMARY

table_name: db1/t1

FILE_SIZE: 532676608

1 row in set (0.00 sec)

So there are ~32k allocated pages in the table. The tool lets you get into many details about the InnoDB tablespace, for example, what is the per page usage, i.e.:

$ innodb_space -f db1/t1.ibd space-index-pages-summary|head -10

page index level data free records

4 314 2 9371 6819 131

5 314 0 15105 1055 189

6 314 0 15101 1063 183

7 314 1 15060 1088 214

8 314 0 15058 1104 184

9 314 0 15121 1041 184

10 314 0 15118 1044 184

11 314 0 15063 1101 180

12 314 0 15072 1092 180

After adding the secondary index, we can see more details about how the new index compares with the primary key:

mysql > show table status like 't1'\G

*************************** 1. row ***************************

Name: t1

Engine: InnoDB

Version: 10

Row_format: Dynamic

Rows: 4981016

Avg_row_length: 104

Data_length: 521125888

Max_data_length: 0

Index_length: 413122560

Data_free: 4194304

Auto_increment: NULL

Create_time: 2024-01-22 23:11:19

Update_time: NULL

Check_time: NULL

Collation: utf8mb4_0900_ai_ci

Checksum: NULL

Create_options:

Comment:

1 row in set (0.00 sec)

$ innodb_space -f db1/t1.ibd space-indexes

id name root fseg fseg_id used allocated fill_factor

314 4 internal 3 132 159 83.02%

314 4 leaf 4 27680 31648 87.46%

315 39 internal 5 104 159 65.41%

315 39 leaf 6 21914 25056 87.46%

The index 315 is the secondary on column “b”:

*************************** 1. row ***************************

SPACE: 15

INDEX_ID: 314

index_name: PRIMARY

table_name: db1/t1

FILE_SIZE: 952107008

*************************** 2. row ***************************

SPACE: 15

INDEX_ID: 315

index_name: b

table_name: db1/t1

FILE_SIZE: 952107008

2 rows in set (0.01 sec)

And we can see it has fewer total pages allocated, around 25k, vs 32k the PK has. So why does the clustered index need more pages to hold the same data values? The index summary shows the difference between the two:

$ innodb_space -f db1/t1.ibd space-index-pages-summary
page        index   level   data    free    records
(...)
6327        314     0       15057   1107    181     
6328        314     0       15138   1026    180     
6329        314     0       15128   1038    178     
6330        314     0       15134   1032    178     
6331        314     0       15135   1031    179     
6332        314     0       15071   1095    179     
6333        314     0       15092   1072    180     
6334        314     0       15072   1094    179     
6335        314     0       15115   1051    179     
6336        314     1       15064   1086    211     
6337        314     1       15098   1052    210     
6338        314     1       15112   1038    211     
6339        314     1       15119   1029    212     
6340        315     0       16072   70      227     
6341        315     0       16135   5       228     
6342        315     1       16132   16      215     
6343        315     0       16118   22      228     
6344        315     0       16097   45      226     
6345        315     0       16074   66      229     
6346        315     0       16093   47      229     
6347        315     0       16091   49      228     
6348        315     0       16073   69      227     
6349        315     0       16092   48      230     
6350        315     0       16095   45      228     
6351        315     0       16133   9       227     
6352        315     0       16126   14      230

$ innodb_space -f db1/t1.ibd space-index-pages-summary

page index level data free records

(...)

6327 314 0 15057 1107 181

6328 314 0 15138 1026 180

6329 314 0 15128 1038 178

6330 314 0 15134 1032 178

6331 314 0 15135 1031 179

6332 314 0 15071 1095 179

6333 314 0 15092 1072 180

6334 314 0 15072 1094 179

6335 314 0 15115 1051 179

6336 314 1 15064 1086 211

6337 314 1 15098 1052 210

6338 314 1 15112 1038 211

6339 314 1 15119 1029 212

6340 315 0 16072 70 227

6341 315 0 16135 5 228

6342 315 1 16132 16 215

6343 315 0 16118 22 228

6344 315 0 16097 45 226

6345 315 0 16074 66 229

6346 315 0 16093 47 229

6347 315 0 16091 49 228

6348 315 0 16073 69 227

6349 315 0 16092 48 230

6350 315 0 16095 45 228

6351 315 0 16133 9 227

6352 315 0 16126 14 230

The innodb_ruby tool allows us to see that the secondary index (id 315) is able to store more records on one page as compared to the clustered index (id 314). The latter leaves more free space per page. This explains why duplicating values did not exactly cause duplicating the table space size. The tool allows us to illustrate this nicely, using GNUplot:

One important side effect of copying primary key values is that column b’s index is a covering one in our table case!

This is why we can see Using index in the extra info, even though the index is on one column only:

mysql > EXPLAIN select * from t1 where b=10\G

*************************** 1. row ***************************

id: 1

select_type: SIMPLE

table: t1

partitions: NULL

type: ref

possible_keys: b

key: b

key_len: 4

ref: const

rows: 1

filtered: 100.00

Extra: Using index

1 row in set, 1 warning (0.00 sec)

A conclusion from this is we want smaller primary keys if possible. Let’s modify this table accordingly:

mysql > alter table t1 drop primary key, add column id int primary key auto_increment;

Query OK, 0 rows affected (16.66 sec)

Records: 0 Duplicates: 0 Warnings: 0

This reduces the table size from 908M to:

1 2	$ ls -lh db1/t1.ibd -rw-r----- 1 przemek przemek 608M Jan 23 19:12 db1/t1.ibd

(The disk space was automatically reclaimed because changing the primary key effectively re-creates the table.)

But it does not make much sense if other queries filter via columns a and c, as we’d need to add another index to satisfy them, which would amplify the overall size even more due to the large size of those columns:

mysql > alter table t1 add key a_c(a,c);

Query OK, 0 rows affected (45.33 sec)

Records: 0 Duplicates: 0 Warnings: 0

1 2	$ ls -lh db1/t1.ibd -rw-r----- 1 przemek przemek 1004M Jan 23 19:16 db1/t1.ibd

On the other hand, this schema change will benefit other operations, like ROW replication (simple INT PK lookup vs compound varchar lookup), checksumming, etc. The modified table definition looks like this:

mysql > show create table t1\G

*************************** 1. row ***************************

Table: t1

Create Table: CREATE TABLE `t1` (

`a` varchar(32) NOT NULL,

`b` int unsigned NOT NULL,

`c` varchar(32) NOT NULL,

`id` int NOT NULL AUTO_INCREMENT,

PRIMARY KEY (`id`),

KEY `b` (`b`),

KEY `a_c` (`a`,`c`)

) ENGINE=InnoDB AUTO_INCREMENT=5001368 DEFAULT CHARSET=utf8mb4 COLLATE=utf8mb4_0900_ai_ci

1 row in set (0.00 sec)

The new index statistics reflect how the optimized primary key makes the same secondary index (id=323) size overhead way smaller – 4.7k pages instead of 25k pages:

*************************** 1. row ***************************

SPACE: 19

INDEX_ID: 322

index_name: PRIMARY

table_name: db1/t1

FILE_SIZE: 1052770304

*************************** 2. row ***************************

SPACE: 19

INDEX_ID: 323

index_name: b

table_name: db1/t1

FILE_SIZE: 1052770304

*************************** 3. row ***************************

SPACE: 19

INDEX_ID: 324

index_name: a_c

table_name: db1/t1

FILE_SIZE: 1052770304

3 rows in set (0.00 sec)

$ innodb_space -f db1/t1.ibd space-indexes

id name root fseg fseg_id used allocated fill_factor

322 4 internal 3 27 27 100.00%

322 4 leaf 4 29003 33184 87.40%

323 5 internal 5 6 6 100.00%

323 5 leaf 6 4157 4768 87.19%

324 32354 internal 7 104 159 65.41%

324 32354 leaf 8 21913 25056 87.46%

The above example is a bit not typical, but my goal was to remind of an often forgotten behavior of the InnoDB engine with respect to indexes. The overhead from adding secondary indexes is proportionally bigger the larger the primary key is! The differences in disk usage can be huge, so keep it in mind while designing your tables.

For the same reason, the pretty common practice of using UUIDs for primary keys will hurt the performance and overall cost of storing the data. In case there is a real need to use UUIDs, maybe you can implement the nice workarounds described here:

Storing UUID Values in MySQL

Storing UUID and Generated Columns

And cut the UUID overhead significantly!

I encourage you to check more details on good practices for choosing the primary key in these blog posts:

Tuning InnoDB Primary Keys

Illustrating Primary Key models in InnoDB and their impact on disk usage

Queries for Finding Poorly-Designed MySQL Schemas and How to Fix Them