MySQL generated columnsSome time ago, a customer had a performance issue with an internal process. He was comparing, finding, and reporting the rows that were different between two tables. This is simple if you use a LEFT JOIN and an  IS NULL  comparison over the second table in the WHERE clause, but what if the column could be null? That is why he used UNION, GROUP BY and a HAVING clauses, which resulted in poor performance.

The challenge was to be able to compare each row using a LEFT JOIN over NULL values.

The challenge in more detail

I’m not going to use the customer’s real table. Instead, I will be comparing two sysbench tables with the same structure:

It is sightly different from the original sysbench schema, as this version can hold NULL values. Both tables have the same number of rows. We are going to set to NULL one row on each table:

If we execute the comparison query, we get this result:

As you can see, column k is NULL. In both cases it failed and reported those rows to be different. This is not new in MySQL, but it would be nice to have a way to sort this issue out.

Solution

The solution is based on GENERATED COLUMNS with a hash function (md5) and stored in a binary(16) column:

Adding the index is also part of the solution. Now, let’s execute the query using the new column to join the tables:

We can see an improvement in the query performance—it now takes 2.31 sec whereas before it was 3.00 sec—and that the result is as expected. We could say that that’s all, and no possible improvement can be made. However, is not true. Even though the query is running faster, it is possible to optimize it in this way:

Why is this faster? The first query is performing two subqueries. Each subquery is very similar. Let’s check the explain plan:

As you can see, it is performing a full table scan over the first table and using real_id to join the second table. The real_id is a generated column, so it needs to execute the function to get the value to join the second table. That means that it’s going to take time.

If we analyze the subquery of the second query:

We are going to see that it is performing a full index scan over the first table, and that the generated column has never been executed. That is how we can go from an inconsistent result of three seconds, to a consistent result of 2.31 seconds, to finally reach a performant query using the faster time of 1.60 seconds.

Conclusions

This is not the first blog post that I’ve done about generated columns. I think that it is a useful feature for several scenarios where you need to improve performance. In this particular case, it’s also presenting a workaround to expected inconsistencies with LEFT JOINS with NULL values. It is also important to mention that this improved a process in a real world scenario.

7 Comments
Oldest
Newest Most Voted
Inline Feedbacks
View all comments
Jouni Järvinen

Jeez … MD5 is unreliable even if it’s pretty fast. Not storing the hash is a bonus though, saving space.

Guilhem Bichot

Hello. The original problem, that two rows with a same NULL value are not considered equal (and thus are false positives in the first difference-finding query), is because the USING clause implicitely translates to equality conditions, which reject NULLs; it can be solved by using a clause with the operator; instead of:
sbtest1 a left join sbtest2 b using (k,c,pad) where b.id is null,
do
sbtest1 a left join sbtest2 b on a.kb.k and a.cb.c and a.padb.pad where b.id is null .
Guilhem (MySQL dev team)

Guilhem Bichot

Alas the formatting mangled operators in my query. It’s “on a.k OP b.k etc”, where OP is the NULL-safe equal described here:
https://dev.mysql.com/doc/refman/8.0/en/comparison-operators.html#operator_equal-to

Vishal Saxena

If table has primary keys then “mysqldump” can also be used with where clause to validate the data in chunks between source and destination.”md5sum” can be used on dumped files to compare the checksums. percona tool is also there but i guess that forces “statement” binary logs and may not work in “row level” replication.

Jouni Järvinen

NEVER use MD5 !