Finding Table Differences on Nullable Columns Using MySQL Generated Columns

Some time ago, a customer had a performance issue with an internal process that compared two tables, found the rows that differed, and reported them. This is simple if you use a LEFT JOIN and an IS NULL comparison over the second table in the WHERE clause, but what if the columns being compared can be NULL? That is why he used UNION, GROUP BY, and HAVING clauses, which resulted in poor performance.

The challenge was to be able to compare each row using a LEFT JOIN over NULL values.

The challenge in more detail

I’m not going to use the customer’s real table. Instead, I will be comparing two sysbench tables with the same structure:

CREATE TABLE `sbtest1` (
  `id` int(10) unsigned NOT NULL AUTO_INCREMENT,
  `k` int(10) unsigned DEFAULT NULL,
  `c` char(120) DEFAULT NULL,
  `pad` char(60) DEFAULT NULL,
  PRIMARY KEY (`id`),
  KEY `k` (`k`,`c`,`pad`)
) ENGINE=InnoDB

It is slightly different from the original sysbench schema, as this version can hold NULL values. Both tables have the same number of rows. We are going to set k to NULL in one row of each table:

update sbtest1 set k=null limit 1;
update sbtest2 set k=null limit 1;

If we execute the comparison query, we get this result:

mysql> select "sbtest1",a.* from 
    -> sbtest1 a left join 
    -> sbtest2 b using (k,c,pad) 
    -> where b.id is null union 
    -> select "sbtest2",a.* from 
    -> sbtest2 a left join 
    -> sbtest1 b using (k,c,pad)
    -> where b.id is null;
+---------+------+------+-------------------------------------------------------------------------------------------------------------------------+-------------------------------------------------------------+
| sbtest1 | id   | k    | c                                                                                                                       | pad                                                         |
+---------+------+------+-------------------------------------------------------------------------------------------------------------------------+-------------------------------------------------------------+
| sbtest1 | 4462 | NULL | 64568100364-99474573987-46567807085-85185678273-10829479379-85901445105-43623848418-63872374080-59257878609-82802454375 | 07052127207-33716235481-22978181904-76695680520-07986095803 |
| sbtest2 | 4462 | NULL | 64568100364-99474573987-46567807085-85185678273-10829479379-85901445105-43623848418-63872374080-59257878609-82802454375 | 07052127207-33716235481-22978181904-76695680520-07986095803 |
+---------+------+------+-------------------------------------------------------------------------------------------------------------------------+-------------------------------------------------------------+
2 rows in set (3.00 sec)

As you can see, column k is NULL in both rows. The join failed to match them and reported the two rows as different, even though they are identical. This behavior is not new in MySQL, but it would be nice to have a way to sort this issue out.
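
The false positives come from the way SQL compares NULLs: USING (k,c,pad) is evaluated as a plain equality on each column, and an equality with NULL on either side yields NULL rather than true. A quick check that can be run in any MySQL session, without touching the tables, illustrates the difference between plain equality and MySQL's NULL-safe equal operator:

```sql
-- Plain equality returns NULL when either operand is NULL, so a join
-- condition built on it never evaluates to true for rows where k IS NULL.
SELECT NULL = NULL   AS plain_equal,      -- NULL
       1 = NULL      AS one_equals_null,  -- NULL
       NULL <=> NULL AS null_safe_equal,  -- 1
       1 <=> NULL    AS one_null_safe;    -- 0
```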

Solution

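The solution is based on GENERATED COLUMNS: a hash function (md5) is applied to the compared columns, and the result is exposed as a VIRTUAL binary(16) column. NULLs are replaced with a placeholder string before hashing, so two rows with NULL in the same column produce the same hash: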

ALTER TABLE sbtest1 
ADD COLUMN `real_id` binary(16) GENERATED ALWAYS AS (unhex(md5(concat(ifnull(`k`,'NULL'),ifnull(`c`,'NULL'),ifnull(`pad`,'NULL'))))) VIRTUAL, 
ADD INDEX (real_id);
ALTER TABLE sbtest2 
ADD COLUMN `real_id` binary(16) GENERATED ALWAYS AS (unhex(md5(concat(ifnull(`k`,'NULL'),ifnull(`c`,'NULL'),ifnull(`pad`,'NULL'))))) VIRTUAL,
ADD INDEX (real_id);
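
One thing to watch with this expression: concat() joins the three values without a separator, so different rows can produce the same input string (for example, k=1 with c starting with '23' versus k=12 with c starting with '3'), and a real NULL cannot be told apart from the literal string 'NULL'. Whether this matters depends on your data. The following sketch is not part of the original solution; it only illustrates the collision and one possible mitigation with CONCAT_WS. The '|' separator is an assumption chosen for illustration, and it only helps if that character cannot appear in the data.

```sql
-- Without a separator, ('1','23') and ('12','3') concatenate to the same
-- string, so their MD5 hashes are identical.
SELECT MD5(CONCAT('1', '23')) = MD5(CONCAT('12', '3')) AS same_without_separator;  -- 1

-- With a separator (assumed to be '|' here), the inputs become '1|23' and
-- '12|3', which differ, and so do the hashes.
SELECT MD5(CONCAT_WS('|', '1', '23')) = MD5(CONCAT_WS('|', '12', '3')) AS same_with_separator;  -- 0
```

If a separator fits your data, the same change would presumably apply inside the generated column expression, by using concat_ws('|', ...) around the same ifnull() arguments instead of concat(...).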

Adding the index is also part of the solution. Now, let’s execute the query using the new column to join the tables:

mysql> select "sbtest1",a.k,a.c,a.pad from
    -> sbtest1 a left join
    -> sbtest2 b using (real_id)
    -> where b.id is null union
    -> select "sbtest2",a.k,a.c,a.pad from
    -> sbtest2 a left join
    -> sbtest1 b using (real_id)
    -> where b.id is null;
Empty set (2.31 sec)

We can see an improvement in query performance (it now takes 2.31 sec, whereas before it took 3.00 sec), and the result is as expected. We could say that that’s all, and that no further improvement is possible. However, that is not true. Even though the query is running faster, it is possible to optimize it in this way:

mysql> select "sbtest1",a.k,a.c,a.pad
    -> from sbtest1 a
    -> where a.id in (select a.id
    ->   from sbtest1 a left join
    ->   sbtest2 b using (real_id)
    ->   where b.id is null) union
    -> select "sbtest2",a.k,a.c,a.pad
    -> from sbtest2 a
    -> where a.id in (select a.id
    ->   from sbtest2 a left join
    ->   sbtest1 b using (real_id)
    ->   where b.id is null);
Empty set (1.60 sec)

Why is this faster? The first query is a UNION of two SELECT statements that are very similar to each other. Let’s check the explain plan for one of them:

mysql> explain select "sbtest1",a.k,a.c,a.pad from
    -> sbtest1 a left join
    -> sbtest2 b using (real_id)
    -> where b.id is null;
+----+-------------+-------+------------+------+---------------+---------+---------+------------------+--------+----------+--------------------------------------+
| id | select_type | table | partitions | type | possible_keys | key     | key_len | ref              | rows   | filtered | Extra                                |
+----+-------------+-------+------------+------+---------------+---------+---------+------------------+--------+----------+--------------------------------------+
|  1 | SIMPLE      | a     | NULL       | ALL  | NULL          | NULL    | NULL    | NULL             | 315369 |   100.00 | NULL                                 |
|  1 | SIMPLE      | b     | NULL       | ref  | real_id       | real_id | 17      | sbtest.a.real_id |     27 |    10.00 | Using where; Not exists; Using index |
+----+-------------+-------+------------+------+---------------+---------+---------+------------------+--------+----------+--------------------------------------+

As you can see, it is performing a full table scan over the first table and using real_id to join the second table. Because real_id is a virtual generated column, its value is not stored in the table row, so the function has to be executed for every scanned row to get the value used in the join. That takes time.

If we analyze the subquery of the second query:

mysql> explain select "sbtest1",a.k,a.c,a.pad
    -> from sbtest1 a
    -> where a.id in (select a.id
    ->   from sbtest1 a left join
    ->   sbtest2 b using (real_id)
    ->   where b.id is null);
+----+--------------+-------------+------------+--------+---------------+------------+---------+------------------+--------+----------+--------------------------------------+
| id | select_type  | table       | partitions | type   | possible_keys | key        | key_len | ref              | rows   | filtered | Extra                                |
+----+--------------+-------------+------------+--------+---------------+------------+---------+------------------+--------+----------+--------------------------------------+
|  1 | SIMPLE       | a           | NULL       | index  | PRIMARY       | k          | 187     | NULL             | 315369 |   100.00 | Using where; Using index             |
|  1 | SIMPLE       | <subquery2> | NULL       | eq_ref | <auto_key>    | <auto_key> | 4       | sbtest.a.id      |      1 |   100.00 | NULL                                 |
|  2 | MATERIALIZED | a           | NULL       | index  | PRIMARY       | real_id    | 17      | NULL             | 315369 |   100.00 | Using index                          |
|  2 | MATERIALIZED | b           | NULL       | ref    | real_id       | real_id    | 17      | sbtest.a.real_id |     27 |    10.00 | Using where; Not exists; Using index |
+----+--------------+-------------+------------+--------+---------------+------------+---------+------------------+--------+----------+--------------------------------------+

Here we can see that the materialized subquery is performing a full index scan over the first table (key real_id, Using index), so the hash values are read from the index and the generated column expression is never executed. That is how we went from an inconsistent result in 3.00 seconds, to a consistent result in 2.31 seconds, and finally to a performant query that runs in 1.60 seconds.
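
An empty result only shows that the false positives are gone. To check that genuine differences are still reported, one option is to change a single row in one table and rerun the comparison. This is a sketch under assumptions: the id value 1 and the replacement string are arbitrary, and it presumes that row exists in both tables. The transaction is rolled back at the end so the test leaves the data untouched.

```sql
START TRANSACTION;

-- Introduce one real difference in sbtest2 (id 1 and the value are arbitrary).
UPDATE sbtest2 SET pad = 'changed-for-test' WHERE id = 1;

-- Rerun the optimized comparison. The modified row should normally be
-- reported from both tables, unless another row with identical k, c and pad
-- values happens to exist on the other side.
SELECT 'sbtest1', a.k, a.c, a.pad
FROM sbtest1 a
WHERE a.id IN (SELECT a.id
               FROM sbtest1 a LEFT JOIN sbtest2 b USING (real_id)
               WHERE b.id IS NULL)
UNION
SELECT 'sbtest2', a.k, a.c, a.pad
FROM sbtest2 a
WHERE a.id IN (SELECT a.id
               FROM sbtest2 a LEFT JOIN sbtest1 b USING (real_id)
               WHERE b.id IS NULL);

-- Undo the test change.
ROLLBACK;
```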

Conclusions

This is not the first blog post that I’ve done about generated columns. I think it is a useful feature for several scenarios where you need to improve performance. In this particular case, it also provides a workaround for the expected inconsistencies of LEFT JOINs over NULL values. It is also important to mention that this improved a process in a real-world scenario.

7 Comments

Jouni Järvinen

5 years ago

Jeez … MD5 is unreliable even if it’s pretty fast. Not storing the hash is a bonus though, saving space.

David Ducos

Author

Reply to Jouni Järvinen

5 years ago

Hi Jouni, thank you for your comment! I chose MD5 for no particular reason, as there is a second validation later in the customer’s process. Feel free to choose whichever hash function best suits your needs.
Cheers

Guilhem Bichot

5 years ago

Hello. The original problem, that two rows with the same NULL value are not considered equal (and thus are false positives in the first difference-finding query), is because the USING clause implicitly translates to equality conditions, which reject NULLs; it can be solved by using a clause with the NULL-safe equal operator (<=>); instead of:
sbtest1 a left join sbtest2 b using (k,c,pad) where b.id is null,
do
sbtest1 a left join sbtest2 b on a.kb.k and a.cb.c and a.padb.pad where b.id is null .
Guilhem (MySQL dev team)

Guilhem Bichot

Reply to Guilhem Bichot

5 years ago

Alas the formatting mangled operators in my query. It’s “on a.k OP b.k etc”, where OP is the NULL-safe equal described here:
https://dev.mysql.com/doc/refman/8.0/en/comparison-operators.html#operator_equal-to
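
With the operators restored, the join described in this comment would presumably look like the following. It is reconstructed from the description above (OP replaced by <=>), not copied from the original comment, and it has not been benchmarked here.

```sql
-- NULL-safe comparison: <=> treats two NULLs as equal, so rows whose k is
-- NULL in both tables are matched instead of being reported as different.
SELECT 'sbtest1', a.*
FROM sbtest1 a
LEFT JOIN sbtest2 b
       ON a.k <=> b.k AND a.c <=> b.c AND a.pad <=> b.pad
WHERE b.id IS NULL
UNION
SELECT 'sbtest2', a.*
FROM sbtest2 a
LEFT JOIN sbtest1 b
       ON a.k <=> b.k AND a.c <=> b.c AND a.pad <=> b.pad
WHERE b.id IS NULL;
```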

David Ducos

Author

Reply to Guilhem Bichot

5 years ago

Hi Guilhem, you are right, it is possible to use it. To be honest, I didn’t test it before publishing this post. However, I must say that, in my preliminary test, the performance didn’t increase as significantly as it did with the generated columns, which was one of the goals. I will do some more research and update the post. We could discuss the performance on inserts or other performance penalties that the use of generated columns could have, but my idea is to come up with out-of-the-box solutions, which is one of the reasons why I like this feature so much.

Vishal Saxena

5 years ago

If the table has primary keys, then “mysqldump” can also be used with a where clause to validate the data in chunks between source and destination. “md5sum” can be used on the dumped files to compare the checksums. A Percona tool is also available, but I guess that forces “statement” binary logs and may not work with “row level” replication.

Jouni Järvinen

Reply to Vishal Saxena

5 years ago

NEVER use MD5 !
