For a long time, MyDumper has been the fastest tool to take Logical Backups. We have been adding several features to expand the use cases. Masquerade was one of these features, but it was only for integer and UUID values. In this blog post, I’m going to present a new functionality that is available in MyDumper and will be available in the next release: we added the possibility to build random data based on a format that the user defines.

How does it work?

During export, mydumper sends SELECT statements to the database. Each row is written one by one as an INSERT statement. Something important that you might not know, is that each column of a row can be transformed by a function. When you execute a backup, the default function is the identity function, as nothing needs to be changed. The function, which can be configured inside the defaults file, will change the content of the column before writing the row into disk.

How can we select the column to masquerade?

I think that the most valuable element of this feature is the simplicity to define which column will be modified and how you want to mask it.  The format is:

In the section name, you add the schema and table name surrounded by backticks and separated by a dot. Then, each key-value entry will keep in the key the column name surrounded by backticks, and the value will be the masking function definition.

New random format function

Having string, integer, and UUID is nice to have, but what about build dynamic data with a specific format? As we want more realistic data, we want to build dynamically world wide addresses, phone numbers, emails, etc. The new function has this syntax:

This are some examples:

Performance considerations

You should expect performance degradation if you compare masquerade backups and regular backups. It is impossible to measure the impact as it will depend on the amount of data that needs to be masked. However, I tried to give you an idea through an example over a sysbench table of 10M rows.

Baseline backup

We are going to split by rows and compress with ZSTD:

It took near 19.9 seconds to complete, and here is an example of the output:

One integer column

We are going to use random_int over the k column, which in the configuration will be:

The backup took 20.7 seconds, an increase of 4%:

And as you can see, the data in the second column has changed:

random_format with <number 11>

Now, we are going to use the last column (pad) and the number tag with 11 digits to simulate the values:

We can see that it took 36.6 seconds to complete, and the values in the latest column have changed:

Take into consideration that 11 digits forced us to execute two times g_random_int, this means that if we have:

It will take 29 seconds.

random_format with <file> with 100 lines file

In this case, the configuration will be:

And it will take 34 seconds:

Warning

This is not a fully tested feature in MyDumper; you should consider it as Beta. However, I found it relevant to show the potential that it might have for the community.

Conclusion

Never has it been as easy to build a new masquerade environment as we can do now with MyDumper.

Percona Distribution for MySQL is the most complete, stable, scalable, and secure open-source MySQL solution available, delivering enterprise-grade database environments for your most critical business applications… and it’s free to use!

 

Try Percona Distribution for MySQL today!

Subscribe
Notify of
guest

0 Comments
Inline Feedbacks
View all comments