Bloom filters are an essential component of an LSM-based database engine like MyRocks. This post will illustrate through a simple example how bloom filters work in MyRocks.

Why?

With MyRocks/RocksDB, data is stored in a set of large SST files.  When MyRocks needs to find the value associated with a given key, it uses a bloom filter to guess if the key could potentially be in an SST file.

How?

A bloom filter is a space-efficient way of storing information about a list of keys. At its base, there is a bitmap and a hash function.  The hash value of the keys stored in an SST is computed, and the results are used to set some bits to “1” in the bitmap.  When you want to know if a key is present or not in the list, you run it through the hash function and check if the corresponding bits in the bitmap are “1” or “0”. If one of the bits is “0”, you know for sure the key is not in the list. If all the bits are “1”, the value may be present.  The probability of a false positive depends only on a few factors:

  • Size of the bitmap
  • Number of keys in the list
  • Number of bits set to “1” for each key value

An example

Let’s assume we are creating a list of names and want a bloom filter on that list. We’ll use an 8 bytes bloom filter (64 bits), and we’ll set two bits according to this function:

So the value ‘Yves’ sets the bits 15 and 39 to ‘1’.

Now, if I add 10 names (taken from the Percona slack channel “architecture_projects”) we have:

  • Alexey: 29 53
  • Alick: 8 49
  • Bennie: 31 63
  • Brenda: 20 54
  • Brett: 12 61
  • Carlos: 3 46
  • Corrado: 27 34
  • David: 15 57
  • Deb: 13 38
  • Eliana: 11 39

The bitmap now looks like:

Now, if I want to use the bitmap to “guess” if “Bennie” is in the list, that will work just fine, bits 31 and 63 are “1”. But… “Yves” is not in the list even though bits 15 (from David) and 39 (from Eliana) are set to “1”. A query for “Yves” would be a false positive.

On the other hand, let’s consider “Francisco” for which the bits are 27 and 60.  Bit 27 is set to “1” by the “Corrado” value, but bit 60 is “0”.  We thus know for sure that “Francisco” is not in the list.

Tuning

In terms of tuning, two parameters can be tuned, the size of the bitmap and the number of bits set by every value. The bitmap size must be large enough so that the proportion of bits set to “1” is not overwhelming. In the above example, if you add all the first names of the people at Percona you’ll end up with only “1”.  That would be obviously useless.  To avoid this issue, we could instead use a 1 GB bitmap. The number of false positives would be low, but 1GB is a lot of memory. In most cases a bitmap of a few hundred bytes is sufficient.

With MyRocks, you can control the size of the bloom filter bitmap with the column family parameter memtable_prefix_bloom_size_ratio. For good performance, the filter blocks are cached in the RocksDB block cache and normally stay there since they are accessed frequently.

The filter_policy variable controls the number of bits per key in the column family settings. By default, 10 bits are used per key which yields less than 1% of false positives.

Conclusion

I hope that this post has provided you with a basic understanding of bloom filters and how they are implemented in MyRocks. LSM storage engines like MyRocks are very different from the more common B-Tree-based storage engines like InnoDB. Bloom filter is an important feature improving performance of LSM storage engines. A good understanding of how bloom filters work can only lead to improved database performance.

Percona Distribution for MySQL is the most complete, stable, scalable, and secure open-source MySQL solution available, delivering enterprise-grade database environments for your most critical business applications… and it’s free to use!

Download Percona Distribution for MySQL Today

Subscribe
Notify of
guest

0 Comments
Inline Feedbacks
View all comments