In this blog post, we will dive deep into WiredTiger’s Logging and Checkpoint mechanism.

Every database system has to ensure durability and reliability. MongoDB implements classic Write-Ahead Logging (WAL) through journaling and checkpoints.

Starting with the basics, why is WAL needed in the first place? It ensures that our data is durable after each write operation and remains persistent and consistent without compromising performance.

MongoDB achieves WAL and data durability using a combination of journaling and checkpoints. Let’s understand both of them.

1. Journal

Journaling is the process by which every write operation is appended from memory to a journal file, AKA transaction log, that exists on disk, at a specific interval configured using “journalCommitIntervalMs.”

This ensures durability: in case of crashes, power outages, or hardware failures between checkpoints (see below), lost data can be recovered from these journal files.
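This commit interval can be tuned through the storage configuration. Below is a minimal mongod.conf sketch with illustrative values; treat the exact option name and valid range as an assumption that may differ between MongoDB versions.

    # mongod.conf (excerpt) -- values are illustrative only
    storage:
      dbPath: /data/db
      journal:
        commitIntervalMs: 100   # journal writes are flushed to disk at this interval (default 100 ms)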

Here’s what the process looks like.

  • For each write operation, MongoDB writes the changes into journal files, AKA transaction log files; this is essentially the WAL mechanism used by MongoDB, as discussed above. This happens every journalCommitIntervalMs.
  • The same data, in the form of pages inside the WiredTiger cache, is also marked dirty.

The journal files can be exported and inspected using the WiredTiger binary (wt).
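A minimal sketch of the export command is shown below. The dbpath is an example, and the connection options assume MongoDB’s defaults (journal files live in the journal/ subdirectory and are compressed with snappy); run it only against a stopped mongod or a copy of the data files.

    # Dump the journal records (run against a stopped mongod or a copy of the dbpath)
    wt -h /data/db -C "log=(compressor=snappy,path=journal)" printlog > journal_dump.txt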

The important parts of each record are the bytes and the offset, which contain any data modifications that happened.

2. Checkpoint

The role of a checkpoint in durability and consistency is equally important. A checkpoint is equivalent to a log that records the changes to the related data files since the last checkpoint.

Each checkpoint consists of a root page, three lists of pages pointing to specific locations on the disk, and the file size on the disk.

At every checkpoint interval (default: 60 seconds), MongoDB flushes the modified pages that are marked as dirty in the cache to their respective data files (both collection-*.wt and index-*.wt).
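For reference, this interval is controlled by the storage.syncPeriodSecs setting, which is generally best left at its default. Below is a minimal mongod.conf sketch with illustrative values; the exact option name is an assumption that may vary by version.

    # mongod.conf (excerpt) -- values are illustrative only
    storage:
      dbPath: /data/db
      syncPeriodSecs: 60   # flush/checkpoint interval in seconds (default 60)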

Using the same “wt” utility, we can list the checkpoints and view the information they contain. The checkpoint information shown below is stored with respect to each data file (collection and index). These checkpoints are stored in the WiredTiger.wt file.
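A minimal sketch of the command, assuming a hypothetical collection file name and an example dbpath; as before, run it only against a stopped mongod or a copy of the data files.

    # List checkpoint metadata (-c) for one data file (the file name is hypothetical)
    wt -h /data/db list -c file:collection-0-1234567890.wt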

[Diagram: information contained in a WiredTiger checkpoint]

The diagram above shows the information present in a checkpoint; the same fields appear in the output when a data file’s checkpoints are listed with the “wt” utility.

This key information resides inside each checkpoint and consists of the following:

  • Root page:
    • Contains the size (size) of the root page, the position in the file (offset), and the checksum (checksum). When a checkpoint is created, a new root page will be generated.
  • Internal page:
    • Only carries the keys. WiredTiger traverses through internal pages to look for the respective leaf page (see the sketch after this list).
  • Leaf page:
    • Contains the actual key:value pairs.
  • Allocated list pages: 
    • After the most recent checkpoint, the WiredTiger block manager keeps a record of newly allocated pages and their information, such as size, offset, and checksum.
  • Discarded list pages: 
    • Upon completion of the last checkpoint, the associated pages will be discarded; however, key information such as the size, offset, and checksum of each discarded page will be stored.
  • Available list pages: 
    • Pages that have been allocated by the WiredTiger block manager but not yet used when this checkpoint is executed. When a previously created checkpoint is deleted, the available pages attached to it are merged into the available list of the latest checkpoint, and the size, offset, and checksum of each available page are recorded.
  • File size:
    • Information about the size of a data file on disk upon completion of a checkpoint.
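To make the internal page vs. leaf page distinction above more concrete, here is a conceptual sketch (not WiredTiger’s actual implementation) of how a lookup walks internal pages, which carry only keys, down to a leaf page, which carries the key:value pairs:

    // Conceptual sketch only -- not WiredTiger's actual code.
    function search(page, key) {
      if (page.type === "leaf") {
        return page.entries[key];               // leaf pages hold the actual key:value pairs
      }
      // Internal pages hold only keys: pick the child whose range covers the key.
      let child = page.children[0];
      for (let i = 0; i < page.keys.length; i++) {
        if (key >= page.keys[i]) child = page.children[i + 1];
      }
      return search(child, key);
    }

    // Tiny illustrative tree: one internal (root) page and two leaf pages.
    const root = {
      type: "internal",
      keys: ["m"],
      children: [
        { type: "leaf", entries: { a: 1, c: 2 } },
        { type: "leaf", entries: { m: 3, z: 4 } },
      ],
    };
    print(search(root, "z"));   // -> 4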

Although both processes involve the disk and might look similar, they serve different purposes. The journal, on the one hand, is an append-only operation to a journal file, AKA transaction log file, present on disk. Checkpoints, on the other hand, deal with persisting the data to the respective data files, which does involve a lot of overhead due to the complexity involved, especially random disk operations and reconciliation.

Generally, a checkpoint is triggered:

  • Every 60 seconds (the default), unless there’s a large amount of data that needs to be written, which creates a backlog due to I/O bottlenecks.
  • When the amount of dirty data in the cache reaches eviction_dirty_target (default 5%) or eviction_dirty_trigger (default 20%). However, this is not normal and only happens when there’s more write activity than the hardware can handle (a rough way to check this is shown below).
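One rough way to see how close the cache is to those dirty thresholds is serverStatus in the mongo shell. The statistic names below are assumptions based on common WiredTiger output and may vary slightly between versions.

    // Approximate dirty-cache ratio from serverStatus (stat names may vary by version)
    const cache = db.serverStatus().wiredTiger.cache;
    const dirtyBytes = cache["tracked dirty bytes in the cache"];
    const maxBytes = cache["maximum bytes configured"];
    print("dirty cache: " + (100 * dirtyBytes / maxBytes).toFixed(2) + "%");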

So, what happens when there’s an unexpected crash or hardware failure? Let’s take a look at the process when we start mongod.

  1. mongod attempts to go into crash recovery and looks for anything present in the journal files.

The corresponding messages appear (trimmed) in the mongod log files.

  2. It identifies the last successful checkpoint from the data files and recovers the uncommitted dirty data from the journal files back into the WiredTiger cache. The same pages will then again be marked as dirty.


  3. The data in these dirty pages will then again be ready to be flushed out to their respective data files on disk during the next checkpoint. This is handled by the WiredTiger block manager. Unwanted journal entries will then be cleaned up accordingly after the checkpoint executes.

Voila!! We now have a durable and consistent data state even after a crash.

