Paul Randal

Introduction to Latches

In some of my previous articles here on performance tuning, I’ve discussed multiple wait types and how they are indicative of various resource bottlenecks. I’m starting a new series on scenarios where a synchronization mechanism called a latch is a performance bottleneck, and specifically non-page latches. In this initial post I’m going to explain why latches are required, what they actually are, and how they can be a bottleneck.

Why Are Latches Needed?

It’s a basic tenet of computer science that whenever a data structure exists in a multi-threaded system, the data structure must be protected in some way. This protection gives the following provisos:

  1. (Guaranteed) A data structure cannot be changed by a thread while another thread is reading it
  2. (Guaranteed) A data structure cannot be read by a thread while another thread is changing it
  3. (Guaranteed) A data structure cannot be changed by two or more threads at the same time
  4. (Optional) Allow two or more threads to read the data structure at the same time
  5. (Optional) Allow threads to queue in ordered fashion for access to the data structure

This can be done in several ways, including:

  • A mechanism that only ever allows a single thread at a time to have any access to the data structure. SQL Server implements this mechanism and calls it a spinlock. This allows #1, #2, and #3 above.
  • A mechanism that allows multiple threads to read the data structure at the same time (i.e. they have shared access), allows a single thread to get exclusive access to the data structure (to the exclusion of all other threads), and implements a fair way of queueing for access. SQL Server implements this mechanism and calls it a latch. This allows all five of the provisos above.

So why does SQL Server use both spinlocks and latches? Some data structures are accessed so frequently that a latch is simply too expensive and so a very lightweight spinlock is used instead. Two examples of such data structures are the list of free buffers in the buffer pool and the list of locks in the lock manager.
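To make the "single thread at a time" mechanism concrete, here is a toy spinlock in Python. It is only a sketch of the general technique, not how SQL Server implements its spinlocks: the `SpinLock` class and `run_demo` function are hypothetical names, and a non-blocking `threading.Lock.acquire` stands in for the atomic test-and-set instruction a real spinlock would use.

```python
import threading


class SpinLock:
    """Toy spinlock: one owner at a time, no sharing, no ordered queue."""

    def __init__(self):
        # A non-blocking acquire on a Lock stands in for an atomic test-and-set.
        self._flag = threading.Lock()

    def acquire(self):
        while not self._flag.acquire(blocking=False):
            pass  # busy-wait ("spin") instead of being suspended

    def release(self):
        self._flag.release()


def run_demo(threads=4, increments=500):
    """Have several threads bump a shared counter under the spinlock."""
    lock = SpinLock()
    total = 0

    def worker():
        nonlocal total
        for _ in range(increments):
            lock.acquire()
            try:
                total += 1  # the protected "data structure"
            finally:
                lock.release()

    pool = [threading.Thread(target=worker) for _ in range(threads)]
    for t in pool:
        t.start()
    for t in pool:
        t.join()
    return total


final_count = run_demo()
```

Note that a waiting thread never queues and never sleeps here; it simply retries, which is why this kind of mechanism only suits data structures that are held very briefly.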

What is a Latch?

A latch is a synchronization mechanism that protects a single data structure and there are three broad types of latch in SQL Server:

  1. Latches protecting a data file page while it is being read from disk. These show up as PAGEIOLATCH_XX waits, which I discussed in an earlier post.
  2. Latches protecting access to a data file page that’s already in memory (an 8KB page in the buffer pool is really just a data structure). These show up as PAGELATCH_XX waits, which I also discussed in an earlier post.
  3. Latches protecting non-page data structures. These show up as LATCH_SH and LATCH_EX waits.

In this series, we’re going to be concentrating on the third kind of latches.

A latch is itself a small data structure and you can think of it as having three components:

  • A resource description (of what it’s protecting)
  • A status field indicating which modes the latch is currently held in, how many threads hold the latch in that mode, and whether there are any threads waiting (plus other stuff we don’t have to be concerned with)
  • A first-in-first-out queue of threads that are waiting for access to the data structure, and which modes of access they are waiting for (called the waiting queue)
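As a rough mental model, the three components could be written down as a small Python data class. This is a conceptual sketch only: the `Latch` class and its field names are hypothetical and do not reflect SQL Server's internal layout.

```python
from collections import deque
from dataclasses import dataclass, field
from typing import Deque, Optional, Tuple


@dataclass
class Latch:
    # 1. Resource description: what the latch protects.
    resource: str
    # 2. Status: current mode (None when free), holder count, waiter flag.
    held_mode: Optional[str] = None   # None, "SH", or "EX"
    holder_count: int = 0
    has_waiters: bool = False
    # 3. FIFO waiting queue of (thread name, requested mode) pairs.
    waiting_queue: Deque[Tuple[str, str]] = field(default_factory=deque)


# Example state: two readers hold the latch while a writer waits.
latch = Latch(resource="example structure", held_mode="SH", holder_count=2)
latch.waiting_queue.append(("T3", "EX"))
latch.has_waiters = True
```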

For non-page latches, we’ll confine ourselves to only considering the access modes SH (share) for reading the data structure and EX (exclusive) for changing the data structure. There are other more exotic modes, but they’re rarely used and won’t appear as contention points, so I’ll pretend they don’t exist for the rest of this discussion.

Some of you may know that there are also deeper complications around superlatches/sublatches and latch partitioning for scalability, but we don’t need to go to that depth for the purposes of this series.

Acquiring a Latch

When a thread wants to acquire a latch, it looks at the latch’s status.

If the thread wants to acquire the latch in EX mode, it can only do so if there are no threads holding the latch in any mode. If that is the case, the thread acquires the latch in EX mode and sets the status to indicate that. If there are one or more threads already holding the latch, the thread sets the status to indicate that there’s a waiting thread, enters itself at the bottom of the waiting queue, and then is suspended (on the waiter list of the scheduler it is on) waiting for LATCH_EX.

If the thread wants to acquire the latch in SH mode, it can only do so if no thread holds the latch or the only threads holding the latch are in SH mode *and* there are no threads waiting to acquire the latch. If that is the case, the thread acquires the latch in SH mode, sets the status to indicate that, and increments the count of threads holding the latch. If the latch is held in EX mode or there are one or more waiting threads, then the thread sets the status to indicate that there’s a waiting thread, enters itself at the bottom of the waiting queue, and then is suspended waiting for LATCH_SH.

The check for waiting threads is done to ensure fairness to a thread waiting for the latch in EX mode. That thread only has to wait for the threads that acquired the latch in SH mode before it started waiting. Without the check, a problem known in computer science as ‘starvation’ may occur, where a constant stream of threads acquiring the latch in SH mode prevents the EX-mode thread from ever being able to acquire the latch.
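The acquisition rules above can be sketched as a small state machine in Python. This is a simplified, single-threaded model for illustration only: the `Latch` class and `acquire` function are hypothetical, and "suspending" a thread is represented simply by appending it to the waiting queue.

```python
from collections import deque
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Latch:
    held_mode: Optional[str] = None   # None, "SH", or "EX"
    holder_count: int = 0
    waiting_queue: deque = field(default_factory=deque)  # FIFO of (thread, mode)


def acquire(latch: Latch, thread: str, mode: str) -> bool:
    """Return True if the latch is granted, False if the thread is queued."""
    if mode == "EX":
        # EX needs the latch to be completely free.
        grantable = latch.holder_count == 0
    else:
        # SH is allowed alongside other SH holders, but only if nobody is
        # already waiting; this is the fairness check.
        grantable = latch.held_mode != "EX" and not latch.waiting_queue
    if grantable:
        latch.held_mode = mode
        latch.holder_count += 1
        return True
    latch.waiting_queue.append((thread, mode))  # join the bottom of the queue
    return False


latch = Latch()
results = [
    acquire(latch, "T1", "SH"),  # free latch: granted
    acquire(latch, "T2", "SH"),  # SH alongside SH: granted
    acquire(latch, "T3", "EX"),  # held in SH: queued
    acquire(latch, "T4", "SH"),  # compatible, but T3 is waiting: queued
]
```

The last request is the interesting one: T4 is compatible with the current SH holders, but it still queues behind T3, which is exactly what stops a stream of readers from starving the writer.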

Releasing a Latch

If the thread holds the latch in EX mode, it unsets the status showing the latch is held in EX mode and then checks to see if there are any waiting threads.

If the thread holds the latch in SH mode, it decrements the count of SH-mode threads. If the count is now non-zero, the releasing thread is done with the latch. If the count *is* now zero, it unsets the status showing the latch is held in SH mode and then checks to see if there are any waiting threads.

If there are no threads waiting, the releasing thread is done with the latch.

If the head of the waiting queue is waiting for EX mode, the releasing thread does the following:

  • Sets the status to show the latch is held in EX mode
  • Removes the waiting thread from the head of queue and sets it as the owner of the latch
  • Signals the waiting thread that it is the owner and is now runnable (by, conceptually, moving the waiting thread from the waiter list on its scheduler to the runnable queue on the scheduler)
  • And it’s done with the latch

If the head of the waiting queue is waiting for SH mode (which can only be the case if the releasing thread was in EX mode), the releasing thread does the following:

  • Sets the status to show the latch is held in SH mode
  • For all threads in the waiting queue that are waiting for SH mode
    • Removes the waiting thread from the head of queue
    • Increments the count of threads holding the latch
    • Signals the waiting thread that it is an owner and is now runnable
  • And it’s done with the latch
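The release steps can be modelled the same way. Again, this is only a sketch under stated assumptions: the `Latch` class and `release` function are hypothetical, signalling a thread is represented by returning its name, and the "for all threads waiting for SH mode" step is read literally, so every SH waiter in the queue is granted the latch, including any queued behind an EX waiter. That reading is an assumption rather than something the description settles.

```python
from collections import deque
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Latch:
    held_mode: Optional[str] = None   # None, "SH", or "EX"
    holder_count: int = 0
    waiting_queue: deque = field(default_factory=deque)  # FIFO of (thread, mode)


def release(latch: Latch) -> List[str]:
    """Release one hold and return the names of any threads made runnable."""
    latch.holder_count -= 1
    if latch.holder_count > 0:
        return []                      # other SH holders remain
    latch.held_mode = None
    if not latch.waiting_queue:
        return []                      # nobody waiting: done with the latch
    if latch.waiting_queue[0][1] == "EX":
        # Hand ownership directly to the EX waiter at the head of the queue.
        thread, _ = latch.waiting_queue.popleft()
        latch.held_mode, latch.holder_count = "EX", 1
        return [thread]
    # Head wants SH: grant to every SH waiter, keep EX waiters queued in order.
    woken = [t for t, m in latch.waiting_queue if m == "SH"]
    latch.waiting_queue = deque(
        (t, m) for t, m in latch.waiting_queue if m == "EX"
    )
    latch.held_mode, latch.holder_count = "SH", len(woken)
    return woken


latch = Latch("EX", 1, deque([("T2", "SH"), ("T3", "EX"), ("T4", "SH")]))
first = release(latch)    # EX holder leaves; SH waiters T2 and T4 are woken
second = release(latch)   # one SH holder leaves; the other still holds it
third = release(latch)    # last SH holder leaves; T3 gets the latch in EX mode
```

Notice that the releasing thread sets the new status itself before signalling, so the latch is never observed as free while a waiter is queued.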

How Can Latches Be a Contention Point?

Unlike locks, latches are generally only held for the duration of the read or change operation, so they’re pretty lightweight. However, because SH and EX modes are incompatible, latches can be just as big a contention point as locks. This can happen when many threads are all trying to acquire a latch in EX mode (only one at a time can) or when many threads are trying to acquire a latch in SH mode and another thread holds the latch in EX mode.
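The two contention scenarios come straight from the mode compatibility rules, which a few lines of Python can illustrate. The `blocked_requests` helper is hypothetical and deliberately ignores the waiting queue; it only counts how many incoming requests are incompatible with the mode the latch is currently held in.

```python
def blocked_requests(held_mode, requested_modes):
    """Count requests that cannot be granted while held_mode is in place."""
    def compatible(held, wanted):
        # Only a free latch, or SH alongside SH, allows an immediate grant.
        return held is None or (held == "SH" and wanted == "SH")

    return sum(
        1 for wanted in requested_modes if not compatible(held_mode, wanted)
    )


writers_vs_writer = blocked_requests("EX", ["EX"] * 8)  # many EX requests
readers_vs_writer = blocked_requests("EX", ["SH"] * 8)  # SH requests behind EX
readers_vs_reader = blocked_requests("SH", ["SH"] * 8)  # SH requests behind SH
```

In the first two cases every one of the eight requests has to wait, while in the third none do, which is why read-mostly access to a latch rarely shows up as a bottleneck.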

Summary

The more threads in the system that are contending for a ‘hot’ latch, the higher the contention and the more negative the effect on workload throughput will be. You’ve likely heard of well-known latch contention issues, for instance around tempdb allocation bitmaps, but contention can also happen for non-page latches.

Now that I’ve given you enough background to understand latches and how they work, in the next few articles I’ll examine some real-life non-page latch contention issues and explain how to prevent or work around them.