Paul White

How Parallel Plans Start Up – Part 4

SentryOne eBooks

In these books, you will find useful, hand-picked articles that will help give insight into some of your most vexing performance problems. These articles were written by several of the SQL Server industry’s leading experts, including Paul White, Paul Randal, Jonathan Kehayias, Erin Stellato, Glenn Berry, Aaron Bertrand, and Joe Sack.

Free Download

Featured Author

Paul White is an independent SQL Server consultant specializing in performance tuning, execution plans, and the query optimizer.

Paul’s Posts

This is the fourth part in a five-part series taking a deep dive into the way SQL Server row mode parallel plans start executing. Part 1 initialized execution context zero for the parent task, and part 2 created the query scan tree. Part 3 started the query scan, performed some early phase processing, and started the first additional parallel tasks in branch C.

Branch C execution details

This is the second step of the execution sequence:

  1. Branch A (parent task).
  2. Branch C (additional parallel tasks).
  3. Branch D (additional parallel tasks).
  4. Branch B (additional parallel tasks).

A reminder of the branches in our parallel plan (click to enlarge)

Parallel branches

A short time after the new tasks for branch C are queued, SQL Server attaches a worker to each task, and places the worker on a scheduler ready for execution. Each new task runs inside a new execution context. At DOP 2, there are two new tasks, two worker threads, and two execution contexts for branch C. Each task runs its own copy of the iterators in branch C on its own worker thread:

Branch C

The two new parallel tasks start running at a sub-procedure entry point, which initially leads to an Open call on the producer side of the exchange (CQScanXProducerNew::Open). Both tasks have identical call stacks at the start of their lives:

Parallel task initial call stack

Exchange synchronization

Meanwhile, the parent task (running on its own worker thread) registers the new sub-processes with the sub-process manager, then waits at the consumer side of the repartition streams exchange at node 5. The parent task waits on CXPACKET* until all of the branch C parallel tasks complete their Open calls and return to the producer side of the exchange. The parallel tasks will open every iterator in their subtree (i.e. down to the index seek at node 9 and back) before returning to the repartition streams exchange at node 5. The parent task will wait on CXPACKET while this happens. Remember the parent task is executing early phases calls.

We can see this wait in the waiting tasks DMV:

Waiting tasks

Execution context zero (the parent task) is blocked by both of the new execution contexts. These execution contexts are the first additional ones to be created after context zero, so they are assigned the numbers one and two. To emphasise: Both new execution contexts need to open their subtrees and return to the exchange for the parent task’s CXPACKET wait to end.

You might have been expecting to see CXCONSUMER waits here, but that wait is reserved for waiting on row data to arrive. The current wait is not for rows — it is for the producer side to open, so we get a generic CXPACKET* wait.

* Azure SQL Database and Managed Instance use the new CXSYNC_PORT wait instead of CXPACKET here, but that improvement hasn’t made its way into SQL Server yet (as of 2019 CU9).

Inspecting the new parallel tasks

We can see the new tasks in the query profiles DMV. Profiling information for the new tasks appears in the DMV because their execution contexts were derived (cloned, then updated) from the parent (execution context zero):

New tasks

There are now three entries for each iterator in Branch C (highlighted). One for the parent task (execution context zero), and one for each new additional parallel task (contexts 1 and 2). Notice that the per-thread estimated row counts ( see part 1) have arrived now, and are shown only for the parallel tasks. The first and last active times for the parallel tasks represent the time their execution contexts were created. None of the new tasks has opened any iterators yet.

The repartition streams exchange at node 5 still only has a single entry in the DMV output. This is because the associated invisible profiler monitors the consumer side of the exchange. The additional parallel tasks are on the producer side of the exchange. The consumer side of node 5 will eventually have parallel tasks, but we haven’t got to that point yet.

Checkpoint

This seems like a good point to pause for breath and summarize where everything is at the moment. There will be more of these stopping points as we go along.

  • The parent task is on the consumer side of the repartition streams exchange at node 5, waiting on CXPACKET. It is in the middle of executing early phases calls. It paused to start up Branch C because that branch contains a blocking sort. The parent task’s wait will continue until both parallel tasks complete opening their subtrees.
  • Two new parallel tasks on the producer side of the node 5 exchange are ready to open the iterators in Branch C.

Nothing outside Branch C of this parallel execution plan can make forward progress until the parent task is released from its CXPACKET wait. Remember we have only created one set of additional parallel workers so far, for Branch C. The only other thread is the parent task, and that is blocked.

Branch C Parallel Execution

Branch C

The two parallel tasks start at the producer side of the repartition streams exchange at node 5. Each has a separate (serial) plan with its own stream aggregate, sort, and index seek. The compute scalar does not appear in the runtime plan because its calculations are deferred to the sort.

Each instance of the index seek is parallel-aware and operates on disjoint sets of rows. These sets are generated on demand from the parent rowset created earlier by the parent task (covered in part 1). When either instance of the seek needs a new subrange of rows, it synchronizes with the other worker threads, so that only one is allocating a new subrange at the same time. The synchronization object used was also created earlier by the parent task. When a task waits for exclusive access to the parent rowset to acquire a new subrange, it waits on CXROWSET_SYNC.

Branch C tasks open

The sequence of Open calls for each task in Branch C is:

  • CQScanXProducerNew::Open. Notice there is no preceding profiler on the producer side of an exchange. This is unfortunate for query tuners.
  • CXTransLocal::Open
  • CXPort::Register
  • CXTransLocal::ActivateWorkers
  • CQScanProfileNew::Open. The profiler above node 6.
  • CQScanStreamAggregateNew::Open (node 6)
  • CQScanProfileNew::Open. The profiler above node 7.
  • CQScanSortNew::Open (node 7)

The sort is a fully blocking operator. It consumes its entire input during its Open call. There are a great number of interesting internal details to explore here, but space is short, so I will only cover the highlights:

The sort builds its sort table by opening its subtree and consuming all the rows its children can provide. Once sorting is complete, the sort is ready to transition to output mode, and it returns control to its parent. The sort will later respond to GetRow() calls, returning the next sorted row each time. An illustrative call stack during sort input is:

Sort input call stack

Execution continues until each sort has consumed all the (disjoint ranges of) rows available from its child index seek. The sorts then call Close on the index seeks, and return control to their parent stream aggregate. The stream aggregates initialize their counters and return control to the producer side of the repartition exchange at node 5. The sequence of Open calls is now complete in this branch.

The profiling DMV at this point shows updated timing numbers, and close times for the parallel index seeks:

Profiling output

More exchange synchronization

Recall the parent task is waiting on the consumer side of node 5 for all producers to open. A similar synchronization process now happens among the parallel tasks on the producer side of the same exchange:

Each producer task synchronizes with the others via CXTransLocal::Synchronize. The producers call CXPort::Open, then wait on CXPACKET for all consumer-side parallel tasks to open. When the first Branch C parallel task arrives back at the producer side of the exchange and waits, the waiting tasks DMV looks like this:

Waiting tasks

We still have the parent task’s consumer-side waits. The new CXPACKET highlighted is our first producer-side parallel task waiting for all consumer-side parallel tasks to open the exchange port.

The consumer-side parallel tasks (in Branch B) do not even exist yet, so the producer task displays NULL for the execution context it is blocked by. The task currently waiting on the consumer side of the repartition streams exchange is the parent task (not a parallel task!) running EarlyPhases code, so it doesn’t count.

Parent task CXPACKET wait ends

When the second parallel task in Branch C arrives back at the producer side of the exchange from its Open calls, all producers have opened the exchange port, so the parent task on the consumer side of the exchange is released from its CXPACKET wait.

The workers on the producer side continue to wait for the consumer side parallel tasks to be created and open the exchange port:

Waiting tasks

Checkpoint

At this point in time:

  • There are a total of three tasks: Two in Branch C, plus the parent task.
  • Both producers at the node 5 exchange have opened, and are waiting on CXPACKET for the consumer side parallel tasks to open. Much of the exchange machinery (including row buffers) is created by the consumer side, so there is nowhere for the producers to put rows yet.
  • The sorts in Branch C have consumed all their input, and are ready to provide sorted output.
  • The index seeks in Branch C have completed their work and closed down.
  • The parent task has just been released from waiting on CXPACKET at the consumer side of the node 5 repartition streams exchange. It is still executing nested EarlyPhases calls.

Branch D Parallel Tasks Start

This is the third step in the execution sequence:

  1. Branch A (parent task).
  2. Branch C (additional parallel tasks).
  3. Branch D (additional parallel tasks).
  4. Branch B (additional parallel tasks).

Released from its CXPACKET wait at the consumer side of the repartition streams exchange at node 5, the parent task ascends the Branch B query scan tree. It returns from nested EarlyPhases calls to the various iterators and profilers on the outer (upper) input of the merge join.

As mentioned, ascending the tree updates the elapsed and CPU times recorded by the invisible profiling iterators. We are executing code using the parent task, so those numbers are recorded against execution context zero. This is the ultimate source of the “thread 0” timing numbers referred to in my previous article, Understanding Execution Plan Operator Timings.

Once back at the merge join, the parent task calls EarlyPhases for the iterators and profilers on the inner (lower) input to the merge join. These are nodes 10 to 15 (excluding 14, which is deferred):

Plan with node IDs

Once the parent task’s early phases calls reach the index seek at node 15, it begins to ascend the tree again (setting profiling times) until it reaches the repartition streams exchange at node 11.

Then, just as it did on the outer (upper) input to the merge join, it starts the producer side of the exchange at node 11, creating two new parallel tasks.
This sets Branch D in motion (shown below). Branch D executes exactly as already described in detail for Branch C.

Branch D

Immediately after starting tasks for Branch D, the parent task waits on CXPACKET at node 11 for the new producers to open the exchange port:

Waiting tasks

The new CXPACKET waits are highlighted. Notice the reported node id might be a bit misleading. The parent task really is waiting at the consumer side of node 11 (repartition streams), not node 2 (gather streams). This is a quirk of early phase processing.

Meanwhile, the producer threads in Branch C continue to wait on CXPACKET for the consumer side of the node 5 repartition streams exchange to open.

Branch D opening

Just after the parent task starts the producers for Branch D, the query profile DMV shows the new execution contexts (3 and 4):

Branch D profiles

The two new parallel tasks in Branch D proceed exactly as those in Branch C did. The sorts consume all their input, and the Branch D tasks return to the exchange. This releases the parent task from its CXPACKET wait. The Branch D workers then wait on CXPACKET at the producer side of node 11 for the consumer side parallel tasks to open. Those parallel workers (in Branch B) do not exist yet.

Checkpoint

The waiting tasks at this point are shown below:

Waiting tasks

Both sets of parallel tasks in Branches C and D are waiting on CXPACKET for their parallel task consumers to open, at repartition streams exchange nodes 5 and 11 respectively. The only runnable task in the whole query right now is the parent task.

The query profiler DMV at this point is shown below, with operators in Branches C and D highlighted:

Profile DMV output

The only parallel tasks we haven’t started yet are in Branch B. All the work in Branch B so far has been early phases calls performed by the parent task.

End of Part 4

In the final part of this series, I will describe how the remainder of this particular parallel execution plan starts up, and briefly cover how the plan returns results. I will conclude with a more general description that applies to parallel plans of arbitrary complexity.