Heap Snapshots–Tracing All Heap Allocations

I’ve recently started using heap snapshots on Windows to track heap allocations. I was able to use heap snapshots to record call stacks for all outstanding allocations in Chrome’s browser process over a full two weeks, letting me account for pretty much every byte of memory consumed.

uxtheme.dll with its own heap inside chrome.exeSince then I have used heap snapshots to find wasteful memory usage in the Windows heap, a memory leak in a security tool injected into Chrome, and many details of Chrome’s memory usage that I was not previously aware of.

I first read about heap snapshots here. This page gives the mechanics of how to record a heap snapshot but it spends very little time explaining what heap snapshots are or how to use them effectively.

This post is mostly tutorial, rather than investigative reporting, but I hope it will be helpful to some and it does discuss some discoveries that I made along the way. Any time I look at Chrome through a different lens I find something.

What heap snapshots are

A heap snapshot records call stacks for every outstanding heap allocation in a process. The call stacks are stored in memory (thus increasing the memory footprint of the process being investigated) and can be dumped to an ETW trace file (.etl file) whenever desired – thus giving you a snapshot of the heap. This trace can be explored using Windows Performance Analyzer (WPA) with all of the power of its summary tables. This means that you can group the allocations by stack, heap, size, etc. in order to explore the allocations. It also means that you can use the TraceProcessor package to programmatically analyze the allocation data.

The key thing about heap snapshots is that they only retain data for outstanding allocations – memory that has not yet been freed. This is different from ETW Heap Tracing which retains information about every allocation, freed or not. This means that heap snapshots are great for finding leaks in long running programs. And, because they instrument all Windows heaps in the process they can track allocations by uxtheme.dll and other DLLs that like creating their own heaps.

How to use them

The process of recording heap snapshots is basically three steps:

  1. Enable heap snapshots for a particular process
  2. Record a heap snapshot – repeat this as many times as desired
  3. Disable heap snapshots for that process

In more detail these steps are:

    1. Enable heap snapshots for a particular process
      1. wpr -snapshotconfig heap -pid %pid% enable
    2. Record a heap snapshot – repeat this as many times as desired
      1. wpr -start heapsnapshot -filemode
      2. wpr -singlesnapshot heap %pid%
      3. wpr -stop “%temp%\UIforETW_heap_snapshot.etl”
      4. xperf -merge “%temp%\UIforETW_heap_snapshot.etl” “%FileName%” -compress
    3. Disable heap snapshots for that process
      1. wpr -snapshotconfig heap -pid %pid% disable

There are other ways of using them but this is what I have been doing.

That’s it. But instead of typing those commands in I recommend you use etwheapsnapshot.bat which is available on github and is distributed with the latest UIforETW release. If you use that batch file (must be run from an administrator command prompt) then you can just pass it a Process ID (PID) and then record as many heap snapshots as you want. Each one will contain all outstanding allocations from when heap snapshots were enabled until when the snapshot is recorded.

It is sometimes nice to start recording heap allocations before a process does its first allocation. You can do this by heap snapshotting by process name, which will enable it on process start for all processes with that name. However that will enable it too broadly in the case of Chrome. I only want to trace the main (browser) process but there are potentially dozens of chrome.exe processes that would all end up with heap snaphotting enabled. So, I wrote DelayedCreateProcess. This tool simply creates a process in a suspended state, prints the PID, waits ten seconds, and then lets the process start running. This gives me ten seconds to type in the PID in another (elevated) command prompt so that the heap snapshots record every allocation. This dance is shown here. After setting up the windows I just typed <Enter> to run DelayedCreateProcess.exe, then switched windows and type “14920<Enter>”, copied from the DelayedCreateProcess.exe output:

DelayedCreateProcess to get the PID

Using DelayedCreateProcess.exe is optional – sometimes you don’t need all allocations from the beginning of time.

If trace compression is used (which my batch file does) then the snapshot traces are tiny. My two-week Chrome browser process trace included data for more than 1.8 million allocations and was just 26 MB.

Heap allocations are much slower when heap snapshots are enabled. For most of Chrome this didn’t matter, but a few operations did enough allocations to make them slow enough that (of course) I grabbed a regular ETW trace to investigate.

Exploring traces

When you load a heap snapshot trace into WPA the graphs and tables from the default profile will mostly be empty – no CPU usage information is recorded, for instance. The allocation data is found in a special-purpose table. In graph explorer you can find it in the Memory section, however I would recommend applying the profile that ships with UIforETW. The default profile… has some issues. The default columns for Heap Snapshots are shown below:

Default heap snapshot columns

The presence of the Instance, Process, and PID columns suggest an assumption that users will typically be analyzing multiple heap snapshot traces (Instance column) which each contain data from multiple processes (Process column) with a purely redundant PID column, where the size of pointers is somehow a useful way of aggregating data. So far so mystifying.

  • Stack ID is the next mystery. This column isn’t documented (the tooltip is just Stack ID) and it does not seem to be have a stable correspondence with call stacks across traces so… I don’t understand.
  • Stack Ref# seems to correspond to a stack signature so it is potentially an interesting column to group by. But it’s not much use on the right-hand side of the orange bar.
  • Ref Count seems to be a copy of the Count column but with the aggregation set incorrectly so that it is blank when collapsed, rendering it useless.
  • Size (B) (how many allocated bytes are collapsed into the current row) is a very useful column!
  • Stack Depth seems like it has marginal value.
  • Count (how many allocations are collapsed into the current row) is a very useful column!

One notably absent column is the Stack column. Isn’t the whole point of heap snapshots to get a call stack for every allocation? That is my assumption so I think it is odd that the Stack column is off by default. Weirder yet is that when you go to add the Stack column it seems to be missing from the list of available columns. Wot is up?

Open View EditorIt turns out that the stack column is hidden by default – omitted from the list of available columns. If you want to find it you need to go to the view editor. To do this you can click on the gear icon or type Ctrl+E. This brings up the full list of columns. You will almost certainly want to drag the Stack column over to the right. I think that the Heap Handle column can be helpful as well, as can the Annotation column (annotate allocations to group them by type) and the Stack Ref# column (unique for each stack), which gives us this set of columns:

Preferred heap snapshot columns

This is the set of columns I put in the HeapSnapshot.wpaProfile. Some of them are unnecessary for some types of investigations but I think it is easier to hide unneeded columns than to have to hunt down possibly useful ones. Here we can see seven different heaps in Chrome’s browser process, with 13 MB devoted to the safe browsing database:

Heap snapshot with safe browsing visible

One technique which is often worth trying with heap snapshots (and with regular heap traces) is flipping the call stacks. All of the stacks end at RtlpAllocateHeapInternal so you can drill down to that function on any stack, right-click, and select View Callers-> By Function:

View Callers-> By Function

This gives you call stacks that branch out from RtlpAllocateHeapInternal. This different view of the stacks makes different patterns jump out – in this case the significant amount of memory devoted to std::string allocations is quite obvious:

Lots of string allocations

Note that heap snapshot tracing is best done on 64-bit processes. 32-bit processes may hit out-of-memory failures due to the extra metadata recorded, and the PID method of enabling tracing doesn’t actually work with 32-bit processes on a 64-bit process.

Other possibilities

  • You could put multiple snapshots into one ETW file. I don’t think that’s worthwhile because you end up with a lot of less useful data from the time period in-between.
  • You could record CPU/disk utilization information into the heap snapshot file. I don’t think that’s worthwhile because you end up with snapshots (point in time) in the same trace as data over time. But, it is certainly possible if you have a need for it.
  • You could start tracing by process name.That could be more convenient, but it doesn’t work well for my purposes with tracing Chrome’s memory allocations.

Discoveries

You can see my learning curve with heap snapshots on this Chromium bug. I guess it’s interesting if you want to see how the sausage is made, but you’ll see a lot of false steps and random ramblings in there. So, I’ll summarize the key findings:

  • WebRTC logging can use 12 MB of memory, which seems like a lot.
  • A customer who was complaining about memory leaks in Chrome’s browser process turned out to be running a leaky version of Cisco’s Morphisec Endpoint Threat Prevention. The leaking of ~5 MB/hour was apparently fixed some time earlier this year.
  • The Morphisec leak showed up as hundreds of thousands of allocations from a separate heap, created by Protector64.dll. If we had a chrome://heaps page that showed the sizes of the various heaps then third-party bugs like this might be easier for users to find.
  • Each Chrome process creates two heaps, rather than using the default process heap. Each heap comes with some overhead so this may be wasting memory.
  • On my workstation (your mileage may vary) the Windows heap seems to be tuned to trade memory for performance. That is, it allocates additional per-thread lists of free memory in order to maximize parallelism, but this means (again, on my multi-core many-GB workstation) that more than half of the heap memory is not actually in use. The “Segment Heap” is an off-by-default alternative heap that might improve this significantly.
  • Some bugs in windbg’s !heap implementation were found – I think that fixes will be released in the next version.
  • Memory usage in Chrome’s browser process increases over time (storing the history of open and closed tabs, for instance) but I saw no clear signs of leaks.
  • Typing into Chrome’s omnibox is sluggish when heap snapshots are enabled, indicating that many allocations are being done. Tracing showed that many of the allocations come from searching to see if the URL being typed is open in a tab. This could probably be improved.
  • Chrome switched from the Windows heap to Partition Alloc in early 2021, which means that heap snapshots no longer work for Chrome’s allocations. If you build Chrome from source you can re-enable the Windows heap by setting this gn arg: use_allocator = “none”

About brucedawson

I'm a programmer, working for Google, focusing on optimization and reliability. Nothing's more fun than making code run 10x as fast. Unless it's eliminating large numbers of bugs. I also unicycle. And play (ice) hockey. And sled hockey. And juggle. And worry about whether this blog should have been called randomutf-8. 2010s in review tells more: https://twitter.com/BruceDawson0xB/status/1212101533015298048
This entry was posted in Documentation, Performance, Programming, uiforetw, xperf and tagged , . Bookmark the permalink.

23 Responses to Heap Snapshots–Tracing All Heap Allocations

  1. Zhentar says:

    That awful default view looks like it’s designed for Microsoft developers debugging heap tracing. The ‘Stack ID’ just relates to how raw heap allocation events map to the raw stack trace events in the ETL file. It is completely meaningless beyond that, and there’s no reason to show it to users.

  2. akraus1 says:

    Do you know with which Windows Version Heap Snapshots work? The default views of WPA are really crap.

    • Zhentar says:

      I’m pretty sure it was part of all the Windows 8 changes (in particular I suspect stack caching was designed to support it).

      • brucedawson says:

        Thanks for the information. I’m running Windows 10 1903 with WPT 1903 and I don’t know when it actually started working. I know that the initial Windows 10 WPT didn’t support them, so make sure you’re using the latest version of WPT, which is installed by the latest UIforETW or the Windows 10 1903 SDK.

  3. LukasT says:

    Tried to follow the tutorial, somehow the bat file bite me.

    > @rem Note: this probably fails in some locales. Sorry.

    I was hit by the locale bug in the etwheapsnapshot.bat.
    date/T outputs something like this on my Slovak locale:
    pi 22. 11. 2019

    Thanks for the blog, I managed to get the heap snapshots in the end!

  4. ryanmolden says:

    Ideas on this error?

    C:\WINDOWS\system32>wpr -stop “UIforETW_heap_snapshot.etl”

    Cannot change thread mode after it is set.

    Profile Id: RunningProfile

    Error code: 0x80010106

  5. Hamlet says:

    Excuse me, Which Apps are suitable for Investigating nowadays on Windows 10 with wpa/wpr? that may get the real issue e.g. CPUsaturation, Memory leak detection etc.

    • brucedawson says:

      In the context of heap snapshots, any app that does heap allocations is suitable for investigating, and that means all apps except for those that use a custom memory allocator.
      In the broader context of performance investigations you can use WPA/WPR/ETW to investigate any Windows performance issue.
      You will get better answers with more specific questions. Try looking at tinyurl.com/etwcentral first since it has more overview links.

      • Hamlet says:

        Sorry, do you have a demo(Video) for the above-related work anywhere?
        Also, Is there a way to capture that issue in the Perfmon tool (which counters/Instance related are suitable)? Thanks!

        • brucedawson says:

          I don’t have any videos demonstrating recording and analysis of heap snapshots.
          This post doesn’t talk about permon and counters. Are you thinking of a different blog post perhaps?

          • Hamlet says:

            Hello Bruce,
            I’m trying to see if you have a blog or any, about predicting service/process/event failure/issue from series of events over time? and maybe of using WPA or even Statistics methods for that prediction!
            Thank you.

  6. pongba says:

    Hi Bruce,
    I followed all (4) steps (and no error message along the way) here: https://docs.microsoft.com/en-us/windows-hardware/test/wpt/record-heap-snapshot, but the .etl file doesn’t contain “Memory” section (just System Activity, Power, and Other). Do you happen to know why? On another machine it works, however. Same procedure. Looked everywhere and couldn’t find any answer .

    • brucedawson says:

      I never use wpr so I can’t really troubleshoot issues with it. I guess you should do it on the machine that works, or share more information about what is different about the machine where it doesn’t work. Or use the heap-snapshot batch files which I explain here and see if they work any better.

      • pongba says:

        Thanks Bruce for the quick reply. It appeared that it was likely due to the OS being of a too old version (Windows Server 2016) (the heapsnapshot seems to first appear as part of RedStone 4). I’ll upgrade the OS and see if it works. Will update my findings here once I conclude it.

  7. WL says:

    Hi Bruce, another issue that I’ve found is the overhead. Our service is a 30+GB, allocation intensive online service. Enabling heapsnapshot capturing causes the CPU % to jump from 15% to >80% (persistent) on all cores the service uses. This huge overhead delays every incoming request processing so much that it basically renders snapshot useless for memory leak investigation, since at the time of snapshot so many request are still outstanding, there’re too many false positive outstanding allocations in the snapshot. Have you run into similar situation and if so what’s your way out of it?

    • brucedawson says:

      I noticed that some actions in Chrome were slow when recording heap snapshots so I profiled those actions and filed bugs to report the excessive heap allocations. Some of those bugs got fixed, which made heap snapshots more usable.
      Your options are really pretty basic: either live with the slowdown, or reduce the allocation rate either by optimizing the code (side benefit!) or by reducing the load on the server.

Leave a comment

This site uses Akismet to reduce spam. Learn how your comment data is processed.