Delete an inline function, save 794 kB

Posted on January 22, 2017 by brucedawson

In the previous episode of “Simple Changes to Shrink Chrome” I discussed how deleting ‘const’ from a few key locations could lead to dramatic size savings, due to a VC++ compiler quirk. In this episode I’ll show how deleting an inline function definition can lead to similar savings.

The savings this time are less important as they are mostly in the .BSS segment, but there are also some modest code-size savings, and some interesting lessons. It’s worth confessing up front that this time the problem being solved was not caused by a compiler quirk – it was entirely self inflicted.

Doing this investigation has reminded me that the behavior of linkers is best described by chaos theory – the details of their behavior defy prediction by simple heuristics, and the results can be changed dramatically by tiny ~~butterflies~~ code changes.

In August of 2016 Chrome switched meta-build systems, from gyp to gn. Both gyp and gn are systems for creating ninja files and these are the files that define how Chrome is built. Changing from gyp to gn was a huge project because of the thousands of assumptions that were embedded in Chrome’s many .gyp files, but the change was necessary – gyp was showing its age.

Part of my contribution was to help investigate binary size regressions and one technique I used was to export symbol lists and see which ones appeared in gn builds but not gyp builds. Here are the six largest globals in chrome.dll from a gyp build near the transition:

gyp:
122784 kBrotliDictionary
65536 jpeg_nbits_table
54904 propsVectorsTrie_index
47152 device::UsbIds::vendors_
44815 kNetworkingPrivate
39640 propsTrie_index

And here are the six largest globals from a similar gn build:

gn:
131072 ff_sin_65536
131072 ff_cos_65536
122784 kBrotliDictionary
65536 jpeg_nbits_table
65536 ff_sin_32768
65536 ff_cos_32768

The ff_sin* and ff_cos* globals (which go from 65536 entries down to 16) are from ffmpeg, and since they weren’t in the gyp build they clearly aren’t supposed to be in chrome.dll. A bit of investigation using linker_verbose_tracking.py led to this change which switched some source_set targets (lists of object files to be linked) to static_library targets, thus telling the linker to be less aggressive about pulling in their code. This worked, I closed the bug, and got on with my life.

Are you sure?

Until I noticed that the arrays were still there.

I had made the amateur mistake of failing to verify my results on Chrome’s official build. The fix should have worked on those builds as well, but Murphy’s Law doesn’t care about ‘should’.

So, I repeated my investigation using the same tools, but now using Chrome’s official build. This was a painful process. Chrome’s official build uses Link Time Code Generation (also known as Link Time Optimization) which means that every change, no matter how small, regenerates all of the code for Chrome’s enormous DLLs. My iteration times went from a couple of minutes to about an hour. Uggh.

Chrome has a “component” build configuration which can give incremental times as short as a few seconds, but this configuration is not helpful for this type of investigation.

But, it’s just computer time. I filled in the gaps with other projects. The bigger problem was that my technique was not working. No matter how many source_set targets I changed to static_library targets the linker kept finding multiple reasons to pull in ffmpeg.

My Python script works by scanning the verbose linker output and printing the chain of dependencies that lead to a particular object file (in this case rdft.obj, the one that defines the arrays) being pulled in. I decided that I needed to modify my tool to print out why each object file was being pulled in. So, I changed it to list which symbol caused a particular object file to be pulled in and I started seeing results like this:

rdft.obj referenced by
        avfft.obj for symbol ff_rdft_init

avfft.obj referenced by
        FFTFrameFFMPEG.obj for symbol av_rdft_calc

FFTFrameFFMPEG.obj referenced by
        PeriodicWave.obj for symbol blink::FFTFrame::doInverseFFT

PeriodicWave.obj referenced by
        SkGeometry.obj for symbol log2f

Looking at a few sets of these results I noticed some strange things. The crucial object file seemed to always be PeriodicWave.obj, and it was always being pulled in because some other object file needed a copy of log2f. This was peculiar because PeriodicWave.obj is a Blink (née WebKit) file, which shouldn’t be needed in chrome.dll. It’s also peculiar because log2f is a C Run-Time (CRT) function, which shouldn’t be supplied by Blink anyway.

PeriodicWave.cpp didn’t contain a definition for log2f which left me confused for a while but I eventually put the pieces together:

MathExtra.h defined inline log2 and log2f function
PeriodicWave.cpp included MathExtra.h and contained some calls to log2f
When VC++ notices a call to an inline function it generates a non-inline version of that function, in case it doesn’t get inlined
That non-inline version of log2f happened to get pulled in by the official builds which then pulled in the rest of PeriodicWave.obj, and the /OPT:REF optimization that is supposed to discard the unused pieces was unable to get rid of them

The definition of log2 and log2f in MathExtras.h were only ever intended to work around their absence in VC++ and Android. Since VC++ now had those functions the inline definitions were unnecessary, and apparently actively harmful. So I landed a pair of changes that removed them, first just for Windows and then for Android:

Don’t define log2 and log2f in blink (Windows only)

Don’t define log2 and log2f in blink for Android

The one remaining mystery is why this only happened in the official builds. I don’t know the answer with 100% certainty, but I know enough about how symbol resolution works in linkers to have some ideas. Linkers start out with the set of object files that you have specified on the command line – the contents of these object files will be included in the final binary (unless the linker can prove that removing them makes no difference). Then the linker builds up a list of unresolved symbols and starts scanning through the set of libraries that were specified on the command line – the object files in these libraries will be included as-needed. Whenever a needed symbol is found in one of the library object files then that object file is added to the list to be linked – it becomes like one of the command-line object files. The process continues until all symbols are resolved.

The symbol search order is unspecified by C++ and due to the One Definition Rule (ODR) the search order should not matter. But, an obvious implementation is to repeatedly scan through the libraries, looking for symbols, until there are no more unresolved. This means that if you have two copies of the same symbol then it is undefined which you will get – it depends where you are when you first find that you need it. Since pulling in a symbol pulls in the rest of its object file this means that linking is fundamentally unpredictable. And, if you have ODR violations (as we did) then the results are not even defined.

So, the official builds perturbed the search algorithm enough that a different copy of log2f was found, and that led to unintended consequences.

Ideally the linker would notice the ODR violation, but doing so ends up being expensive – you have to look into every object file to see if any of functions are defined in different ways, and the definition of ‘different’ is not obvious. There are good reasons that the C++ standard does not require diagnostics for ODR violations.

I wrote about non-inlined inline functions and ODR violations in Chrome barely a month before this article, here.

Ideally chrome.dll wouldn’t even list Blink as something to be linked with. Another Chrome developer has since fixed that.

I still feel bad about not noticing that the arrays were in our official builds after I “fixed them” the first time. Luckily I’m the sort of developer who randomly downloads symbols for Chrome’s canary builds and runs analysis tools on them, so I eventually redeemed myself.

If you’re the sort of developer who likes browsing Chrome’s source then you can see the (tiny) change for Windows by typing “> git show 036e2ce” in a Chromium repo. A more accessible version can be found here.

The net effect of this change on the size of chrome.dll’s sections (in-memory size) was:

chrome.dll
     .text:     -6656 bytes change
    .rdata:       240 bytes change
     .data:   -786624 bytes change
   .rodata:      -592 bytes change
    .reloc:      -872 bytes change
Total change: -794504 bytes

More details of the techniques used can be found here. If you want to poke at some relevant chrome.dll PDBs then you can use retrievesymbols from UIforETW to grab them and then run SymbolSort or ShowGlobals on them. Just set _NT_SYMBOL_PATH to something like SRV*c:\symbols*https://msdl.microsoft.com/download/symbols and then run one of these:

> retrievesymbols B6A75B4A5FE4446CAB6AE159B22DD91F 1 chrome.dll.pdb – gyp build (September 24th, 2016)
> retrievesymbols 0D12FEA487E942BB8719672D45464C51 1 chrome.dll.pdb – gn build (October 30th, 2016)
> retrievesymbols C39CCDB899B34A7CBCC25F51E64B1F62 1 chrome.dll.pdb – recent gn build (January 13, 2017)

These weren’t the PDBs used for the original investigation, but they show the issue.

Reddit discussion is here and here, twitter discussion is here, hacker news discussion is here.

About brucedawson

I'm a programmer, working for Google, focusing on optimization and reliability. Nothing's more fun than making code run 10x as fast. Unless it's eliminating large numbers of bugs. I also unicycle. And play (ice) hockey. And sled hockey. And juggle. And worry about whether this blog should have been called randomutf-8. 2010s in review tells more: https://twitter.com/BruceDawson0xB/status/1212101533015298048

View all posts by brucedawson →

This entry was posted in Performance, Programming, Visual Studio and tagged inline, linking, VC++. Bookmark the permalink.

13 Responses to Delete an inline function, save 794 kB

Peter Bindels says:

January 23, 2017 at 12:26 am

> Since pulling in a symbol pulls in the rest of its object file this means that linking is fundamentally unpredictable.

On most platforms all symbols are each separated into their own function or data section, with per-section relocations as needed. This only will pull in the symbol itself and its relocations. Sadly two caveats still apply:

– This only works if your object format has sufficient space for it. Not all ELF tools can handle >64k sections yet, and Mach-O is IIRC unable to do this at all.
– If the object file holds globals, those cannot be excluded as they may be needed in the function that was pulled in – and you also get whatever is referenced from those global’s constructors.

You get this link-instability from something as simple as `assert(log2f(0x1245) == 0.87f);` – this does generate a link reference in debug, but does not in release, causing wholly different pull-in chains to be used. Linkers would be more stable if they pulled in everything from the first definition of that symbol they saw – but sadly, linkers currently don’t do that.

Reply
- brucedawson says:
  
  January 23, 2017 at 8:45 am
  
  How sure are you about that linker behavior? I could run tests but they tend to be very tricky.
  
  With VC++ the usual behavior is to have each symbol in their on “COMDAT”, to make it easy for the linker to discard unused components of an object file. Despite that the VC++ linker brings in the entire object file and then discards parts of it, rather than bringing in just one COMDAT. So, separate sections for each symbol aren’t necessarily sufficient, and I’m wondering what references or experience you have for your assertion – I’m curious.
  
  The problem of globals with constructors is significant. Those need to be pulled in even if they are unreferenced – many programmers depend on that, often more than they realize.
  
  Reply
  - Peter Bindels says:
    
    January 23, 2017 at 1:04 pm
    
    For Linux that’s `-ffunction-sections -fdata-sections` for the compiler to generate a section per symbol, and `–gc-sections` for the linker to exclude those that are unreferred. Of course, the moment you include an object file that has a global with a constructor, your .ctors segment (which itself is force-included) will refer to your constructor, which refers to your vtable and all your children object constructors, which … and so on. I’ve seen “empty” files that only included a header file balloon up to a few thousand symbols, because of a possibly-last instance of a global template object instance (or something else – I never really got to the bottom of that)…
    
    Reply
    - brucedawson says:
      
      January 23, 2017 at 8:26 pm
      
      Thanks for the details.
      
      Chaos theory. It’s fractals all the way down. And excessive code-in-headers will be the death of us, in more ways than we realize.
      
      Reply
Ben Craig says:

January 23, 2017 at 6:21 am

I may be preaching to the choir here, but this reinforces some lessons I’ve learned the hard way.

Don’t define symbols that aren’t yours. log2 and log2f belong to the C-runtime. Even if your C-runtime doesn’t have them, they are still off limits. Random posix symbols that don’t exist on Windows aren’t yours, so don’t define those either. Defining symbols that don’t belong to you fails the “Can two people do this?” test.

A better approach is to make your own symbol, and have that forward to the C-runtime (or OS, or whatever) when the symbol is present. When the symbol isn’t present, you can have your own implementation of that function. blink_log2 and blink_log2f seem like the right names for the functions in this story.

Great investigation though.

Reply
- brucedawson says:
  
  January 23, 2017 at 8:49 am
  
  I agree. However the temptation to create a log2f symbol is great because it is so darned convenient, and it reduces the code churn. And, it is entirely safe as long as you remember to get rid of it when it is needed no longer. Well, entirely safe as long as (as you point out) nobody else does the same thing.
  
  Reply
brittany says:

January 23, 2017 at 9:32 am

Hey, I noticed you mention SymbolSort in your post. I have a project that I originally based on SymbolSort that maybe you would be interested in? It (attempts to) convert an entire PDB file into a mysql database. This effectively lets you perform sql queries on your pdbs.

https://github.com/briterator/drpdb

Reply
- brucedawson says:
  
  January 23, 2017 at 10:05 am
  
  Nice! Thanks for sharing.
  
  Reply
John Payson says:

January 23, 2017 at 12:08 pm

How substantial are the savings from link-time code optimization, to what extent are they concentrated in places which the authors of code would naturally expect them to be, and to what extend could they not be achieved via faster means? Having the release build process be so slow that it becomes impractical to test routinely doesn’t seem like a recipe for a robust product.

Reply
- brucedawson says:
  
  January 23, 2017 at 12:54 pm
  
  The improvements from LTCG are highly variable, but 5-15% is a good estimate. Using Profile Guided Optimization (which requires LTCG) gives another 5-10% for Chrome.
  
  We rarely have LTCG specific bugs so, while the long LTCG specific build times are a concern, they are not generally a serious problem.
  
  Without LTCG if you want a function to be inlined you have to put its implementation in a header file. Doing so leads to slower builds – much slower if you happen to be working on a function which would otherwise be in a .cpp file. Using LTCG means that we can keep function implementations out of header files without affecting run-time performance. This then means that our debug and component builds, which are optimized for build-time instead of run-time performance, can build quickly more often.
  
  I doubt we will abandon LTCG builds anytime soon. Computer time is cheaper than programmer time so rather than having developers tweak their function placement we will continuing having our compilers do that. We will focus on getting our LTCG builds to run faster.
  
  Reply
  - John Payson says:
    
    February 7, 2017 at 12:25 pm
    
    I was particularly curious about the second and third aspects of my question: if one can get a 10% speedup by subjecting the entire program to LTCG, but could get a 9% speedup by targeting a certain identifiable section, the latter approach might work out better. I would also expect that adding directives to assist an otherwise conservative compiler/optimizer would offer performance benefits comparable or better than those that could be achieved with aggressive optimization, but with far less risk.
    
    If LTCG optimizations were limited to things like optimizing register usage to avoid spills, development with quicker builds that generate less efficient code would seem reasonable. By my understanding, however, it has become fashionable for compilers to aggressively exploit concessions that the Standard made to support obscure hardware (e.g. an attempt to compare unrelated pointers may have arbitrary side-effects beyond yielding an arbitrary 0 or 1); having a compiler decide the code in one module can safely assume that two pointers will identify parts of the same object because some code in another module uses a relational test to decide whether to perform some operation in bottom-up or top-down order would seem like a recipe for disaster.
    
    Having directives to explicitly invite the compiler to make particular assumptions about variable values at various points in the code would seem like a much sounder approach than having compilers try to draw inferences for themselves. If desired, an analysis tool could suggest places where it looks like an assumption would facilitate better code generation, so as to allow programmers to then either invite the compiler to make such an assumption or specify that a code does in fact needs to handle cases the tool would have guessed were unimportant. Such an approach would allow optimizations which aren’t presently possible, it would allow humans to tell what is or is not being assumed.
    
    Reply
    - brucedawson says:
      
      February 13, 2017 at 7:36 pm
      
      Programmer time is expensive, so any solution that involves annotating code starts out at a huge economic disadvantage.
      
      Also, programmers generally don’t actually know where to put these annotations because they don’t really know the performance characteristics of their code that well. PGO (Profiled Guided Optimizations) do.
      
      I’m not sure what you are suggesting with your LTCG suggestions. Minimizing register spills doesn’t require LTCG. Cross-module inlining does. If you don’t want that optimization you probably shouldn’t use LTCG.
      
      Reply
      - John Payson says:
        
        February 28, 2017 at 1:17 pm
        
        Programmers are apt to know things that compilers aren’t, especially with regard to situations where a program is required to either yield a correct answer or report that it is unable to do so, and where a wide but bounded set of behaviors would be acceptable in the failure case. A lot of programs use conditional compilation to replace assertions with no-ops in release builds; I think it would be more helpful to have e.g. a __CHECKED_ASSUME directive which a compiler would be free to interpret as either a no-op or a disruptive-trap assertion at its leisure, bearing in mind that the latter might be efficient [since downstream code would only need to handle cases that aren’t trapped]. If this were augmented by unsequenced checked-assumption and assertion directives along with sequencing barriers, such that a compiler could have assertions or checked assumptions trap “early” provided they don’t cross a barrier, and could likewise have checked assumptions trap “late”, that would allow compilers to ensure that programs meet loose but critical requirements (e.g. don’t allow suppliers of maliciously-malformed data to execute arbitrary code) with far less overhead than would be required if traps had to be executed precisely in sequence.
        
        Compilers and programmers “know” different things. The more precisely a programmer can describe behavioral requirements, the more efficiently the compiler will be able to supply code meeting them.
        
        Reply