Nicholas Nethercote: How to speed up the Rust compiler one last time
Due to recent changes at Mozilla my time working on the Rust compiler is drawing to a close. I am still at Mozilla, but I will be focusing on Firefox work for the foreseeable future.
So I thought I would wrap up my “How to speed up the Rust compiler” series, which started in 2016.
I wrote ten “How to speed up the Rust compiler” posts. One of the earlier posts in the series described several of my improvements, including some that avoided unnecessary memcpys and several that improved the ObligationForest data structure. It also discussed some PRs by others that reduced library code bloat, and it included a table of overall performance changes since the previous post, something that I continued doing in subsequent posts.

Beyond those, I wrote several other posts related to Rust compilation.
As well as sharing the work I’d been doing, a goal of the posts was to show that there were people who cared about Rust compiler performance and that it was being actively worked on.
Boiling down compiler speed to a single number is difficult, because there are so many ways to invoke a compiler, and such a wide variety of workloads. Nonetheless, I think it’s not inaccurate to say that the compiler is at least 2-3x faster than it was a few years ago in many cases. (This is the best long-range performance tracking I’m aware of.)
When I first started profiling the compiler, it was clear that it had not received much in the way of concerted profile-driven optimization work. (It’s only a small exaggeration to say that the compiler was basically a stress test for the allocator and the hash table implementation.) There was a lot of low-hanging fruit to be had, in the form of simple and obvious changes that had significant wins. Today, profiles are much flatter and obvious improvements are harder for me to find.
My approach has been heavily profiler-driven. The improvements I made are mostly what could be described as “bottom-up micro-optimizations”. By that I mean they are relatively small changes, made in response to profiles, that didn’t require much in the way of top-down understanding of the compiler’s architecture. Basically, a profile would indicate that a piece of code was hot, and I would try to either (a) make that code faster, or (b) avoid calling that code.
It’s rare that a single micro-optimization is a big deal, but dozens and dozens of them are. Persistence is key.
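To make the pattern concrete, here is a minimal sketch of what such a micro-optimization often looks like in Rust. It is an invented example, not taken from an actual compiler PR: a profile shows a hot loop spending its time in allocation, and the fix is to write into a single pre-sized buffer instead.

```rust
// Hypothetical example of a bottom-up micro-optimization; the function
// names are invented and this is not from an actual rustc PR.

// Before: a profile shows this function hot, with much of its time in
// malloc/free, because format! allocates a temporary String per word.
fn join_slow(words: &[&str]) -> String {
    let mut s = String::new();
    for w in words {
        s += &format!("{} ", w); // one temporary allocation per iteration
    }
    s
}

// After: pre-size one buffer and write into it directly, avoiding the
// per-iteration temporaries and the repeated reallocation of `s`.
fn join_fast(words: &[&str]) -> String {
    let len: usize = words.iter().map(|w| w.len() + 1).sum();
    let mut s = String::with_capacity(len);
    for w in words {
        s.push_str(w);
        s.push(' ');
    }
    s
}

fn main() {
    let words = ["speed", "up", "the", "compiler"];
    assert_eq!(join_slow(&words), join_fast(&words));
}
```

Neither version is interesting on its own; the point is the shape of the work: the profile identifies the hot spot, and the change either makes that code cheaper or avoids running it at all.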
I spent a lot of time poring over profiles to find improvements. I have measured a variety of different things with different profilers. In order of most to least useful:

- Instruction counts (Cachegrind and Callgrind)
- Allocations (DHAT)
- All manner of custom path and execution counts via ad hoc profiling (counts)
- Memory use (DHAT and Massif)
- Lines of LLVM IR generated by the front end (cargo-llvm-lines)
- memcpys (DHAT)

Every time I did a new type of profiling, I found new things to improve. Often I would use multiple profilers in conjunction. For example, the improvements I made to DHAT for tracking allocations and memcpys were spurred by Cachegrind/Callgrind’s outputs showing that malloc/free and memcpy were among the hottest functions for many benchmarks. And I used counts many times to gain insight about a piece of hot code.
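As an illustration of the ad hoc style of profiling that counts supports, here is a minimal sketch. The idea, as described in the counts documentation, is to print one line per event of interest and then tally the distinct lines; the function and label names below are invented.

```rust
// Ad hoc profiling sketch: emit one line per event on stderr, then
// tally the lines with `counts`. The event labels here are invented.
fn process(kind: &str, size: usize) {
    // Bucket sizes to powers of two so that similar events collapse
    // into a single line, and therefore a single count.
    let bucket = size.next_power_of_two();
    eprintln!("process: kind={} size<={}", kind, bucket);
    // ... the hot code being investigated would go here ...
}

fn main() {
    for (kind, size) in [("expr", 3), ("expr", 5), ("item", 120)] {
        process(kind, size);
    }
}
```

Running the program with stderr piped through counts (e.g. ./prog 2>&1 | counts) then prints a frequency table of the distinct lines, which quickly shows which paths or size classes dominate.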
Off the top of my head, I can think of some profiling territories that I left unexplored: self-profiling/queries, threading (e.g. lock contention, especially in the parallel front-end), cache misses, branch mispredictions, syscalls, and I/O (e.g. disk activity). Also, there are lots of profilers out there, each with its own strengths and weaknesses, and each person has their own areas of expertise, so I’m sure there are still improvements to be found even for the profiling metrics that I did consider closely.
I also did two larger “architectural” or “top-down” changes: pipelined compilation and LLVM bitcode elision. These kinds of changes are obviously great to do when you can, though they require top-down expertise and can be hard for newcomers to contribute to. I am pleased that there is an incremental compilation working group being spun up, because I think that is an area where there might be some big performance wins.
Good benchmarks are important because compiler inputs are complex and highly variable. Different inputs can stress the compiler in very different ways. I used rustc-perf almost exclusively as my benchmark suite and it served me well. That suite changed quite a bit over the past few years, with various benchmarks being added and removed. I put quite a bit of effort into getting all the different profilers to work with its harness. Because rustc-perf is so well set up for profiling, any time I needed to profile some new code I would simply drop it into my local copy of rustc-perf.
Compilers are really nice to profile and optimize because they are batch programs that are deterministic or almost-deterministic. Profiling the Rust compiler is much easier and more enjoyable than profiling Firefox, for example.
Contrary to what you might expect, instruction counts have proven much better than wall times when it comes to detecting performance changes on CI, because instruction counts are much less variable than wall times (e.g. ±0.1% vs ±3%; the former is highly useful, the latter is barely useful). Using instruction counts to compare the performance of two entirely different programs (e.g. GCC vs clang) would be foolish, but it’s reasonable to use them to compare the performance of two almost-identical programs (e.g. rustc before PR #12345 and rustc after PR #12345). It’s rare for instruction count changes to not match wall time changes in that situation. If the parallel version of the rustc front-end ever becomes the default, it will be interesting to see if instruction counts continue to be effective in this manner.
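A quick back-of-the-envelope sketch of why the variance matters, using the noise figures above and an invented 1% regression:

```rust
// Illustration of the signal-to-noise argument: a change is only
// clearly detectable when it exceeds the run-to-run noise. The 1%
// regression is invented; the noise figures are the ones quoted above.
fn detectable(change_pct: f64, noise_pct: f64) -> bool {
    change_pct.abs() > noise_pct
}

fn main() {
    let regression = 1.0; // a hypothetical 1% slowdown
    assert!(detectable(regression, 0.1)); // instruction counts: stands out
    assert!(!detectable(regression, 3.0)); // wall times: lost in the noise
    println!("1% exceeds ±0.1% noise but not ±3% noise");
}
```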
I was surprised by how many people said they enjoyed reading this blog post series. (The positive feedback partly explains why I wrote so many of them.) The appetite for “I squeezed some more blood from this stone” tales is high. Perhaps this relates to the high level of interest in Rust, and also the pain people feel from its compile times. People also loved reading about the failed optimization attempts.
Many thanks to all the people who helped me with this work. In particular, thanks to Mark Rousskov, for maintaining rustc-perf and the CI performance infrastructure, and for helping me with many rustc-perf changes.

Rust’s existence and success is something of a miracle. I look forward to being a Rust user for a long time. Thank you to everyone who has contributed, and good luck to all those who will contribute to it in the future!
https://blog.mozilla.org/nnethercote/2020/09/08/how-to-speed-up-the-rust-compiler-one-last-time/