Benoit Girard: Using RecordReplay to investigate intermittent oranges

Вторник, 26 Января 2016 г. 01:16 + в цитатник

This is a quick write up to summarize my, and Jeff’s, experience, using RR to debug a fairly rare intermittent reftest failure. There’s still a lot of be learned about how to use RR effectively so I’m hoping sharing this will help others.

Finding the root of the bad pixel

First given a offending pixel I was able to set a breakpoint on it using these instructions. Next using rr-dataflow I was able to step from the offending bad pixel to the display item responsible for this pixel. Let me emphasize this for a second since it’s incredibly impressive. rr + rr-dataflow allows you to go from a buffer, through an intermediate surface, to the compositor on another thread, through another intermediate surface, back to the main thread and eventually back to the relevant display item. All of this was automated except for when the two pixels are blended together which is logically ambiguous. The speed at which rr was able to reverse continue through this execution was very impressive!

Here’s the trace of this part: rr-trace-reftest-pixel-origin

Understanding the decoding step

From here I started comparing a replay of a failing test and a non failing step and it was clear that the DisplayList was different. In one we have a nsDisplayBackgroundColor in the other we don’t. From here I was able to step through the decoder and compare the sequence. This was very useful in ruling out possible theories. It was easy to step forward and backwards in the good and bad replay debugging sessions to test out various theories about race conditions and understanding at which part of the decode process the image was rejected. It turned out that we sent two decodes, one for the metadata that is used to sized the frame tree and the other one for the image data itself.

Comparing the frame tree

In hindsight, it would have been more effective to start debugging this test by looking at the frame tree (and I imagine for other tests looking at the display list and layer tree) first would have been a quicker start. It works even better if you have a good and a bad trace to compare the difference in the frame tree. From here, I found that the difference in the layer tree came from a change hint that wasn’t guaranteed to come in before the draw.

The problem is now well understood: When we do a sync decode on reftest draw, if there’s an image error we wont flush the style hints since we’re already too deep in the painting pipeline.

Take away

Finding the root cause of a bad pixel is very easy, and fast, to do using rr-dataflow.
However it might be better to look for obvious frame tree/display list/layer tree difference(s) first.
Debugging a replay is a lot simpler then debugging against non-determinist re-runs and a lot less frustrating too.
rr is really useful for race conditions, especially rare ones.