Roberto A. Vitillo: Clustering Firefox hangs
Jim Chen recently implemented a system to collect stacktraces of threads running some code for more than 500ms. A summary of the aggregated data is displayed in a nice dashboard in which the top N aggregated stacks are shown according to different filters.
I have looked at a different way to group the frames that would help us identify the culprits of main-thread hangs, aka jank. The problem with aggregating stackframes and looking at the top N is that there is a very long tail of stacks that are not considered. It might very well be that in that tail some important patterns could be lurking that we are missing.
So I tried different clustering techniques until I settled on the very simple solution of aggregating the traces by their last frame. Why the last frame? When I used k-means to cluster the traces I noticed that, for many of the more interesting clusters the algorithm found, most stacks had the last frame in common, e.g.:
Aggregating by the last frame yields clusters that are big enough to be considered interesting in terms of number of stacktraces and are likely to explain the most common issues our users experience.
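To make the grouping concrete, here is a minimal sketch of aggregating traces by their last frame. It assumes each trace is a list of frame names whose final element is the "last frame"; the frame names and the `cluster_by_last_frame` helper are invented for illustration and are not the code used for the analysis.

```python
from collections import Counter


def cluster_by_last_frame(traces):
    """Count stack traces per last frame, skipping empty traces."""
    return Counter(trace[-1] for trace in traces if trace)


# Hypothetical traces, made up for this example only.
traces = [
    ["Startup::Run", "EventLoop::Process", "PluginA::Init"],
    ["Startup::Run", "Timer::Fire", "PluginA::Init"],
    ["Startup::Run", "GC::Collect"],
]

clusters = cluster_by_last_frame(traces)

# Rank clusters by size and show each one's share of all traces.
for frame, count in clusters.most_common():
    share = count / len(traces)
    print(f"{frame}: {count} traces ({share:.0%})")
```

Unlike a top-N list of full stacks, every trace lands in exactly one cluster here, so the long tail still contributes to the counts.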
Currently on Aurora, the top 10 meaningful offending main-thread frames are in order of importance:
Even without showing sample stacks for each cluster, there is some useful information here. The elephants in the room are clearly plugins; or should I say Flash? But just how much do “plugins” hurt our responsiveness? In total, plugin-related traces account for about 15% of all hangs. It also seems that the median duration of a plugin hang is not different from a non-plugin one, i.e. between 1 and 2 seconds.
But just how often does a hang occur during a session? Let’s have a look:
The median number of hangs for a session amounts to 3; the mean is not that interesting as there are big outliers that skew the data. Also note that the median duration of a session is about 13 minutes.
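A small sketch shows why the median is the more useful summary when a few sessions have a huge number of hangs. The per-session counts below are invented for illustration and are not taken from the Aurora data.

```python
from statistics import mean, median

# Hypothetical hang counts per session; one session is an extreme outlier.
hangs_per_session = [1, 2, 3, 3, 4, 250]

median_hangs = median(hangs_per_session)
mean_hangs = mean(hangs_per_session)

print(f"median: {median_hangs}")
print(f"mean:   {mean_hangs:.1f}")
```

The single outlier drags the mean far above what a typical session sees, while the median stays put.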
As one would expect, the median number of hangs increases as the duration of a session does:
The analysis was run on a week’s worth of data for Aurora (over 50M stackframes) and I got similar results when re-running on previous weeks, so those numbers seem to be pretty stable.
There is some work in progress to improve the status quo. Aaron Klotz’s formidable async plugin initialization is going to eliminate trace 4 and he might tackle frame 8 in the future. Furthermore, a recent improvement in cycle collection is hopefully going to reduce the impact of frame 2.
http://robertovitillo.com/2014/11/25/clustering-firefox-hangs/