Roberto A. Vitillo: Main-thread IO Analysis

Friday, February 21, 2014, 20:35

  • The goal of this post is to show that main-thread IO matters, not to “shame” this or that add-on or platform subsystem. There is only so much I can do alone by filing bugs and fixing some of them. By spreading knowledge of the off-main-thread IO API and why it matters, I hope that Firefox and add-on developers alike will try to exploit it the next time an opportunity arises. I changed the title of this post accordingly to reflect my intentions.
  • Please note that just because some code performs IO on the main-thread, it doesn’t necessarily mean that it can be avoided. This has to be evaluated case by case.

Recently a patch landed in Firefox that allows telemetry to collect aggregated main-thread disk IO timing per filename. Each filename comes with its accumulated time spent and the number of operations performed for open(), read(), write(), fsync() and stat(). Yoric wrote an interesting article some time ago on why main-thread IO can be a serious performance issue.

I collected a day’s worth of Nightly data using Telemetry’s map-reduce framework and filtered out all filenames that were not common enough (< 10%) among the submissions. What follows are some interesting questions we could answer thanks to the newly collected data.
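The filtering step can be sketched in a few lines of Python. This is only a rough illustration: the submission shape and field names below are my own assumptions, not the actual Telemetry payload format or the real map-reduce job.

    from collections import Counter

    def filter_common_filenames(submissions, threshold=0.10):
        """Keep only filenames present in at least `threshold` of submissions.

        `submissions` is assumed to be a list of dicts mapping a filename to
        its accumulated main-thread IO stats, e.g.
        {"prefs.js": {"time_ms": 124, "open": 3, "stat": 5}} -- a hypothetical
        shape, not the actual Telemetry schema.
        """
        counts = Counter()
        for submission in submissions:
            counts.update(submission.keys())  # one hit per filename per submission

        cutoff = threshold * len(submissions)
        common = {name for name, n in counts.items() if n >= cutoff}

        return [{name: stats for name, stats in submission.items() if name in common}
                for submission in submissions]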

  • How common are the various operations?
    The most common operation is stat(), followed by open(); all other operations are rare, with a third quartile of 0.
  • Is there a correlation between the number of operations and the time spent doing IO on a file?
    There is an extremely weak correlation, which is what one would expect: what really makes a difference is, for instance, the amount of bytes read and written, not the number of operations performed.
  • Does having an SSD lower the time spent doing IO on a file?
    A Mann-Whitney test confirms that the distributions of time conditioned on “having an SSD” and “not having an SSD” are indeed statistically different (a minimal sketch of such a test follows this list). On average, having an SSD lowers the time by a factor of 2.7.
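Here is a minimal sketch of how the last two checks could be run with SciPy. The data below is synthetic placeholder data, and the choice of Spearman correlation is my own; the original analysis may have used a different correlation measure.

    import numpy as np
    from scipy.stats import spearmanr, mannwhitneyu

    rng = np.random.default_rng(0)

    # Placeholder data standing in for the real per-file measurements
    # (the actual values come from the Telemetry submissions).
    n_operations = rng.integers(1, 200, size=1000)   # IO operations per file
    time_ms = rng.exponential(50, size=1000)         # accumulated IO time per file
    has_ssd = rng.random(1000) < 0.3                 # whether the submission reports an SSD

    # 1. Correlation between the number of operations and the time spent.
    rho, p_corr = spearmanr(n_operations, time_ms)
    print(f"Spearman rho = {rho:.3f} (p = {p_corr:.3g})")

    # 2. Are the timing distributions different for SSD vs non-SSD machines?
    u_stat, p_mw = mannwhitneyu(time_ms[has_ssd], time_ms[~has_ssd],
                                alternative="two-sided")
    print(f"Mann-Whitney U = {u_stat:.1f} (p = {p_mw:.3g})")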

The above facts apply to all submissions, but what we are really interested in is discerning those files for which the accumulated time is particularly high. An interesting exercise is to compare some distributions of the top 100 files (in terms of time) against the distributions of the general population to determine if bad timings are correlated with older machines:

  • Does the distribution of Windows versions differ between the population and the top 100?
    Fisher’s test confirms that the distributions do not differ (with high probability), i.e. the OS version doesn’t seem to influence the timings.
  • Does the distribution of CPU architectures (32 vs 64 bit) differ between the population and the top 100?
    Fisher’s test confirms that the distributions do not seem to differ (a minimal sketch of such a test follows this list).
  • Does the distribution of the disk type differ between the population and the top 100?
    The distributions do not seem to differ according to Fisher’s test.
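For a 2x2 comparison such as 32- vs 64-bit, Fisher’s exact test can be run directly with SciPy. The counts below are placeholders, not the actual numbers from the dataset; note that scipy.stats.fisher_exact only handles the 2x2 case, so the multi-category Windows-version comparison would need the generalized (Freeman-Halton) form of the test.

    from scipy.stats import fisher_exact

    # Rows: general population vs top 100; columns: 32-bit vs 64-bit.
    # These counts are placeholders, not the actual numbers from the dataset.
    table = [[5200, 1800],   # population
             [70, 30]]       # top 100

    odds_ratio, p_value = fisher_exact(table)
    print(f"odds ratio = {odds_ratio:.2f}, p = {p_value:.3g}")
    # A large p-value means the data gives no evidence that the two
    # distributions of CPU architectures differ.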

Can we deduce something from all this? It seems that files that behave badly do so regardless of the machine’s configuration. That is, removing the offending operations that perform IO on the main thread will likely benefit all users.

Which brings us to the more interesting question of how we should prioritize the IO operations that have to be moved off the main thread. I am not a big fan of looking only at aggregates of data, since outliers turn out to be precious more often than not. Let’s have a look at the distribution of the top 100 outliers.

1. Distribution of the top outliers.

Whoops, looks like httpDataUsage.dat, places.sqlite-wal, omni.ja and prefs.js behave particularly badly quite often. How bad is bad? For instance, one entry of httpDataUsage.dat performs about 100 seconds of IO operations (38 stat & 38 open), i.e. ~1.3 seconds per operation!

Now that we have a feeling of the outliers, let’s have a look at the aggregated picture. Say we are interested in those files most users hit (> 90%) and rank them by their third quartile:

2. Top 20 files most users hit.

In general the total time spent doing IO is about 2 to 3 orders of magnitude lower than for the outliers.
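The ranking behind the chart above can be sketched with pandas. The DataFrame layout and column names (submission_id, filename, time_ms) are assumptions of mine, not the actual dataset format.

    import pandas as pd

    def rank_by_third_quartile(df, min_hit_ratio=0.90, top_n=20):
        """Rank the filenames most users hit by the 75th percentile of IO time.

        `df` is assumed to have one row per (submission, filename) with a
        `time_ms` column; the column names are hypothetical.
        """
        n_submissions = df["submission_id"].nunique()

        # Keep only filenames seen in more than `min_hit_ratio` of submissions.
        hits = df.groupby("filename")["submission_id"].nunique()
        common = hits[hits > min_hit_ratio * n_submissions].index

        # Rank the remaining filenames by the third quartile of their IO time.
        q3 = (df[df["filename"].isin(common)]
              .groupby("filename")["time_ms"]
              .quantile(0.75)
              .sort_values(ascending=False))
        return q3.head(top_n)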

But what happened to the add-ons? It turns out that the initial filtering we applied to our dataset removed most of them. Let’s rectify the situation by taking a dataset with the single hottest entry for each file, in terms of time, and aggregating the filenames by extension:

3. Top extensions with slow file operations.
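The aggregation by extension could be approximated along these lines; the input mapping is a hypothetical stand-in for the per-file hottest entries, not the real map-reduce output.

    import os

    def aggregate_by_extension(hottest):
        """Group the single hottest entry per file by its filename extension.

        `hottest` is assumed to map a filename to its worst accumulated IO
        time in milliseconds, e.g. {"foo.sqlite": 1200, "bar.json": 80}
        (hypothetical values).
        """
        totals = {}
        for filename, time_ms in hottest.items():
            ext = os.path.splitext(filename)[1] or "<no extension>"
            totals[ext] = totals.get(ext, 0) + time_ms
        return sorted(totals.items(), key=lambda kv: kv[1], reverse=True)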

To keep track of some of the statistics presented here, mreid and I prepared a dashboard. Time to file some bugs!


http://ravitillo.wordpress.com/2014/02/21/main-thread-io-hall-of-shame/


 
