Chris H-C: Data Science is Hard: Validating Data for Glean

Monday, June 10, 2019, 19:09

Glean is a new library for collecting data in Mozilla products. It’s been shipping in Firefox Preview for a little while and I’d like to take a minute to talk about how I validated that it sends what we think it’s sending.

Validating new data collections in an existing system like Firefox Desktop Telemetry is a game of comparing against things we already know. We know that some percentage of data we receive is just garbage: bad dates, malformed records, attempts at buffer overflows and SQL injection. If the amount of garbage in the new collection is within the same overall amount of garbage we see normally, we count it as “good enough” and move on.

With new data collection from a new system like Glean coming from new endpoints like the reference browser and Firefox Preview, we’re given an opportunity to compare against the ideal. Maybe the correct number of failures is 0?

But what is a failure? What is acceptable behaviour?

We have an “events” ping in Glean: can the amount of time covered by the events’ timestamps ever exceed the amount of time covered by the ping? I didn’t think so, but apparently it’s an expected outcome when the events were restored from a previous session.
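That question can be written down as a check. The following is a minimal sketch, assuming a simplified ping shape that is not Glean's actual schema: event timestamps as milliseconds, plus hypothetical `ping_start` and `ping_end` datetimes. A `True` result is a prompt to investigate, not proof of a bug, since restored events can legitimately trigger it.

```python
from datetime import datetime

def event_span_exceeds_ping_span(event_timestamps_ms, ping_start, ping_end):
    """Return True when the events cover more time than the ping itself does.

    All three parameters are hypothetical stand-ins for whatever the real
    ping exposes; they are not names taken from Glean.
    """
    if not event_timestamps_ms:
        return False  # no events, nothing to compare
    event_span_ms = max(event_timestamps_ms) - min(event_timestamps_ms)
    ping_span_ms = (ping_end - ping_start).total_seconds() * 1000
    return event_span_ms > ping_span_ms

# A made-up ping covering exactly one minute.
start = datetime(2019, 6, 10, 12, 0, 0)
end = datetime(2019, 6, 10, 12, 1, 0)

within = event_span_exceeds_ping_span([0, 30_000], start, end)
beyond = event_span_exceeds_ping_span([0, 90_000], start, end)
```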

So how do you validate something that has unknown error states?

I started with a list of questions any client-based, network-transmitted data collection system has to be able to answer:

How many pings (data transmissions) are there?
How many measurements are in those pings?
How many clients are sending these pings?
How often?
How long do they take to get to our servers?
How many poorly-structured pings are sent? By how many clients? How often?
How many pings with bad values are sent? By how many clients? How often?
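Most of these questions reduce to counting. Here is a minimal sketch of that first pass, assuming each ping has already been parsed into a dictionary. The field names `client_id`, `metrics`, and `well_formed` are invented for illustration and are not Glean's schema, and the sample records are made up.

```python
from collections import Counter
from statistics import median

# Made-up, simplified ping records. Field names are assumptions.
pings = [
    {"client_id": "a", "metrics": {"m1": 1, "m2": 2}, "well_formed": True},
    {"client_id": "a", "metrics": {"m1": 3}, "well_formed": True},
    {"client_id": "b", "metrics": {}, "well_formed": False},
]

def summarize(pings):
    """Answer the basic volume questions for a batch of pings."""
    per_client = Counter(p["client_id"] for p in pings)
    malformed = [p for p in pings if not p["well_formed"]]
    return {
        "ping_count": len(pings),
        "client_count": len(per_client),
        "median_pings_per_client": median(per_client.values()),
        "measurement_count": sum(len(p["metrics"]) for p in pings),
        "malformed_ping_count": len(malformed),
        "malformed_client_count": len({p["client_id"] for p in malformed}),
    }

summary = summarize(pings)
```

Counting malformed pings and the clients sending them separately matters: one broken client sending a thousand bad pings is a different problem from a thousand clients each sending one.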

From there we can dive into validating specifics about the data collections:

Do the events in the “events” ping have timestamps with reasonable separations? (What does reasonable mean here? Well, it could be anything, but if the span between two timestamps is measured in years, and the app has only been available for some number of weeks, it’s probably broken.)
Are the GUIDs in the pings actually globally unique? Are we seeing duplicates? (We are, but not many.)
Are there other ping fields that should be unique, but aren’t? (For Glean, no client should ever send the same ping type with the same sequence number. But that kind of duplicate appears, too.)
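The uniqueness and timestamp checks above can be sketched in a few lines. This is an illustration under assumptions, not the queries actually used: `document_id`, `client_id`, `ping_type`, and `seq` are hypothetical field names, and the sample records are invented.

```python
from collections import Counter

def find_duplicates(pings):
    """Return (duplicate document ids, duplicate (client, type, seq) keys)."""
    doc_counts = Counter(p["document_id"] for p in pings)
    seq_counts = Counter(
        (p["client_id"], p["ping_type"], p["seq"]) for p in pings
    )
    dup_docs = sorted(k for k, n in doc_counts.items() if n > 1)
    dup_seqs = sorted(k for k, n in seq_counts.items() if n > 1)
    return dup_docs, dup_seqs

def max_event_gap_ms(timestamps_ms):
    """Largest separation between consecutive event timestamps, in ms."""
    ordered = sorted(timestamps_ms)
    return max((b - a for a, b in zip(ordered, ordered[1:])), default=0)

# Invented sample: one exact resend (same id, same seq) and one ping
# with a fresh id that reuses a sequence number.
pings = [
    {"document_id": "d1", "client_id": "a", "ping_type": "baseline", "seq": 1},
    {"document_id": "d2", "client_id": "a", "ping_type": "baseline", "seq": 2},
    {"document_id": "d1", "client_id": "a", "ping_type": "baseline", "seq": 1},
    {"document_id": "d3", "client_id": "a", "ping_type": "baseline", "seq": 2},
]

dup_docs, dup_seqs = find_duplicates(pings)
gap = max_event_gap_ms([0, 50, 1000])
```

Checking the two kinds of duplicate separately is deliberate: a repeated document id suggests the same ping was sent twice, while a repeated sequence number under a new id suggests the client lost track of its own counter.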

Once we can establish confidence in the overall health of the data reporting mechanism we can start using it to report errors about itself:

Ping transmission should be quick (because pings are small). Assuming the ping transmission time is 0, how far away are the clients’ clocks from the server’s clock? (AKA “clock skew”. Turns out that mobile clients’ clocks are more reliable than desktop clients’ clocks, at least in the prerelease population. We’ll see what happens when we start measuring non-beta users.)
How many errors are reported by internal error-reporting metrics? How many send failures? How many times did the app try to record a string that was too long?
What measurements are in the ping? Are they only the ones we expect to see? Are they showing in reasonable proportions relative to each other and the number of clients and pings reporting them?
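Two of these checks are simple enough to sketch. The code below assumes each ping carries a client-side timestamp and that the server records when it arrived, both as ISO 8601 strings with an offset. It also assumes the list of documented metric names is available from the docs. The function names and sample values are invented for illustration.

```python
from datetime import datetime
from statistics import median

def clock_skew_seconds(client_time_iso, server_time_iso):
    """Server time minus client time, in seconds.

    With transmission time assumed to be 0, a positive value means the
    client's clock is behind the server's; a negative value means ahead.
    """
    client_time = datetime.fromisoformat(client_time_iso)
    server_time = datetime.fromisoformat(server_time_iso)
    return (server_time - client_time).total_seconds()

def unexpected_metrics(observed_names, documented_names):
    """Metric names seen in pings that the documentation does not list."""
    return sorted(set(observed_names) - set(documented_names))

# Invented (client timestamp, server timestamp) pairs.
samples = [
    ("2019-06-10T19:09:00+00:00", "2019-06-10T19:09:02+00:00"),
    ("2019-06-10T19:00:00+00:00", "2019-06-10T19:09:00+00:00"),
    ("2019-06-10T19:10:00+00:00", "2019-06-10T19:09:30+00:00"),
]
skews = [clock_skew_seconds(c, s) for c, s in samples]
typical_skew = median(skews)

extras = unexpected_metrics(["a.b", "x.y"], ["a.b"])
```

The median is used here rather than the mean because a handful of clients with wildly wrong clocks would otherwise dominate the summary.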

All these attempts to determine what is reasonable and what is correct depend on a strong foundation of documentation. I can read the code that collects the data and sends the pings… but that tells me what is correct relative to what is implemented, not what is correct relative to what is intended.

By validating to the documentation, to what is intended, we can not only find bugs in the code, we can find bugs in the docs. And a data collection system lives and dies on its documentation: it is in many ways a more important part of the “product” than the code.

At this point, aside from the “metrics” ping, which is awaiting validation after some fixes reach saturation in the population, Glean has passed all of these criteria acceptably. It still has a bit of a duplicate ping problem, but its clock skew and latency are much lower than Firefox Desktop’s. There are some outrageous clients sending dozens of pings over a period in which they should be sending a handful, but that might just be a test client whose values will disappear into the noise when the user population grows.

:chutten

https://chuttenblog.wordpress.com/2019/06/10/data-science-is-hard-validating-data-for-glean/