Chris H-C: Data Science is Hard: Validating Data for Glean |
Glean is a new library for collecting data in Mozilla products. It’s been shipping in Firefox Preview for a little while and I’d like to take a minute to talk about how I validated that it sends what we think it’s sending.
Validating new data collections in an existing system like Firefox Desktop Telemetry is a game of comparing against things we already know. We know that some percentage of data we receive is just garbage: bad dates, malformed records, attempts at buffer overflows and SQL injection. If the amount of garbage in the new collection is within the same overall amount of garbage we see normally, we count it as “good enough” and move on.
With new data collection from a new system like Glean coming from new endpoints like the reference browser and Firefox Preview, we’re given an opportunity to compare against the ideal. Maybe the correct number of failures is 0?
But what is a failure? What is acceptable behaviour?
We have an “events” ping in Glean: can the amount of time covered by the events’ timestamps ever exceed the amount of time covered by the ping? I didn’t think so, but apparently it’s an expected outcome when the events were restored from a previous session.
So how do you validate something that has unknown error states?
I started with a list of things any client-based network-transmitted data collection system had to have:
From there we can dive into validating specifics about the data collections:
Once we can establish confidence in the overall health of the data reporting mechanism we can start using it to report errors about itself:
All these attempts to determine what is reasonable and what is correct depend on a strong foundation of documentation. I can read the code that collects the data and sends the pings… but that tells me what is correct relative to what is implemented, not what is correct relative to what is intended.
By validating to the documentation, to what is intended, we can not only find bugs in the code, we can find bugs in the docs. And a data collection system lives and dies on its documentation: it is in many ways a more important part of the “product” than the code.
At this point, aside from the “metrics” ping which is awaiting validation after some fixes reach saturation in the population, Glean has passed all of these criteria acceptably. It still has a bit of a duplicate ping problem, but its clock skew and latency are much lower than Firefox Desktop’s. There are some outrageous clients sending dozens of pings over a period that they should be sending a handful, but that might just be a test client whose values will disappear into the noise when the user population grows.
:chutten
https://chuttenblog.wordpress.com/2019/06/10/data-science-is-hard-validating-data-for-glean/
Комментировать | « Пред. запись — К дневнику — След. запись » | Страницы: [1] [Новые] |