Roberto A. Vitillo: Longitudinal studies with Telemetry

Понедельник, 21 Марта 2016 г. 16:31 + в цитатник

Unified Telemetry (UT) replaced both Telemetry and FHR at the end of last year. FHR has been historically used to answer longitudinal questions, such as churn, while Telemetry has mainly been used for performance studies.

In UT-land, multiple self-contained submissions are generated for a profile over time in contrast to FHR for which submissions contained all historical records. With the new format, a typical longitudinal query on the raw data requires conceptually a big group-by on the profile ID over all submissions to recreate the history for each profile. To avoid the expensive grouping operation we are providing a longitudinal batch view of our Telemetry dataset.

The longitudinal view is logically organized as a table where rows represent profiles and columns the various metrics (e.g. startup time). Each field of the table contains a list of chronologically sorted values, one per Telemetry submission received for that profile. Even though each profile could have been represented as an array of structures with Parquet, ultimately we decided to represent it as a structure of arrays to deal with some inefficiencies in reading nested data from Spark.

The dataset is periodically regenerated from scratch, which allows us to apply non backward compatible changes to the schema and not worry about merging procedures.

The dataset can be accessed from Mozilla’s Spark clusters as a DataFrame:

frame = sqlContext.sql(SELECT * FROM longitudinal)

The view contains several thousand measures, which include all histograms and a large part of our scalar metrics stored in the various sections of Telemetry submissions.

frame.printSchema()

root
 |-- profile_id: string
 |-- normalized_channel: string
 |-- submission_date: array
 |    |-- element: string
 |-- build: array
 |    |-- element: struct
 |    |    |-- application_id: string
 |    |    |-- application_name: string
 |    |    |-- architecture: string
 |    |    |-- architectures_in_binary: string
 |    |    |-- build_id: string
 |    |    |-- version: string
 |    |    |-- vendor: string
 |    |    |-- platform_version: string
 |    |    |-- xpcom_abi: string
 |    |    |-- hotfix_version: string
...

A Jupyter notebook with example queries that use the longitudinal dataset is available on Spark clusters launched from a.t.m.o.