Mark Finkle: Fun with Telemetry: DIY User Analytics Lab in SQL
Firefox on Mobile has a system to collect telemetry data from user interactions. We created a simple event and session UI telemetry system, built on top of the core telemetry system. The core telemetry system has been mainly focused on performance and stability. The UI telemetry system is really focused on how people are interacting with the application itself.
Event-based data streams are commonly used to do user data analytics. We’re pretty fortunate to have streams of events coming from all of our distribution channels. I wanted to start doing different types of analyses on our data, but first I needed to build a simple system to get the data into a suitable format for hacking.
One of the best one-stop sources for a variety of user analytics is the Periscope Data blog. There are posts on active users, retention and churn, and lots of other cool stuff. The blog provides tons of SQL examples. If I could get the Firefox data into SQL, I’d be in a nice place.
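To give a flavor of what those posts cover, here is the kind of daily-active-users query I had in mind. It's only a sketch: the table and column names (events, client_id, ts) are my own illustration of the datasets described below, not anything from the Periscope posts.

```python
# A daily-active-users (DAU) query in the style of the Periscope Data posts.
# The events table and its columns (client_id, ts) are illustrative; the
# import scripts later in the post show how data like this lands in Postgres.
DAU_SQL = """
    SELECT date_trunc('day', ts)      AS day,
           count(DISTINCT client_id)  AS dau
    FROM   events
    GROUP  BY 1
    ORDER  BY 1
"""
```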
Collecting Data
My first step is performing a little ETL (well, the E & T parts) on the raw data using the Spark/Python framework for Mozilla Telemetry. I wanted to create two datasets: one for clients and one for events.
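The details of the Spark job are specific to Mozilla's telemetry tooling, but the shape of the transform is simple. Here is a minimal sketch, assuming the raw pings have already been loaded into a Spark RDD of parsed JSON dicts; the ping field names (clientId, UIMeasurements, and so on) are illustrative rather than the exact schema.

```python
# Minimal sketch of the transform step: flatten raw UI-telemetry pings into
# two flat datasets, one row per client and one row per UI event.
# Assumes `pings` is an RDD of parsed JSON dicts (loaded elsewhere with the
# Mozilla Spark/Telemetry tooling); field names here are illustrative.

def to_client_row(ping):
    # One row per client: id, channel, and the ping creation date.
    return (
        ping.get("clientId"),
        ping.get("application", {}).get("channel"),
        ping.get("creationDate"),
    )

def to_event_rows(ping):
    # One row per UI event carried in the ping.
    client_id = ping.get("clientId")
    for event in ping.get("payload", {}).get("UIMeasurements", []):
        yield (client_id, event.get("timestamp"),
               event.get("action"), event.get("method"))

clients = pings.map(to_client_row).distinct()
events = pings.flatMap(to_event_rows)

# Dump tab-separated text files for the import scripts to load later.
clients.map(lambda r: "\t".join(str(v) for v in r)).saveAsTextFile("clients")
events.map(lambda r: "\t".join(str(v) for v in r)).saveAsTextFile("events")
```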
Building a Database
I installed Postgres on a Mac Mini (powerful stuff, I know) and created my database tables. Since I was periodically collecting the data via my Spark scripts and couldn't guarantee I wouldn't re-collect data from previous jobs, I couldn't just bulk insert it. Instead, I wrote some simple Python scripts to quickly import the data (clients & events), making sure not to create any duplicates.
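As an illustration of that dedupe-on-import idea, here is a minimal sketch of such a script using psycopg2. The schema is invented for the example, and it leans on a unique constraint plus INSERT ... ON CONFLICT DO NOTHING (available since Postgres 9.5) rather than whatever the original scripts actually did.

```python
# Sketch of a duplicate-safe import of the events dataset into Postgres.
# Table and column names are illustrative, not the real schema.
import csv
import psycopg2

conn = psycopg2.connect(dbname="telemetry")
cur = conn.cursor()

# A unique constraint lets Postgres reject re-collected rows for us.
cur.execute("""
    CREATE TABLE IF NOT EXISTS events (
        client_id  text      NOT NULL,
        ts         timestamp NOT NULL,
        action     text,
        method     text,
        UNIQUE (client_id, ts, action, method)
    )
""")

with open("events.tsv") as f:
    for client_id, ts, action, method in csv.reader(f, delimiter="\t"):
        # ON CONFLICT DO NOTHING quietly skips rows we have already loaded,
        # so re-running a collection job can't create duplicates.
        cur.execute(
            """INSERT INTO events (client_id, ts, action, method)
               VALUES (%s, %s, %s, %s)
               ON CONFLICT DO NOTHING""",
            (client_id, ts, action, method),
        )

conn.commit()
cur.close()
conn.close()
```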
I decided to start with 30 days of data from our Nightly and Beta channels. Nightly was relatively small (~330K rows of events), but Beta was more significant (~18M rows of events).
Analyzing and Visualizing
Now that I had my data, I could start exploring. There are a lot of analysis/visualization/sharing tools out there. Many are commercial and have lots of features. I stumbled across a few open-source tools, including SQLPad.
Even though I wanted to use SQLPad as much as possible, I found myself spending most of my time in pgAdmin: debugging queries, using EXPLAIN to make queries faster, and setting up indexes were all easier there. Once I got the basics figured out, I was able to use SQLPad more efficiently. Below are some screenshots using the Nightly data:
Next Steps
Now that I have Firefox event data in SQL, I can start looking at retention, churn, active users, engagement and funnel analysis. Eventually, we want this process to be automated, with the data stored in Redshift (like a lot of other Mozilla data) and exposed via easy query/visualization/collaboration tools. We’re working with the Mozilla Telemetry & Data Pipeline teams to make that happen.
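As a first taste of that, here is a sketch of a simple week-over-week retention query, written against the same illustrative events table used in the import example above; the Periscope posts go much deeper, and the real schema and cohort definitions would differ.

```python
# Sketch: week-over-week retention from the events table -- of the clients
# active in a given week, what fraction showed up again the following week?
# Schema and column names match the illustrative import example above.
import psycopg2

RETENTION_SQL = """
    WITH weekly AS (
        SELECT DISTINCT client_id, date_trunc('week', ts) AS week
        FROM events
    )
    SELECT w1.week,
           count(DISTINCT w2.client_id)::float
             / count(DISTINCT w1.client_id) AS retention
    FROM weekly w1
    LEFT JOIN weekly w2
           ON w2.client_id = w1.client_id
          AND w2.week = w1.week + interval '1 week'
    GROUP BY w1.week
    ORDER BY w1.week
"""

conn = psycopg2.connect(dbname="telemetry")
cur = conn.cursor()
cur.execute(RETENTION_SQL)
for week, retention in cur.fetchall():
    print(week.date(), round(retention, 3))
conn.close()
```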
A big thanks to Roberto Vitillo and Mark Reid for the help in creating the Spark scripts, and Richard Newman for double-dog daring me to try this.
http://starkravingfinkle.org/blog/2016/02/fun-with-telemetry-diy-user-analytics-lab-in-sql/