Richard Newman: Syncing and storage on three platforms

Thursday, 24 December 2015, 21:31

As it’s Christmas, I thought I’d take a moment to write down my reflections on Firefox Sync’s iterations over the years. This post focuses on how they actually sync — not the UI, not the login and crypto parts, but how they decide that something has changed and what they do about it.

I’ve been working on Sync for more than five years now, on each of its three main client codebases: first desktop (JavaScript), then Android (built from scratch in Java), and now on iOS (in Swift).

Desktop’s overall syncing strategy is unchanged from its early life as Weave.

Partly as a result of Conway’s Law writ large — Sync shipped as an add-on, built by the Services team rather than the Firefox team, with essentially no changes to Firefox itself — and partly for good reasons, Sync was separate from Firefox’s storage components.

It uses Firefox’s observer notifications to observe changes, making a note of changed records in what it calls a Tracker.

This is convenient, but it has obvious downsides:

From an organizational perspective, it’s easy for developers to disregard changes that affect Sync, because the code that tracks changes is isolated. For example, desktop Sync still doesn’t behave correctly in the presence of fancy Firefox features like Clear Recent History, Clear Private Data, restoring bookmark backups, etc.
Sync doesn’t get observer notifications for all events. Most notably, bulk changes sometimes roll up or omit events, and it’s always possible for code to poke at databases directly, leaving Sync out of the loop. If a Places database is corrupt, or a user replaces it manually, Sync’s tracking will be wrong. This is almost inevitable when sync metadata doesn’t live with the data it tracks.
Sync doesn’t track actual changes; it tracks changed IDs. When a sync occurs, it goes to storage to get a current representation of the changed record. (If the record is missing, we assume it was deleted.) This makes it very difficult to do good conflict resolution.
In order to avoid cycles, Sync stops listening for events while it’s syncing. That means it misses any changes the user makes during a sync.
Similarly, it doesn’t see changes that happen before it registers its observers, e.g., during the first few seconds of using the browser.
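
To make the failure modes above concrete, here is a minimal sketch in Python. It is not the desktop code: the Tracker and Store classes and their methods are hypothetical stand-ins, and the only thing it shows is how an ID-only, observer-driven tracker loses changes that bypass the notifying API or that arrive while it is not listening.

```python
class Tracker:
    """Hypothetical stand-in for a Sync tracker: it remembers changed IDs only."""

    def __init__(self):
        self.changed_ids = set()
        self.listening = True

    def on_change(self, record_id):
        # Only the ID is noted; the old and new values are not recorded.
        if self.listening:
            self.changed_ids.add(record_id)


class Store:
    """Toy storage that notifies its observer when written through put()."""

    def __init__(self, tracker):
        self.rows = {}
        self.tracker = tracker

    def put(self, record_id, value):
        self.rows[record_id] = value
        self.tracker.on_change(record_id)


tracker = Tracker()
store = Store(tracker)

store.put("a", "first")          # observed
store.rows["b"] = "poked in"     # direct write: no notification is sent
tracker.listening = False        # a sync is running, so the tracker is paused
store.put("c", "during sync")    # missed
tracker.listening = True

print(sorted(tracker.changed_ids))  # ['a']
```

Three records changed, but only one would be uploaded on the next sync.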

Beyond the difficulties introduced by a reliance on observers, desktop Sync took some shortcuts ¹: it applies incoming records directly and non-transactionally to storage, so an interrupted sync leaves local storage in a partial state. That’s usually OK for unstructured data like history — it’ll try again on the next sync, and eventually catch up — but it’s a bad thing for something structured like bookmarks, and can still be surprising elsewhere (e.g., passwords that aren’t consistent across your various intranet pages, form fields that are mismatched so you get your current street address and your previous city and postal code).
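
A small sketch of why that matters for structured data, using Python's sqlite3 module. The bookmarks table, its columns, and the fail_at parameter are all invented for illustration; the point is only that committing record by record leaves whatever happened to arrive first.

```python
import sqlite3


def apply_one_by_one(conn, records, fail_at):
    """Insert and commit records individually; fail_at simulates an interruption."""
    for index, (guid, parent) in enumerate(records):
        if index == fail_at:
            raise RuntimeError("sync interrupted")
        conn.execute(
            "INSERT INTO bookmarks (guid, parent) VALUES (?, ?)", (guid, parent)
        )
        conn.commit()


def find_orphans(conn):
    """Return GUIDs whose parent folder is neither the root nor present locally."""
    rows = conn.execute(
        "SELECT guid, parent FROM bookmarks ORDER BY guid"
    ).fetchall()
    known = {guid for guid, _ in rows}
    return [guid for guid, parent in rows if parent != "root" and parent not in known]


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE bookmarks (guid TEXT PRIMARY KEY, parent TEXT)")

# The children happen to arrive before the folder that contains them.
incoming = [("child1", "folderA"), ("child2", "folderA"), ("folderA", "root")]
try:
    apply_one_by_one(conn, incoming, fail_at=2)
except RuntimeError:
    pass

print(find_orphans(conn))  # ['child1', 'child2']
```

The two children are now in local storage with no folder to live in until a later sync succeeds.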

During the last days of the Services team, Philipp, Greg, I, and others were rethinking how we performed syncs. We settled on a repository-centric approach: records were piped between repositories (remote or local), abstracting away the details of how a repository figured out what had changed, and giving us the leeway to move to a better internal structure.
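
The shape of that idea can be sketched in a few lines of Python. This is an assumption-laden toy, not the design that was written: the Repository class, fetch_since, store, and pipe are hypothetical names, and a simple version counter stands in for whatever change detection a real repository would use.

```python
class Repository:
    """Hypothetical repository: callers never see how changes are detected."""

    def __init__(self, name):
        self.name = name
        self.records = {}   # guid -> (payload, version at which it changed)
        self.version = 0

    def store(self, guid, payload):
        self.version += 1
        self.records[guid] = (payload, self.version)

    def fetch_since(self, version):
        """Yield (guid, payload) for records changed after the given version."""
        for guid, (payload, changed_at) in self.records.items():
            if changed_at > version:
                yield guid, payload


def pipe(source, sink, since):
    """Pipe changed records from source to sink; return the new high-water mark."""
    high_water = source.version
    for guid, payload in list(source.fetch_since(since)):
        sink.store(guid, payload)
    return high_water


remote = Repository("remote")
local = Repository("local")
remote.store("g1", {"title": "Mozilla"})
remote.store("g2", {"title": "Example"})

mark = pipe(remote, local, since=0)        # first sync moves both records
remote.store("g2", {"title": "Example (renamed)"})
mark = pipe(remote, local, since=mark)     # second sync moves only g2

print(mark, local.records["g2"][0]["title"])  # 3 Example (renamed)
```

Because pipe only talks to fetch_since and store, either side could switch to a different way of finding changes without the synchronizer noticing.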

That design never shipped on desktop, but it was the basis for our Sync implementation on Android.

Android presented some unique constraints. Conway’s Law applied again, albeit to a lesser extent, and the structure of the running code also had to abide by Android’s ContentProvider/SyncAdapter/Activity patterns.

Furthermore, Fennec was originally planning to support Android’s own internal bookmark and history storage, so its internal databases mirrored that schema. You can still see the fossilized remnants of that decision in the codebase today. When that plan was nixed, the schema was already starting to harden. The compromise we settled on was to use modification timestamps and deletion flags in Fennec’s content providers, and use those to extract changes for Sync in a repository model.

Using timestamps as the basis for tracking changes is a common error when developers hack together a synchronization system. They’re convenient, but client clocks are wrong surprisingly often, jump around, and lack granularity. Clocks from different devices shouldn’t be compared, but we do it anyway when reconciling conflicts. Still, it’s what we had to work with at the time.
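
Here is a rough Python sketch of the timestamp-and-deletion-flag scheme, and of one way a misbehaving clock breaks it. The history table and the modified and deleted column names are assumptions made for the example, not Fennec's actual schema.

```python
import sqlite3


def upsert(conn, guid, url, now):
    conn.execute(
        "INSERT OR REPLACE INTO history (guid, url, modified, deleted)"
        " VALUES (?, ?, ?, 0)",
        (guid, url, now),
    )


def delete(conn, guid, now):
    # Keep the row as a tombstone so the deletion can be uploaded later.
    conn.execute(
        "UPDATE history SET deleted = 1, modified = ? WHERE guid = ?", (now, guid)
    )


def changes_since(conn, last_sync):
    """Everything the local clock claims was touched after the last sync."""
    return conn.execute(
        "SELECT guid, deleted FROM history WHERE modified > ? ORDER BY guid",
        (last_sync,),
    ).fetchall()


conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE history (guid TEXT PRIMARY KEY, url TEXT,"
    " modified INTEGER NOT NULL, deleted INTEGER NOT NULL DEFAULT 0)"
)

upsert(conn, "a", "https://example.com/a", now=1000)
upsert(conn, "b", "https://example.com/b", now=1100)
last_sync = 1500

delete(conn, "a", now=2000)                           # picked up as a tombstone
upsert(conn, "c", "https://example.com/c", now=1200)  # clock jumped backwards

print(changes_since(conn, last_sync))  # [('a', 1)]
```

The new record "c" was written after the last sync, but its timestamp says otherwise, so it is never selected for upload.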

The end result is over-engineered and fundamentally flawed, and it still applies records directly to storage, but it works well enough. We have seen dramatically fewer bugs in Android Sync than we saw in desktop Sync between 2010 and 2012. I attribute some of that simply to the code having been written for production rather than being a Labs project (the desktop bookmark sync code was particularly flawed, and Philipp and I spent a lot of time making it better), some of it to lessons learned, and some of it to better languages and tooling: Java and Eclipse produce code with fewer silly bugs ² than JavaScript and Vim.

On iOS we had the opportunity to learn from the weaknesses in the previous two implementations.

The same team built the frontend, storage, and Sync, so we put logic and state in the right places. We track Sync-related metadata directly in storage. We can tightly integrate with bulk-deletion operations like Clear Private Data, and change tracking doesn’t rely on timestamps: it’s an integral part of making the change itself.

We also record enough data to do proper three-way merges, which avoids a swath of quiet data loss bugs that have plagued Sync over the years (e.g., recent password changes being undone).
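
For readers who have not met the term, a three-way merge compares both edited copies against their shared parent instead of against each other. The Python below is a generic field-by-field sketch, not the iOS implementation; the record fields are illustrative, and keeping the local value on a true conflict is an arbitrary choice made for the example.

```python
def three_way_merge(base, local, remote):
    """Merge two edited copies of a record against their shared parent.

    Returns (merged, conflicts). When both sides changed a field to different
    values, this sketch keeps the local value and reports the field name.
    """
    merged, conflicts = {}, []
    for key in sorted(set(base) | set(local) | set(remote)):
        b, l, r = base.get(key), local.get(key), remote.get(key)
        if l == r:
            value = l              # both sides agree (changed or not)
        elif l == b:
            value = r              # only the remote side changed
        elif r == b:
            value = l              # only the local side changed
        else:
            value = l              # both changed differently
            conflicts.append(key)
        if value is not None:
            merged[key] = value
    return merged, conflicts


base = {"hostname": "https://example.com", "password": "old", "timesUsed": 3}
local = {"hostname": "https://example.com", "password": "new", "timesUsed": 3}
remote = {"hostname": "https://example.com", "password": "old", "timesUsed": 4}

merged, conflicts = three_way_merge(base, local, remote)
print(merged["password"], merged["timesUsed"], conflicts)  # new 4 []
```

Without the shared parent, a merge that simply preferred the remote record would have put the old password back, which is the kind of quiet data loss described above.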

We incrementally apply chunks of records, downloaded in batches, so we rarely need to re-download anything in the case of mid-sync failures.
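
A toy version of that loop, in Python, might look like the following. Everything here is assumed for illustration: the per-record sequence numbers, the checkpoint stored in a dict, and the fail_on_batch parameter that simulates a dropped connection.

```python
def download_in_batches(server_records, batch_size, start_after=0):
    """Yield lists of (sequence, payload), resuming after a sequence number."""
    pending = [rec for rec in server_records if rec[0] > start_after]
    for i in range(0, len(pending), batch_size):
        yield pending[i:i + batch_size]


def sync(server_records, local, state, batch_size=2, fail_on_batch=None):
    """Apply each batch, then persist a checkpoint so a retry can resume."""
    batches = download_in_batches(server_records, batch_size, state["checkpoint"])
    for n, batch in enumerate(batches):
        if n == fail_on_batch:
            raise ConnectionError("network dropped")
        for seq, payload in batch:
            local[seq] = payload
        state["checkpoint"] = batch[-1][0]


server = [(1, "a"), (2, "b"), (3, "c"), (4, "d"), (5, "e")]
local, state = {}, {"checkpoint": 0}

try:
    sync(server, local, state, fail_on_batch=1)   # dies after the first batch
except ConnectionError:
    pass
print(state["checkpoint"], sorted(local))  # 2 [1, 2]

sync(server, local, state)                 # resumes from record 3
print(state["checkpoint"], sorted(local))  # 5 [1, 2, 3, 4, 5]
```

Only the batch that was in flight when the failure hit is fetched again; the records already applied stay applied.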

And we buffer downloaded records where appropriate, so the scary part of syncing — actually changing the database — can be done locally with offline data, even within a single transaction.
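
The buffering idea can be sketched with Python's sqlite3 module. The logins and logins_buffer tables are hypothetical, and the NULL password is just a convenient way to force a failure halfway through applying.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE logins (guid TEXT PRIMARY KEY, password TEXT NOT NULL)")
conn.execute("CREATE TABLE logins_buffer (guid TEXT PRIMARY KEY, password TEXT)")


def buffer_incoming(conn, records):
    """Network phase: stash downloaded records without touching live data."""
    with conn:
        conn.executemany(
            "INSERT OR REPLACE INTO logins_buffer (guid, password) VALUES (?, ?)",
            records,
        )


def apply_buffer(conn):
    """Local phase: move everything across in one transaction, or nothing."""
    rows = conn.execute(
        "SELECT guid, password FROM logins_buffer ORDER BY guid"
    ).fetchall()
    try:
        with conn:  # commits on success, rolls back if anything raises
            for guid, password in rows:
                conn.execute(
                    "INSERT OR REPLACE INTO logins (guid, password) VALUES (?, ?)",
                    (guid, password),
                )
            conn.execute("DELETE FROM logins_buffer")
        return True
    except sqlite3.IntegrityError:
        return False


def count(conn, table):
    return conn.execute(f"SELECT COUNT(*) FROM {table}").fetchone()[0]


buffer_incoming(conn, [("g1", "pw1"), ("g2", None)])   # g2 is a bad record
first_try = apply_buffer(conn)
print(first_try, count(conn, "logins"), count(conn, "logins_buffer"))  # False 0 2

buffer_incoming(conn, [("g2", "pw2")])                 # a corrected copy arrives
second_try = apply_buffer(conn)
print(second_try, count(conn, "logins"), count(conn, "logins_buffer"))  # True 2 0
```

The failed attempt leaves the live table exactly as it was, and nothing needs to be downloaded again because the buffer is still intact.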

Storage on iOS is significantly more involved as a result: we have sync_status columns on each table, and typically have two tables per datatype to track the original shared parent of a row. Bookmark sync is shaping up to involve six tables. But the behavior of the system is dramatically more predictable; this is a case of modeling essential complexity, not over-complicating. So far the bug rate is low, and our visibility into the interactions between parts of the code is good — for example, it’s just not possible for Steph to implement bulk deletions of logins without having to go through the BrowserLogins protocol, which does all the right flipping of change flags.
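
As a loose illustration of in-storage tracking, here is a Python and sqlite3 sketch in which the only write path flips the change flag as part of the write itself. The table names, the two sync_status values, and the Logins class are assumptions made for the example; the real iOS schema and protocol are more involved.

```python
import sqlite3

SYNCED, CHANGED = 0, 1   # hypothetical sync_status values


class Logins:
    """The only way to modify logins, so every write also flips the flag."""

    def __init__(self):
        self.conn = sqlite3.connect(":memory:")
        self.conn.execute(
            "CREATE TABLE local_logins (guid TEXT PRIMARY KEY, password TEXT,"
            " is_deleted INTEGER NOT NULL DEFAULT 0, sync_status INTEGER NOT NULL)"
        )
        # Second table: the last state known to be shared with the server.
        self.conn.execute(
            "CREATE TABLE mirror_logins (guid TEXT PRIMARY KEY, password TEXT)"
        )

    def set_password(self, guid, password):
        with self.conn:
            self.conn.execute(
                "INSERT OR REPLACE INTO local_logins"
                " (guid, password, is_deleted, sync_status) VALUES (?, ?, 0, ?)",
                (guid, password, CHANGED),
            )

    def remove_all(self):
        # Bulk deletion leaves tombstones and marks them changed in one statement.
        with self.conn:
            self.conn.execute(
                "UPDATE local_logins SET password = NULL, is_deleted = 1,"
                " sync_status = ?",
                (CHANGED,),
            )

    def pending_upload(self):
        return self.conn.execute(
            "SELECT guid, is_deleted FROM local_logins"
            " WHERE sync_status = ? ORDER BY guid",
            (CHANGED,),
        ).fetchall()

    def mark_uploaded(self):
        with self.conn:
            self.conn.execute(
                "DELETE FROM mirror_logins WHERE guid IN"
                " (SELECT guid FROM local_logins WHERE is_deleted = 1)"
            )
            self.conn.execute(
                "INSERT OR REPLACE INTO mirror_logins"
                " SELECT guid, password FROM local_logins"
                " WHERE is_deleted = 0 AND sync_status = ?",
                (CHANGED,),
            )
            self.conn.execute("DELETE FROM local_logins WHERE is_deleted = 1")
            self.conn.execute("UPDATE local_logins SET sync_status = ?", (SYNCED,))


logins = Logins()
logins.set_password("g1", "pw1")
logins.set_password("g2", "pw2")
before = logins.pending_upload()     # [('g1', 0), ('g2', 0)]
logins.mark_uploaded()

logins.remove_all()                  # e.g. a bulk clear of saved logins
after = logins.pending_upload()      # [('g1', 1), ('g2', 1)]
print(before, after)
```

Because remove_all is the only bulk-delete entry point, there is no way to clear the table without also producing the tombstones that the next sync uploads.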

In the future we’re hoping to see some of the work around batching, use of in-storage tracking flags, and three-way merge make it back to Android and eventually to desktop. Mobile first!

Notes:

¹ My feeling is that Weave was (at least from a practical standpoint) originally designed to sync two desktops with good network connections, using cheap servers that could die at any moment. That attitude doesn’t fit well with modern instant syncing between your phone, tablet, and laptop!

http://160.twinql.com/syncing-and-storage-on-three-platforms/