Since December 1st, 1975, by FAA mandate, no plane has been allowed to fly without a "Ground Proximity Warning System," or GPWS (or one of its successors).[1] For good reason too: it's been estimated that 75% of the fatalities just one year prior (1974) could have been prevented using the system.[2]
In a slew of case studies, reviewers reckoned that a GPWS may have prevented crashes by giving pilots additional time to act before they smashed into the ground. Often, the GPWS's signature "Whoop, Whoop: Pull Up!" would have sounded a full fifteen seconds before any other alarms triggered.[3]
Instruments like this are indispensable to aviation because pilots operate in an environment far outside the realm where human intuition is useful. Lacking augmentation, our bodies and minds are simply not suited to the task of flying airliners.
For the same reason, thick layers of instrumentation and early warning systems are necessary for managing technical infrastructure. Without proper tooling, system administrators, like pilots, often plow their vessels into the earth....
The St. Patrick's Day Massacre
Case in point: on Saint Patrick's Day we suffered two outages which could likely have been avoided with some additional alerts and a slightly modified deployment process.
The first outage was caused by the accidental removal of a variable from a config file which one of our utilities depends on. Our utilities are all managed by a dependency system called runner, and when any task fails the machine is prevented from doing work until it succeeds. This all-or-nothing behavior is correct, but should not lead to closed trees....
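For illustration, here is a minimal sketch of that all-or-nothing gating (the task names, paths, and retry delay are hypothetical; this is not runner's actual implementation):

    # Minimal sketch of runner-style "all or nothing" task gating.
    # Task names and paths are hypothetical; this is not runner's actual code.
    import subprocess
    import time

    TASKS = ["check_config", "clean_workspace", "clone_repo"]  # ordered tasks (illustrative)

    def run_task(name):
        """Run a task script and report success."""
        return subprocess.call(["/opt/tasks/" + name]) == 0

    def run_all(retry_delay=60):
        for name in TASKS:
            # A failing task keeps the machine out of the pool until a retry succeeds.
            while not run_task(name):
                print("task %s failed; no work until it passes, retrying in %ds" % (name, retry_delay))
                time.sleep(retry_delay)
        print("all tasks passed; machine may take work")

    if __name__ == "__main__":
        run_all()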
On our runner dashboards, the whole event looked like this (the smooth decline on the right is a fix being rolled out with ansible):

The second, and most severe, outage was caused by an insufficient wait time between retries upon failing to pull from our mercurial repositories.
There was a temporary disruption in service, and a large number of slaves failed to clone a repository. When this herd of machines began retrying the task, it became the equivalent of a DDoS attack.
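The usual fix for this kind of thundering herd is a randomized, growing delay between retries, so that failing machines don't hammer the repository in lockstep. Roughly something like this (function names and numbers are illustrative, not our actual retry code):

    # Exponential backoff with jitter: spreads retries out so a herd of failing
    # slaves doesn't retry in lockstep and DDoS the repository.
    # Illustrative sketch; parameter values are made up.
    import random
    import time

    def retry_with_backoff(operation, max_attempts=8, base=30, cap=1800):
        for attempt in range(max_attempts):
            try:
                return operation()
            except Exception:
                # Delay grows with each attempt, capped, with full jitter.
                delay = random.uniform(0, min(cap, base * 2 ** attempt))
                time.sleep(delay)
        raise RuntimeError("operation failed after %d attempts" % max_attempts)

    # e.g. retry_with_backoff(lambda: hg_clone(repo_url))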
From the repository's point of view, the explosion looked like this:

Then, from runner's point of view, the retrying task:

In both of these cases, despite having the data (via runner logging), we missed the opportunity to catch the problem before it caused system downtime. Furthermore, especially in the first case, we could have avoided the issue even earlier by testing our updates and rolling them out gradually.
Avoiding Future Massacres
After these fires went out, I started working on a RelEng version of the Ground Proximity Warning System, to keep us from crashing in the future. Here's the plan:
1.) Bug 1146974 - Add automated alerting for abnormally high retries (in runner)
In both of the above cases, we realized that things had gone amiss based on job backlog alerts. The problem is, once we have a large enough backlog to trigger those alarms, we're already hosed.
The good news is, the backlog is preceded by a spike in runner retries. Setting up better alerting here should buy us as much as an extra hour to respond to trouble.
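Whatever ends up performing the check, the logic itself is simple: count retries over a short window and complain when a threshold is crossed. A toy version (the log path, line format, window, and threshold are all made up for illustration):

    # Toy retry-spike check: count recent retries and complain above a threshold.
    # Log path, line format, window, and threshold are all made up.
    import time

    WINDOW_SECONDS = 15 * 60   # look at the last 15 minutes
    THRESHOLD = 100            # retries per window that should wake someone up

    def count_recent_retries(log_path):
        now = time.time()
        count = 0
        with open(log_path) as log:
            for line in log:
                # Assume each line starts with a unix timestamp, e.g.
                # "1426608000 runner task hg_clone attempt 3 failed, retrying"
                fields = line.split(None, 1)
                try:
                    ts = float(fields[0])
                except (ValueError, IndexError):
                    continue
                if now - ts <= WINDOW_SECONDS and "retrying" in line:
                    count += 1
        return count

    if __name__ == "__main__":
        retries = count_recent_retries("/var/log/runner-retries.log")
        if retries > THRESHOLD:
            print("ALERT: %d runner retries in the last 15 minutes" % retries)
        else:
            print("OK: %d runner retries in the last 15 minutes" % retries)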
We're already logging all task results to influxdb, but alerting via that data requires a custom nagios script. Instead of stringing that together, I opted to write runner output to syslog, where it's being aggregated by papertrail.
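Getting that output into syslog is straightforward with Python's standard logging module; a minimal sketch (the handler setup and message format are illustrative, not runner's exact configuration):

    # Send runner output to the local syslog daemon, which papertrail aggregates.
    # Illustrative sketch; runner's real logging setup may differ.
    import logging
    from logging.handlers import SysLogHandler

    logger = logging.getLogger("runner")
    logger.setLevel(logging.INFO)

    handler = SysLogHandler(address="/dev/log")  # local syslog socket on Linux
    handler.setFormatter(logging.Formatter("runner: %(message)s"))
    logger.addHandler(handler)

    # Every task result becomes a greppable line in papertrail.
    logger.info("task hg_clone attempt 3 failed, retrying")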
Using papertrail, I can grep for runner retries and build alarms from the data. Below is a screenshot of our runner data in the papertrail dashboard:

2.) Add automated testing and tiered roll-outs to golden AMI generation
Finally, when we update our slave images, the new version is not rolled out in a controlled fashion. Instead, as old images die (3 hours after the new image is released), new ones are launched on the latest version. Because of this, every deploy is an all-or-nothing affair.
By the time we notice a problem, almost all of our hosts are using the bad image and rolling back becomes a huge pain. We also do rollbacks by hand. Nein, nein, nein.
My plan here is to launch new instances with a weighted chance of picking up the latest AMI (a sketch of the weighted pick follows the schedule below). As we become more confident that things aren't breaking -- by monitoring the runner logs in papertrail/influxdb -- we can increase the percentage.
The new process will work like this:
00:00 - new AMI generated
00:01 - new slaves launch with a 12.5% chance of taking the latest version.
00:45 - new slaves launch with a 25% chance of taking the latest version.
01:30 - new slaves launch with a 50% chance of taking the latest version.
02:15 - new slaves launch with a 100% chance of taking the latest version.
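In code, the scheme above is just a weighted coin flip whose probability ramps up with the age of the new AMI. A rough sketch (the ramp table mirrors the schedule above; the AMI ids and launch helper are hypothetical):

    # Weighted choice between the latest AMI and the previous known-good one.
    # Hypothetical sketch: AMI ids and launch_instance() are made up.
    import random

    # (minutes since the new AMI was generated, chance of using it)
    RAMP = [(0, 0.125), (45, 0.25), (90, 0.5), (135, 1.0)]

    def latest_ami_chance(minutes_since_release):
        chance = 0.0
        for threshold, probability in RAMP:
            if minutes_since_release >= threshold:
                chance = probability
        return chance

    def pick_ami(latest_ami, previous_ami, minutes_since_release):
        # Dropping the chance to 0 (e.g. after a failed sanity check) rolls every
        # new launch back to the previous AMI with no human intervention.
        if random.random() < latest_ami_chance(minutes_since_release):
            return latest_ami
        return previous_ami

    # e.g. launch_instance(pick_ami("ami-new", "ami-previous", minutes_since_release=50))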
Lastly, if we want to roll back, we can just lower the percentage to zero while we figure things out. This also means that we can create sanity checks which roll back bad AMIs without any human intervention whatsoever.
The intention being that any failure within the first 90 minutes will trigger a rollback and keep the doors open....