
Doug Belshaw: Weeknote 13/2015

Saturday, 28 March 2015, 21:12

This week I’ve been:

Mozilla

Finishing off my part of the Hive Toronto Privacy badges project. GitHub repo here.
Submitting my final expenses and health & wellness invoices.
Writing about Web Literacy Map v1.5 (my last post on the Webmaker blog!)
Editing the Learning Pathways whitepaper. I’ll do as much as I can, but it’s up to Karen Smith to shepherd from this point forward!
Backing up everything.
Catching-up one to one with a few people.
Leaving Mozilla. I wrote about that here. Some colleagues gave me a GIF tribute send-off and dressed up an inflatable dinosaur in a party hat. Thanks guys!

Dynamic Skillset

Helping out DigitalMe with an event in Leeds around Open Badges. I wrote that up here.
Preparing my presentation for a keynote next week.
Collaborating on a proposal to scope out Open Badges for UK Scouting.
Replying to lots of people/organisations who’d like to work with me!
Finalising things for next week when I start working with City & Guilds for most (OK, nearly all) of my working week.
Getting to grips with Xero (which is what I’m using for accounting/invoicing)

Other

Recording Episode 3 of #tidetalks with Dai Barnes. Subscribe here.
Organising an ICT club in my children’s school for after Easter (loosely based on Mozilla’s draft Web Literacy Basics curriculum)
Upgrading my Sony Xperia Z Ultra using this ROM.
Putting together my new IKEA sit/stand desk. I also ordered a carpet for my office (it’s currently just got a laminate floor).
Writing about Look! Just Phone!

Next week I’m spending most of Monday with my family before heading off to London. I’ll be keynoting and running a workshop at the London College of Fashion conference on Tuesday. On Wednesday and Thursday I’ll be working from the City & Guilds offices, getting to know people and putting things into motion!

Image CC BY Kenny Louie

http://dougbelshaw.com/blog/2015/03/28/weeknote-13-2015/


Gregory Szorc: Notes from Facebook's Developer Infrastructure at Scale F8 Talk

Saturday, 28 March 2015, 14:45

Any time Facebook talks about technical matters I tend to listen. They have a track record of demonstrating engineering leadership in several spaces. And, unlike many companies that just talk, Facebook often gives others access to those ideas via source code and healthy open source projects. It's rare to see a company operating on the frontier of the computing field provide so much insight into their inner workings. You can gain so much by riding their coattails and following their lead instead of clinging to and cargo culting from the past.

The Facebook F8 developer conference was this past week. All the talks are now available online. I encourage you to glance through the list of talks and watch whatever is relevant to you. There's really a little bit for everyone.

Of particular interest to me is the Big Code: Developer Infrastructure at Facebook's Scale talk. This is highly relevant to my job role as Developer Productivity Engineer at Mozilla.

My notes for this talk follow.

"We don't want humans waiting on computers. We want computers waiting on humans." (This is the common theme of the talk.)

In 2005, Facebook was on Subversion. In 2007 moved to Git. Deployed a bridge so people worked in Git and had distributed workflow but pushed to Subversion under the hood.

New platforms over time. Server code, iOS, Android. One Git repo per platform/project -> 3 Git repos. Initially no code sharing, so no problem. Over time, code sharing between all repos. Lots of code copying and confusion as to what is where and who owns what.

Facebook is mere weeks away from completing their migration to consolidate the big three repos to a Mercurial monorepo. (See also my post about monorepos.)

Reasons:

Easier code sharing.
Easier large-scale changes. Rewrite the universe at once.
Unified set of tooling.

Facebook employees run >1M source control commands per day. >100k commits per week. VCS tool needs to be fast to prevent distractions and context switching, which slow people down.

Facebook implemented sparse checkout and shallow history in Mercurial. Necessary to scale distributed version control to large repos.

Quote from Google: "We're excited about the work Facebook is doing with Mercurial and glad to be collaborating with Facebook on Mercurial development." (Well, I guess the cat is finally out of the bag: Google is working on Mercurial. This was kind of an open secret for months. But I guess now it is official.)

Push-pull-rebase bottleneck: if you rebase and push and someone beats you to it, you have to pull, rebase, and try again. This gets worse as commit rate increases and people do needless legwork. Facebook has moved to server-side rebasing on push to mostly eliminate this pain point. (This is part of a still-experimental feature in Mercurial, which should hopefully lose its experimental flag soon.)

Starting 13:00 in we have a speaker change and move away from version control.

IDEs don't scale to Facebook scale. "Developing in Xcode at Facebook is an exercise in frustration." On average 3.5 minutes to open Facebook for iOS in Xcode. 5 minutes on average to index. Pegs the CPU and makes the machine not very responsive. 50 Xcode crashes per day across all Facebook iOS developers.

Facebook measures everything about tools. Mercurial operation times. Xcode times. Build times. Data tells them what tools and workflows need to be worked on.

Facebook believes IDEs are worth the pain because they make people more productive.

Facebook wants to support all editors and IDEs since people want to use whatever is most comfortable.

React Native changed things. Supported developing on multiple platforms, which no single IDE supports. People launched several editors and tools to do React Native development. People needed 4 windows to do development. That experience was "not acceptable." So they built their own IDE. Set of plugins on top of Atom. Not a fork. They like the hackable and web-y nature of Atom.

The demo showing iOS development looks very nice! Doing Objective-C, JavaScript, simulator integration, and version control in one window!

It can connect to remote servers and transparently save and deploy changes. It can also get real-time compilation errors and hints from the remote server! (Demo was with Hack. Not sure if other langs are supported. Having beefy central servers for e.g. Gecko development would be a fun experiment.)

Starting at 32:00 presentation shifts to continuous integration.

Number one goal of CI at Facebook is developer efficiency. We don't want developers waiting on computers to build and test diffs.

3 goals for CI:

High-signal feedback. Don't want developers chasing failures that aren't their fault. Wastes time.
Must provide rapid feedback. Developers don't want to wait.
Provide frequent feedback. Developers should know as soon as possible after they did something. (I think this refers to local feedback.)

Sandcastle is their CI system.

Diff lifecycle discussion.

Basic tests and lint run locally. (My understanding from talking with Facebookers is "local" often means on a Facebook server, not a local laptop. Machines at developers' fingertips are often dumb terminals.)

They appear to use code coverage to determine what tests to run. "We're not going to run a test unless your diff might actually have broken it."
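To make the coverage idea concrete, here is a minimal sketch of diff-based test selection. It is not Facebook's implementation: the `COVERAGE` map, the file names, and `select_tests` are all hypothetical, and the sketch assumes a per-test record of which source files each test touched on an earlier run.

```python
from typing import Dict, Iterable, Set

# Hypothetical coverage map: test name -> source files it exercised on a
# previous run. This is an assumption for illustration, not Sandcastle data.
COVERAGE: Dict[str, Set[str]] = {
    "test_login": {"auth/session.py", "auth/tokens.py"},
    "test_feed": {"feed/rank.py", "feed/render.py"},
    "test_profile": {"profile/view.py", "auth/session.py"},
}


def select_tests(changed_files: Iterable[str],
                 coverage: Dict[str, Set[str]]) -> Set[str]:
    """Return the tests whose recorded coverage overlaps the changed files."""
    changed = set(changed_files)
    # A test is selected only if it touched at least one changed file.
    return {test for test, files in coverage.items() if files & changed}
```

A diff touching only `auth/session.py` would select `test_login` and `test_profile` and skip `test_feed`, which is the "don't run a test unless your diff might have broken it" behavior described in the talk.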

They run flaky tests less often.

They run slow tests less often.

Goal is to get feedback to developers in under 10 minutes.

If they run fewer tests and get back to developers quicker, things are less likely to break than if they run more tests but take longer to give feedback.

They also want feedback quickly so reviewers can see results at review time.

They use WebDriver heavily. Love the cross-platform nature of WebDriver.

In addition to test results, performance and size metrics are reported.

They have a "Ship It" button on the diff.

Landcastle handles landing diff.

"It is not OK at Facebook to land a diff without using Landcastle." (Read: developers don't push directly to the master repo.)

Once Landcastle lands something, it runs tests again. If an issue is found, a task is filed. Task can be "push blocking." Code won't ship to users until the "push blocking" issue is resolved. (Tweets confirm they do backouts "fairly aggressively." A valid resolution to a push blocking task is to back out. But fixing forward is fine as well.)

After a while, branch cut occurs. Some cherry picks onto release branches.

In addition to diff-based testing, they do continuous testing runs. Much more comprehensive. No time restrictions. Continuous runs on master and release candidate branches. Auto bisect to pin down regressions.
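The talk did not go into how the auto bisect works, so the following is only a generic sketch of bisection over an ordered commit range. The function name `first_bad` and the `is_good` callback are hypothetical, and the sketch assumes the first commit is known good, the last is known bad, and the failure is not flaky.

```python
from typing import Callable, Sequence


def first_bad(commits: Sequence[str], is_good: Callable[[str], bool]) -> str:
    """Binary-search an ordered commit range for the first failing commit.

    Assumes commits[0] is known good, commits[-1] is known bad, and the
    failure persists in every commit after the one that introduced it.
    """
    lo, hi = 0, len(commits) - 1  # invariant: commits[lo] good, commits[hi] bad
    while hi - lo > 1:
        mid = (lo + hi) // 2
        if is_good(commits[mid]):
            lo = mid  # regression landed after mid
        else:
            hi = mid  # regression landed at or before mid
    return commits[hi]
```

The number of test runs grows with the logarithm of the range length, which is why bisection stays affordable even on a branch taking a very large number of commits per week.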

Sandcastle processes >1000 test results per second. 5 years of machine work per day. Thousands of machines in 5 data centers.

They started with buildbot. Single master. Hit scaling limits of single thread single master. Master could not push work to workers fast enough. Sandcastle has distributed queue. Workers just pull jobs from distributed queue.
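The push-versus-pull distinction can be sketched in a few lines. This is an in-process stand-in, not Sandcastle: Python's `queue.Queue` plays the role of the distributed queue, threads play the role of worker machines, and `run_pull_workers` is a hypothetical name.

```python
import queue
import threading
from typing import Callable, List


def run_pull_workers(jobs: List[int], handler: Callable[[int], int],
                     n_workers: int = 4) -> List[int]:
    """Workers pull jobs from a shared queue; no master pushes work to them."""
    pending: "queue.Queue[int]" = queue.Queue()
    for job in jobs:
        pending.put(job)

    results: List[int] = []
    lock = threading.Lock()

    def worker() -> None:
        while True:
            try:
                job = pending.get_nowait()  # an idle worker asks for work
            except queue.Empty:
                return  # queue drained, so this worker exits
            out = handler(job)
            with lock:
                results.append(out)

    threads = [threading.Thread(target=worker) for _ in range(n_workers)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return results
```

Because each worker takes the next job only when it is free, adding workers adds throughput without a single dispatcher having to keep up with all of them.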

"High-signal feedback is critical." "Flaky failures erode developer confidence." "We need developers to trust Sandcastle."

Extremely careful separating infra failures from other failures. Developers don't see infra failures. Infra failures only reported to Sandcastle team.

Bots look for flaky tests. Stress test individual tests. Run tests in parallel with themselves. Goal: developers don't see flaky tests.
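A stress-testing bot of this kind could be approximated as below. This is a guess at the general shape, not Facebook's tooling: `classify` is a hypothetical name, and the sketch reruns the test sequentially against unchanged code, whereas the talk mentions running tests in parallel with themselves.

```python
from typing import Callable


def classify(test: Callable[[], bool], runs: int = 50) -> str:
    """Run one test repeatedly against unchanged code and classify it."""
    passes = sum(1 for _ in range(runs) if test())
    if passes == runs:
        return "stable-pass"
    if passes == 0:
        return "stable-fail"
    # Mixed results on identical code mean the signal cannot be trusted.
    return "flaky"
```

Anything classified as flaky could then be quarantined or run less often, so that developers never see its failures on their diffs.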

There is a "not my fault" button that developers can use to report bad signals.

"Whatever the scale of your engineering organization, developer efficiency is the key thing that your infrastructure teams should be striving for. This is why at Facebook we have some of our top engineers working on developer infrastructure." (Preach it.)

Excellent talk. Mozillians doing infra work or who are in charge of head count for infra work should watch this video.

Update 2015-03-28 21:35 UTC - Clarified some bits in response to new info Tweeted at me. Added link to my monorepos blog post.

http://gregoryszorc.com/blog/2015/03/28/notes-from-facebook's-developer-infrastructure-at-scale-f8-talk


Benjamin Kerensa: Calling Out OkCupid

Saturday, 28 March 2015, 01:15

So the other day, Indiana’s governor signed into law a bill passed by the Republican-controlled legislature called the Religious Freedom Restoration Act. The reality of this bill is it has nothing to do with freedom of religion and everything to do with legalizing discrimination.

Anyways, to the point, I hate to open a can of worms but when I heard this news I thought back to this same time last year and remembered how gung ho OkCupid was over Mozilla’s appointment of Brendan Eich because of his personal beliefs and that they ultimately decided to block all Firefox users.

I don’t really think OkCupid should block Indiana but their lack of even a public tweet or statement in opposition to this legislation leads me back to my original conclusion that they were just riding the media train for their own benefit and not because they support the LGBT community.

If you are going to be about supporting the LGBT community, try to at least be consistent in that support and not just do it when it will make you look good in the media!

http://feedproxy.google.com/~r/BenjaminKerensaDotComMozilla/~3/k4TK_WV9XHA/calling-out-okcupid


Air Mozilla: Testday Regional en Español

Friday, 27 March 2015, 23:00

Testday Regional en Español. Spanish: The Mozilla Hispano QA team will give all Spanish-speaking users an introduction to the Quality Control area, as well as...

https://air.mozilla.org/testday-regional-en-espanol/


Mark Surman: MoFo March 2015 Board Meeting

Friday, 27 March 2015, 22:44

What’s happening at the Mozilla Foundation? This post contains the presentation slides from our recent Board Meeting, plus an audio interview that Matt Thompson did with me last week. It provides highlights from 2014, a brief summary of Mozilla’s 2015 plan and a progress report on what we’ve achieved over the past three months.

I’ve also written a brief summary of notes from the slides and interview below if you want a quick scan. These are also posted on the Webmaker blog.

What we did in 2014

Grew contributors and ground game. (10,077 active contributors total.)
Prototyped new Webmaker mobile product
Expanded community programs by 3x

Mozilla’s 2015 Plan

Mozilla-wide goals: grow long-term relationships that

https://commonspace.wordpress.com/2015/03/27/mozilla-foundation-march-2015-board-meeting/


William Lachance: Perfherder update: Summary series drilldown

Friday, 27 March 2015, 22:24

Just wanted to give another quick Perfherder update. Since the last time, I’ve added summary series (which is what Graphserver shows you), so we now have (in theory) the best of both worlds when it comes to Talos data: aggregate summaries of the various suites we run (tp5, tart, etc), with the ability to dig into individual results as needed. This kind of analysis wasn’t possible with Graphserver and I’m hopeful this will be helpful in tracking down the root causes of Talos regressions more effectively.

Let’s give an example of where this might be useful by showing how it can highlight problems. Recently we tracked a regression in the Customization Animation Tests (CART) suite from the commit in bug 1128354. Using Mishra Vikas’s new “highlight revision mode” in Perfherder (combined with the revision hash when the regression was pushed to inbound), we can quickly zero in on the location of it:

It does indeed look like things ticked up after this commit for the CART suite, but why? By clicking on the datapoint, you can open up a subtest summary view beneath the graph:

We see here that it looks like the 3-customize-enter-css.all.TART entry ticked up a bunch. The related test 3-customize-enter-css.half.TART ticked up a bit too. The changes elsewhere look minimal. But is that a trend that holds across the data over time? We can add some of the relevant subtests to the overall graph view to get a closer look:

As is hopefully obvious, this confirms that the affected subtest continues to hold its higher value while another test just bounces around more or less in the range it was before.

Hope people find this useful! If you want to play with this yourself, you can access the perfherder UI at http://treeherder.mozilla.org/perf.html.

http://wrla.ch/blog/2015/03/perfherder-update-summary-series-drilldown/


Monty Montgomery: Today's WTF Moment: A Competing HEVC Licensing Pool

Friday, 27 March 2015, 21:35

Had this happened next week, I'd have thought it was an April Fools' joke.

Out of nowhere, a new patent licensing group just announced it has formed a second, competing patent pool for HEVC that is independent of MPEG LA. And they apparently haven't decided what their pricing will be... maybe they'll have a fee structure ready in a few months.

Video on the Net (and let's be clear-- video's future is the Net) already suffers endless technology licensing problems. And the industry's solution is apparently even more licensing.

In case you've been living in a cave, Google has been trying to establish VP9 as a royalty- and strings-free alternative (new version release candidate just out this week!), and NetVC, our own next-next-generation royalty-free video codec, was just conditionally approved as an IETF working group on Tuesday and we'll be submitting our Daala codec as an input to the standardization process. The biggest practical question surrounding both efforts is 'how can you possibly keep up with the MPEG behemoth'?

Apparently all we have to do is stand back and let the dominant players commit suicide while they dance around Schroedinger's Cash Box.

http://xiphmont.livejournal.com/66047.html


Patrick McManus: Opportunistic Encryption For Firefox

Friday, 27 March 2015, 19:06

Firefox 37 brings more encryption to the web through opportunistic encryption of some http:// based resources. It will be released the week of March 31st.

OE provides unauthenticated encryption over TLS for data that would otherwise be carried via clear text. This creates some confidentiality in the face of passive eavesdropping, and also provides you much better integrity protection for your data than raw TCP does when dealing with random network noise. The server setup for it is trivial.

These are indeed nice bonuses for http:// - but it still isn't as nice as https://. If you can run https you should - full stop. Don't make me repeat it :) Only https protects you from active man-in-the-middle attackers.

But if you have a long tail of legacy content that you cannot yet get migrated to https, commonly due to mixed-content rules and interactions with third parties, OE provides a mechanism for an encrypted transport of http:// data. That's a strict improvement over the cleartext alternative.

Two simple steps to configure a server for OE

Install a TLS based h2 or spdy server on a separate port. 443 is a good choice :). You can use a self-signed certificate if you like because OE is not authenticated.
Add a response header Alt-Svc: h2=":443" or spdy/3.1 if you are using a spdy enabled server like nginx.
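As a rough illustration of the second step, here is a sketch of a cleartext server that advertises the alternative service. Everything named here is hypothetical (`alt_svc_value`, `CleartextHandler`, `make_server`, and port 8080 standing in for port 80), and the sketch assumes a TLS-based h2 server is already running separately on port 443. The optional `ma` (max age) parameter comes from the Alt-Svc Internet-Draft; check the draft for its exact semantics.

```python
from http.server import BaseHTTPRequestHandler, HTTPServer
from typing import Optional


def alt_svc_value(port: int = 443, protocol: str = "h2",
                  max_age: Optional[int] = None) -> str:
    """Build an Alt-Svc header value such as h2=":443"."""
    value = f'{protocol}=":{port}"'
    if max_age is not None:
        value += f"; ma={max_age}"  # how long the client may keep the mapping
    return value


class CleartextHandler(BaseHTTPRequestHandler):
    """Answers http:// requests and advertises the encrypted alternative."""

    def do_GET(self) -> None:
        body = b"hello over cleartext\n"
        self.send_response(200)
        self.send_header("Alt-Svc", alt_svc_value())
        self.send_header("Content-Type", "text/plain")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)


def make_server(port: int = 8080) -> HTTPServer:
    """Create (but do not start) the cleartext server; 8080 stands in for 80."""
    return HTTPServer(("127.0.0.1", port), CleartextHandler)
```

Calling `make_server().serve_forever()` starts it. In a real deployment the header would more likely be added in the existing web server's configuration than in application code.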

When the browser consumes that response header it will start to verify the fact that there is an HTTP/2 service on port 443. When a session with that port is established it will start routing the requests it would normally send in cleartext to port 80 onto port 443 with encryption instead. There will be no delay in responsiveness because the new connection is fully established in the background before being used. If the alternative service (port 443) becomes unavailable or cannot be verified Firefox will automatically return to using cleartext on port 80. Clients that don't speak the right protocols just ignore the header and continue to use port 80.

This mapping is saved and used in the future. It is important to understand that while the transaction is being routed to a different port the origin of the resource hasn't changed (i.e. if the cleartext origin was http://www.example.com:80 then the origin, including the http scheme and the port 80, is unchanged even if it is routed to port 443 over TLS). OE is not available with HTTP/1 servers because that protocol does not carry the scheme as part of each transaction, which is a necessary ingredient for the Alt-Svc approach.

You can control how long the Alt-Svc mappings last, along with some other details. The Internet-Draft is helpful as a reference. As the technology matures we will be tracking it; the recent HTTP working group meeting in Dallas decided this was ready to proceed to last call status in the working group.

http://bitsup.blogspot.com/2015/03/opportunistic-encryption-for-firefox.html


Geoffrey MacDougall: Me+Next

Friday, 27 March 2015, 18:37

I’ve let my long-time friend and Executive Director, Mark Surman, know that April 10th will be my last day as an employee of Mozilla.

The last 5 years have been an amazing ride. I’m proud of what we’ve accomplished.

There are learning, fundraising, and advocacy programs where there weren’t before. We’re empowering hundreds of thousands of people to teach each other the web. We’ve built a $15M/y fundraising program from scratch. And we’ve helped Mozilla find its voice again, playing a lead role in the most significant grassroots policy victory in a generation and the largest ever in telecommunications: the battle for net neutrality.

I’m grateful to Mark and Mitchell Baker for the opportunity and trust to help build something great, to my colleagues for their focus and dedication, and to all of Mozilla for fighting the good fight.

While the 10th will be my last day as an employee, I’ll be around until the end of June as a consultant, helping with the transition of my portfolio to new leadership. I’ll announce my new home closer to that time.

For now, as always, once a Mozillian always a Mozillian.

Thanks again to all of you. I’m looking forward to seeing what we accomplish next.


http://intangible.ca/2015/03/27/menext/


Mike Conley: Things I’ve Learned This Week (March 23 – 27, 2015)

Friday, 27 March 2015, 18:15

“Things I learned this week” is my favorite section of our weekly team meeting.

— Margaret Leibovic (@mleibovic) March 20, 2015

This is the first post in a weekly series, where I’m going to attempt to distill down my week into some lessons or facts I’ve picked up. Maybe they’ll be interesting to others. We’ll see.

Gecko Media Plugins are used for WebRTC (the Open H.264 encoding stuff runs inside a GMP), and are also going to be used to hold CDMs for EME. That’s a lot of TLAs!¹
This little notch I saw on the caret on my development build was because I had bidi.browser.ui set to true for some reason. It’s the “bidi caret”:
People hacking on platform are supposed to avoid using the NS_ENSURE_* macros, according to this.² I originally learned this by reading cpearce’s review of a patch.

So let’s see if I can keep this up for a few weeks. Maybe I’ll get a collection of useful stuff by the end of the experiment!

¹ Three Letter Acronyms

http://mikeconley.ca/blog/2015/03/27/things-ive-learned-this-week-march-23-27-2015/


Mike Conley: The Joy of Coding (Episode 7): Code review, and a Regression

Friday, 27 March 2015, 18:03

In this episode, I started with some code review. I was reviewing a patch to make the Findbar (particularly, the Find As You Type feature) e10s-friendly.

With that review out of the way, I had to swap a bunch of information about the plugin crash UI for e10s in my head – and in particular, some non-determinism that we have to handle. I explained that stuff (and hopefully didn’t spend too much time on it).

Then, I showed how far I’d gotten with the plugin crash UI for e10s. I was able to submit a crash report, but I found I wasn’t able to type into the comment text area.

After a while, I noticed that I couldn’t type into the comment text area on Nightly, even without my patch. And then I reproduced it in Aurora. And then in Beta. Luckily, I couldn’t reproduce it in Release – but with Beta transitioning to Release in only a few days, I didn’t have a lot of time to get a bug on file to shine some light on it.

Luckily, our brilliant Steven Michaud was on the case, and has just landed a patch to fix this. Talk about fast work!

Episode Agenda

References:
Bug 1133981 – [e10s] Stop sending unsafe CPOWs after the findbar has been closed in a remote browser

Bug 1110887 – With e10s, plugin crash submit UI is broken – Notes

Bug 1147521 – Cannot type into comment area of plugin crash UI

http://mikeconley.ca/blog/2015/03/27/the-joy-of-coding-episode-7-code-review-and-a-regression/


Mozilla Reps Community: Reps Weekly Call – March 26th 2015

Friday, 27 March 2015, 16:02

Last Thursday we had our weekly call about the Reps program, where we talk about what’s going on in the program and what Reps have been doing during the last week.


Summary

MozBalkans Applications.
Changes on Event and Report forms.
Community Education Update.
QA News and Events.
Firefox App Training.
HackOnMDN at Berlin-March-2015.

Detailed notes

AirMozilla video

Don’t forget to comment about this call on Discourse and we hope to see you next week!

https://blog.mozilla.org/mozillareps/2015/03/27/reps-weekly-call-march-26th-2015/


Doug Belshaw: Today is my last day at Mozilla

Friday, 27 March 2015, 09:50

TL;DR: I’m leaving Mozilla as a paid contributor because, as of next week, I’ll be a full-time consultant! I’ll write about that in a separate blog post.

Around four years ago, I stumbled across a project that the Mozilla Foundation was running with P2PU. It was called ‘Open Badges’ and it really piqued my interest. I was working in Higher Education at the time and finishing off my doctoral thesis. The prospect of being able to change education by offering a different approach to credentialing really intrigued me.

I started investigating further, blogging about it, and started getting more people interested in the Open Badges project. A few months later, the people behind MacArthur’s Digital Media and Learning (DML) programme asked me to be a judge for the badges-focused DML Competition. While I was in San Francisco for the judging process I met Erin Knight, then Director of Learning at Mozilla, in person. She asked if I was interested in working on her team. I jumped at the chance!

During my time at Mozilla I’ve worked on Open Badges, speaking and running keynotes at almost as many events as there are weeks in the year. I’ve helped bring a Web Literacy Map (originally ‘Standard’) into existence, and I’ve worked on various projects and with people who have changed my outlook on life. I’ve never come across a community with such a can-do attitude.

This June would have marked three years as a paid contributor to the Mozilla project. It was time to move on so as not to let the grass grow under my feet. Happily, because Mozilla is a global non-profit with a strong community that works openly, I’ll still be a volunteer contributor. And because of the wonders of the internet, I’ll still have a strong connection to the network I built up over the last few years.

I plan to write more about the things I learned and the things I did at Mozilla over the coming weeks. For now, I just want to thank all of the people I worked with over the past few years, and wish them all the best for the future. As of next week I’ll be a full-time consultant. More about that in an upcoming post!

http://dougbelshaw.com/blog/2015/03/27/last-day-at-mozilla/


Mike Taylor: Pasting into contenteditable elements in Firefox for Android, ~wowowowowow~

Friday, 27 March 2015, 08:00

Bug 783846 landing in Nightly means that Firefox for Android users—starting in version 39—can finally paste into contenteditable elements, which is huge news for the mobile-html5-responsive-shadow-and-or-virtual-dom-contenteditable apps crowd, developers and users alike.

"That's amazing! Can I cut text from contenteditable elements too?", you're asking yourself. The answer is um no, because we haven't fixed that yet. But if you wanna help me get that working come on over to bug 1112276 and let's party. Or write some JS and fix the bug or whatever.

Fun fact, when I told my manager I was working on this bug in my spare time he asked, "…Why?".

https://miketaylr.com/posts/2015/03/contenteditable-paste.html


Nicholas Nethercote: On vacation for a month

Friday, 27 March 2015, 06:14

I’m taking a month of vacation. Today is my last working day for March, and I will be back on April 30th. While I won’t be totally incommunicado, for the most part I won’t be reading email. While I’m gone, any management-type inquiries can be passed on to Naveed Ihsannullah.

https://blog.mozilla.org/nnethercote/2015/03/27/on-vacation-for-a-month/


Air Mozilla: VR Cinema Meetup #3

Friday, 27 March 2015, 05:00

VR Cinema Meetup #3 VR Cinema's third event will showcase new and exciting films for virtual reality. Come see these groundbreaking projects before they're released to the public and...

https://air.mozilla.org/vr-cinema-meetup-3/


Morgan Phillips: Whoop, Whoop: Pull Up!

Friday, 27 March 2015, 02:45

Since December 1st, 1975, by FAA mandate, no plane has been allowed to fly without a "Ground Proximity Warning System" (GPWS) or one of its successors.[1] For good reason too, as it's been figured that 75% of the fatalities just one year prior (1974) could have been prevented using the system.[2]

In a slew of case studies, reviewers reckoned that a GPWS may have prevented crashes by giving pilots additional time to act before they smashed into the ground. Often, the GPWS's signature "Whoop, Whoop: Pull Up!" would have sounded a full fifteen seconds before any other alarms triggered.[3]

Instruments like this are indispensable to aviation because pilots operate in an environment outside of any realm where human intuition is useful. Lacking augmentation, our bodies and minds are simply not suited to the task of flying airliners.

For the same reason, thick layers of instrumentation and early warning systems are necessary for managing technical infrastructure. Like pilots, without proper tooling, system administrators often plow their vessels into the earth....

The St. Patrick's Day Massacre

Case in point, on Saint Patrick's Day we suffered two outages which could likely have been avoided via some additional alerts and a slightly modified deployment process.

The first outage was caused by the accidental removal of a variable from a config file which one of our utilities depends on. Our utilities are all managed by a dependency system called runner, and when any task fails the machine is prevented from doing work until it succeeds. This all-or-nothing behavior is correct, but should not lead to closed trees....

On our runner dashboards, the whole event looked like this (the smooth decline on the right is a fix being rolled out with ansible):

The second, and most severe, outage was caused by an insufficient wait time between retries upon failing to pull from our mercurial repositories.

There was a temporary disruption in service, and a large number of slaves failed to clone a repository. When this herd of machines began retrying the task it became the equivalent of a DDoS attack.

From the repository's point of view, the explosion looked like this:

Then, from runner's point of view, the retrying task:

In both of these cases, despite having the data (via runner logging), we missed the opportunity to catch the problem before it caused system downtime. Furthermore, especially in the first case, we could have avoided the issue even earlier by testing our updates and rolling them out gradually.
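For the second outage, the usual remedy for a retrying herd is exponential backoff with random jitter. The sketch below shows that general technique; it is not the change that was made to runner, and `retry_delay`, the 5 second base, and the 600 second cap are hypothetical values chosen for illustration.

```python
import random
from typing import Optional


def retry_delay(attempt: int, base: float = 5.0, cap: float = 600.0,
                rng: Optional[random.Random] = None) -> float:
    """Seconds to wait before retry number `attempt` (0-based).

    Uses "full jitter": a uniform draw between 0 and a capped exponential
    ceiling, so machines that failed together do not retry together.
    """
    rng = rng if rng is not None else random.Random()
    ceiling = min(cap, base * (2 ** attempt))
    return rng.uniform(0.0, ceiling)
```

With many slaves failing at the same moment, the random draw spreads their next attempts across the whole interval instead of sending them all back to the repository at once.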

Avoiding Future Massacres

After these fires went out, I started working on a RelEng version of the Ground Proximity Warning System, to keep us from crashing in the future. Here's the plan:

1.) Bug 1146974 - Add automated alerting for abnormally high retries (in runner).

In both of the above cases, we realized that things had gone amiss based on job backlog alerts. The problem is, once we have a large enough backlog to trigger those alarms, we're already hosed.

The good news is, the backlog is preceded by a spike in runner retries. Setting up better alerting here should buy us as much as an extra hour to respond to trouble.

We're already logging all task results to influxdb, but alerting via that data requires a custom nagios script. Instead of stringing that together, I opted to write runner output to syslog where it's being aggregated by papertrail.
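The alerting rule itself is simple enough to sketch independently of any log service. The code below does not use papertrail or influxdb; `retry_alerts`, the `"retry"` marker in the log line, and the window and threshold defaults are all hypothetical, and events are assumed to arrive in timestamp order.

```python
from collections import deque
from typing import Deque, Iterable, List, Tuple


def retry_alerts(events: Iterable[Tuple[float, str]], window: float = 300.0,
                 threshold: int = 20) -> List[float]:
    """Return timestamps where retries in the last `window` seconds first
    exceed `threshold`. Events are (timestamp, log line) in time order."""
    recent: Deque[float] = deque()
    alerts: List[float] = []
    alarmed = False
    for ts, line in events:
        if "retry" not in line:  # hypothetical marker for a retried task
            continue
        recent.append(ts)
        # Drop retries that have fallen out of the sliding window.
        while recent and ts - recent[0] > window:
            recent.popleft()
        if len(recent) > threshold:
            if not alarmed:  # fire once per spike, not once per log line
                alerts.append(ts)
                alarmed = True
        else:
            alarmed = False
    return alerts
```

The point of counting retries rather than backlog is timing: the retry spike shows up well before the job queue is deep enough to trip the existing alarms.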

Using papertrail, I can grep for runner retries and build alarms from the data. Below is a screenshot of our runner data in the papertrail dashboard:

2.) Add automated testing, and tiered roll-outs to golden ami generation

Finally, when we update our slave images the new version is not rolled out in a precise fashion. Instead, as old images die (3 hours after the new image releases) new ones are launched on the latest version. Because of this, every deploy is an all-or-nothing affair.

By the time we notice a problem, almost all of our hosts are using the bad instance and rolling back becomes a huge pain. We also do rollbacks by hand. Nein, nein, nein.

My plan here is to launch new instances with a weighted chance of picking up the latest ami. As we become more confident that things aren't breaking -- by monitoring the runner logs in papertrail/influxdb -- we can increase the percentage.

The new process will work like this:

Lastly, if we want to roll back, we can just lower the percentage down to zero while we figure things out. This also means that we can create sanity checks which roll back bad amis without any human intervention whatsoever.

The intention is that any failure within the first 90 minutes will trigger a rollback and keep the doors open....
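The automatic rollback rule can be sketched as a tiny decision function. Everything here is an assumption for illustration: the `next_rollout_percent` name, the idea that some other check supplies a `failure_seen` flag, and the choice to leave the percentage alone once the window has passed.

```python
ROLLBACK_WINDOW_SECONDS = 90 * 60  # the 90-minute window mentioned above


def next_rollout_percent(current_percent, seconds_since_release, failure_seen):
    """Return the rollout percentage to use after one sanity check."""
    in_window = seconds_since_release < ROLLBACK_WINDOW_SECONDS
    if failure_seen and in_window:
        # Roll back: stop handing the new AMI to freshly launched instances.
        return 0
    # No failure, or the image has already survived the window: no change.
    return current_percent
```

Because a rollback is just "set the percentage to zero", the same knob serves both the gradual roll-out and the emergency brake.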

http://linux-poetry.com/blog/section/mozilla/19/

Daniel Pocock: WebRTC: DruCall in Google Summer of Code 2015?

Friday, 27 March 2015, 00:58

I've offered to help mentor a Google Summer of Code student to work on DruCall. Here is a link to the project details.

The original DruCall was based on SIPml5 and released in 2013 as a proof-of-concept.

It was later adapted to use JSCommunicator as the webphone implementation. JSCommunicator itself was updated by another GSoC student, Juliana Louback, in 2014.

It would be great to take DruCall further in 2015. Here are some of the possibilities that are achievable in GSoC:

Updating it for Drupal 8
Support for logged-in users (currently it just makes anonymous calls, like a phone box)
Support for relaying shopping cart or other session cookie details to the call center operative who accepts the call

Help needed: could you be a co-mentor?

My background is in real-time and server-side infrastructure and I'm providing all the WebRTC SIP infrastructure that the student may need. However, for the project to have the most impact, it would also be helpful to have some input from a second mentor who knows about UI design, the Drupal way of doing things and maybe some Drupal 8 experience. Please contact me ASAP if you would be keen to participate either as a mentor or as a student. The deadline for student applications is just hours away but there is still more time for potential co-mentors to join in.

WebRTC at mini-DebConf Lyon in April

The next mini-DebConf takes place in Lyon, France on April 11 and 12. On the Saturday morning, there will be a brief WebRTC demo and there will be other opportunities to demo or test it and ask questions throughout the day. If you are interested in trying to get WebRTC into your web site, with or without Drupal, please see the RTC Quick Start guide.

http://danielpocock.com/drucall-in-gsoc-2015

Armen Zambrano: mozci 0.4.0 released - Many bug fixes and improved performance

Thursday, 26 March 2015, 23:36

For the release notes with all their hyperlinks, go here.

NOTE: I did a 0.3.1 release but the right number should have been 0.4.0

This release does not add any major features; however, it fixes many issues and performs much better.

Many thanks to @adusca, @jmaher and @vaibhavmagarwal for their contributions.

Features:

An alltalos.py script has been added
Issue #69 - Generate graph of builds to testers
Added flake8 support - Remove pyflakes and pep8
Allow skipping revisions on a list (09f7138)
Issue #61 - Rename trigger_range.py to trigger.py

Fixes:

All the documentation and roadmap have been polished
Issue #90 - Do not trigger builds multiple times if we are intending the test jobs to be triggered multiple times
Issue #94 - Load list of repositories from disk only once
Issue #117 - gaia-try builders are always upstream builders
Determine a running job correctly (068b5ee)
Issue #142 - Loading buildjson files from disk is now only done once
Issue #135 - Remove buildjson files which have fallen out of date
Issue #146 - If the buildapi information about a build is corrupted, trigger that build again
Some DONTBUILD pushes can have buildapi support (dcb942f)
Issue #120 - Prevent triggering more build jobs than necessary

For all changes visit: 0.3.0...0.4.0

This work by Zambrano Gasparnian, Armen is licensed under a Creative Commons Attribution-Noncommercial-Share Alike 3.0 Unported License.

http://feedproxy.google.com/~r/armenzg_mozilla/~3/qQLLmvyisik/mozci-040-released-many-bug-fixes-and.html

Air Mozilla: Flowbox.io & Luna

Thursday, 26 March 2015, 23:00

Woyciech Danilo from flowbox.io talks about their new programming language, Luna. Flowbox develops professional video compositing software, which is powered by a new programming language...

https://air.mozilla.org/flowbox-io-luna/
