Chris Cooper: Releng & Relops weekly highlights - June 12, 2015

Monday, June 15, 2015, 20:04

Happy Monday!

Release Engineering has a lot going on. To help spread good news and keep everyone informed, we’re trying an experiment in communication.

Managers put together a list of what we all have been working on, highlighting wins in the last week or so. That list gets sent out to the public release-engineering mailing list and then gets reblogged here.

Please send feedback - did you learn anything, what else should we have included, and what topics might need some additional explanation or context.

tl;dr

TaskCluster: Morgan now has working opt, debug, PGO and ASan 64-bit Linux builds for TaskCluster. This work enables developers to experiment with Linux try jobs on their local systems! Dustin debugged and confirmed that inter-region S3 transfers are capped at 1 MB/sec; he also stood up a relengapi proxy for accessing private files in tooltool. Rail just finished work on deploying signing workers for TaskCluster, important for deploying funsize.

Puppetized Windows in AWS: Mark debugged and worked around annoying ACL and netsh bugs in puppet that were blocking forward progress on Windows puppetization. Amy and Rob are generating 2008R2 puppetized AMIs in AWS via cloud-tools.

Operational

Amy, Hal, Nick, Ben, and Kim responded to the many issues caused by the NetApp outage, including unexpected buildbot database corruption. Hal, Ben, Nick and Rail have been working hard to enable 38.0.6, in support of the spring release which included a bunch of features we needed to get out the door to users.

Q worked on making S3 uploads from Windows complete successfully and worked with Sheriffs to debug and fix a Start Screen problem on Windows 8 that was causing test failures. Jake is saving us money by retiring diamond (shout out to shutting things off!). Rob & Q ensured that all systems now send logs to Papertrail, EVEN XP!

Mike removed one of the last blockers to getting off FTP and over to S3: porting Android builds to mozharness. In addition to freeing us from the buildbot factories, this also uses TaskCluster’s index service for uploading artifacts to S3. Hal and Anhad (our amazing intern!) now have vcs-sync conversions all running in parallel from AWS. As part of moving mozharness in-tree, Jordan is doing work to allow consumers to create and fetch mozharness bundles automatically from relengapi.

Thank you all!

…

And here are all the details:

TaskCluster

Morgan got ASan and debug Firefox builds running in TaskCluster with very little additional effort on top of her previous work with opt builds. Optimized and PGO builds are already working well. Also, this work enables developers to run try jobs on a local system! See: http://linuxpoetry.com/blog/22/
Dustin worked with the AWS support team to determine/verify that S3 is capping transfers between regions at 1MB/sec using their single-stream copy mechanism (bug 1167732). This pinpoints a bottleneck in our CI infrastructure where we can instead use their multi-stream copy utility and increase performance.
Dustin wrote a relengapi proxy (bug 1170753) to allow Android builds access to private files via Releng’s existing tooltool application (which he also rewrote recently). This puts us one step closer to being able to build Android using TaskCluster, the new CI tool developed by the FxOS Releng team. He’ll be working on a more generic proxy implementation once we are further along with the Android builds.
Rail finished work deploying the new signing workers for TaskCluster. These will be used initially for funsize to sign partial update MARs, and will be used in the future to sign nightly and release builds as well.
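
To make the S3 finding above concrete, here is a rough back-of-the-envelope sketch of why splitting a cross-region copy into several streams should help. It is not the tooling Releng uses: the function names are made up for illustration, and the estimate assumes the 1MB/sec cap applies to each stream independently, which the post does not state.

```python
SINGLE_STREAM_CAP_MB_S = 1.0  # per-stream cap reported in the post (bug 1167732)


def plan_parts(total_bytes, part_bytes):
    """Split an object into inclusive (start, end) byte ranges, one per part."""
    if total_bytes <= 0 or part_bytes <= 0:
        raise ValueError("sizes must be positive")
    return [
        (start, min(start + part_bytes, total_bytes) - 1)
        for start in range(0, total_bytes, part_bytes)
    ]


def estimated_seconds(total_mb, streams, cap_mb_s=SINGLE_STREAM_CAP_MB_S):
    """Estimate copy time, ASSUMING the cap applies to each stream separately."""
    if streams < 1:
        raise ValueError("need at least one stream")
    return total_mb / (cap_mb_s * streams)


if __name__ == "__main__":
    print(plan_parts(10, 4))
    print(estimated_seconds(600, 1), estimated_seconds(600, 8))
```

Under that assumption, a 600 MB artifact that takes ten minutes over a single stream would take a little over a minute across eight streams. Real throughput depends on part size, request overhead and whatever other limits apply.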

Puppetized Windows in AWS

We’ve been facing issues with the interaction between puppet and Windows 2008 that led to puppet runs failing after some number of executions. After some detailed debugging, Mark has identified that the Windows file module attempts to append permissions instead of replacing them when puppet runs (bug 1170587). This means we eventually ran out of space in the ACL definition and puppet failed. For now, he’s come up with a workaround that clears the ACL and then applies the permissions fresh on each run where they need to change. He’ll be filing a bug with the puppet folks, but this allows us to move forward on our work to manage Windows 2008 build machines with puppet instead of GPO. There was another puppet bug/bad interaction with netsh exit statuses (bug 1165567) that Mark has a workaround for so we can deploy the same TCP stack tuning that we did with GPO.
Rob and Amy have been modifying cloud-tools and now have a process that’s generating puppetized AMIs for 2008R2 (bug 1166448). Next up is to validate that we get successful machine instantiations off of those AMIs and get some working builds off of those machines in try.
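
The ACL problem Mark tracked down is easier to see in a toy model. The sketch below is plain Python, not puppet code: the entry strings, the `MAX_ACL_ENTRIES` limit and the function names are all hypothetical and chosen only to keep the demo short. It shows why appending on every run eventually fails, while clearing and re-applying does not.

```python
MAX_ACL_ENTRIES = 8  # hypothetical limit, chosen only to keep the demo short


def apply_by_append(acl, desired):
    """Buggy behaviour: every run appends the desired entries again."""
    new_acl = acl + list(desired)
    if len(new_acl) > MAX_ACL_ENTRIES:
        raise OverflowError("ACL is full")
    return new_acl


def apply_by_replace(acl, desired):
    """Workaround: clear the ACL and rewrite it only when it needs to change."""
    if acl == list(desired):
        return acl
    return list(desired)


def runs_until_failure(apply, desired, max_runs=100):
    """Return the 1-based run on which `apply` fails, or None if it never does."""
    acl = []
    for run in range(1, max_runs + 1):
        try:
            acl = apply(acl, desired)
        except OverflowError:
            return run
    return None


if __name__ == "__main__":
    desired = ["Administrators:F", "SYSTEM:F", "Users:R"]  # made-up entries
    print(runs_until_failure(apply_by_append, desired))
    print(runs_until_failure(apply_by_replace, desired))
```

With three desired entries and a limit of eight, the append version fails on its third run. The replace version is idempotent, so it keeps working however many times it runs.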

Operational

Amy, Hal, Nick, Ben, and Kim performed emergency work in conjunction with IT and sheriffs during a NetApp storage outage that brought down releng VMware systems and the buildbot database (as well as several other company services). Once they repaired the affected systems and brought them back online, they verified that services were functional and the trees could be opened again (bug 1172666).
Rob and Q finished up the last of the powershell and GPO scripting to add nxlog forwarding to our Windows XP hosts (bug 1097374). This modification to the XP systems sends all system errors to our consolidated logging infrastructure. Adding this logging to XP means that all release engineering systems now report to the centralized system, where we can search, alert on, and triage issues on systems we previously had no insight into. As a result, we’ll now be able to find and fix systems issues faster.
After some heavy debugging of timeouts during Windows 2008 uploads to S3 from our SCL3 datacenter during two tree closures, we think we’ve found the sweet spot of Windows TCP stack stability and performance. The systems hit an unexpected cascading performance failure after latency between the datacenter and S3 pushed us over an unknown/undocumented threshold. Fortunately, we had experienced similar issues in our prototype infrastructure in EC2 and Q had solved the longstanding EC2 -> S3 upload speed problem just days before (bug 1165314). The conditions in the datacenter were slightly different than our EC2 test-bed and he caught an additional stability issue after rolling out the first fix (bug 1168812). The second fix he deployed seems to have everything working well again and improved our performance in EC2 significantly.
Jake is saving us money and network/CPU usage by retiring the use of diamond (which was reporting to hostedgraphite.com) on our EC2 Linux instances (bug 1164220). We’re now relying on collectd to report statistics to the internal graphite infrastructure instead of also sending the same stats to hostedgraphite. Jake is in the process of compiling updated/patched collectd packages for all of our platforms, which we’ll roll out soon to fix some known issues with reporting (bug 1157337).
In cooperation with the sheriffs, Q has rolled out a change to our Windows 8 test machines that we believe will solve a number of timeout failures across different Firefox test suites. In these test cases, the diagnosis via screenshot was that the Start Screen was active when it should not have been and that the tests were not getting the correct feedback. The change turns off the Start Screen and, because they might also be impacting the same tests, Windows 8 Hot Corners (bug 1169243).
The “spring release” aka 38.0.5 and followups have needed some special attention. Hal, Ben, Nick and Rail have been doing a lot of work over the past weeks handling “What’s New” problems with Aurora, as well as the unscheduled 38.0.6 release. This was an important release to fix up some fallout caused by the funnelcake 38.0.5 experiment. If you’re like Selena, you might not know that funnelcake is our code name for special partner builds that have a look at user conversion for things like download, install and update.
Mike has switched Android builds over to use mozharness instead of the legacy buildbot factories. This was one of the last blockers for our migration off of FTP to S3. The new mozharness-based builds upload to S3 using TaskCluster’s index service.
Hal and Anhad (our intern) have made good progress on moving vcs-sync off end-of-life hardware. All conversions are now running in parallel from AWS!
Jordan is working on adding endpoints to relengapi that will allow consumers to create and fetch mozharness bundles automatically. This is an essential precursor to having mozharness completely managed in-tree and reducing hands-on interaction from releng to zero.