Mozilla Releng: The ice cream

>> Wednesday, September 17, 2014

A week or so ago, I was commenting in IRC that I was really impressed that our interns had such amazing communication and presentation skills.  One of the interns, John Zeller said something like "The cream rises to the top", to which I replied "Releng: the ice cream of CS".  From there, the conversation went on to discuss what would be the best ice cream flavour to make that would capture the spirit of Mozilla releng.  The consensus at the end was was that Irish Coffee (coffee with whisky) with cookie dough chunks was the favourite.  Because a lot of people like on the team like coffee, whisky makes it better and who doesn't like cookie dough?

I made this recipe over the weekend with some modifications.  I used the coffee recipe from the Perfect Scoop.  After it was done churning in the ice cream maker,  instead of whisky, which I didn't have on hand, I added Kahlua for more coffee flavour.  I don't really like cookie dough in ice cream but cooked chocolate chip cookies cut up with a liberal sprinkling of Kahlua are tasty.

Diced cookies sprinkled with Kahlua

Ice cream ready to put in freezer

Finished product
I have to say, it's quite delicious :-) If I open source ever stops being fun, I'm going to start a dairy empire.  Not really. Now back to bugzilla...

Read more...

Mozilla pushes - August 2014

>> Wednesday, September 10, 2014

Here's August 2014's monthly analysis of the pushes to our Mozilla development trees.  You can load the data as an HTML page or as a json file.



Trends
It was another record breaking month.  No surprise here!

Highlights

  • 13090 pushes
    • new record
  • 422 pushes/day (average)
    • new record
  • Highest number of pushes/day: 690 pushes on August 20.  This same day corresponded with our first day where we ran over 100,000 test jobs.
    • new record
  • 23.12 pushes/hour (average)

General Remarks
Both Try and Gaia-Try have about 36% each of the pushes.  The three integration repositories (fx-team, mozilla-inbound and b2g-inbound) account around 21% of all the pushes.


Records
August 2014 was the month with most pushes (13,090  pushes)
August 2014 has the highest pushes/day average with 620 pushes/day
July 2014 has the highest average of "pushes-per-hour" with 23.51 pushes/hour
August 20, 2014 had the highest number of pushes in one day with 690 pushes






Read more...

Mozilla pushes - July 2014

>> Friday, August 08, 2014

Here's the July 2014 monthly analysis of the pushes to our Mozilla development trees. You can load the data as an HTML page or as a json file.
 
Trends
Like every month for the past while, we had a new record number of pushes. In reality, given that July is one day longer than June, the numbers are quite similar.

Highlights

  • 12,755 pushes
    • new record
  •  411 pushes/day (average)
  • Highest number of pushes/day: 625 pushes on July 3, 2014
  • Highest 23.51 pushes/hour (average)
    • new record

General remarks
Try keeps on having around 38% of all the pushes. Gaia-Try is in second place with around 31% of pushes.  The three integration repositories (fx-team, mozilla-inbound and b2g-inbound) account around 22% of all the pushes.

Records 
July 2014 was the month with most pushes (12,755 pushes)
June 2014 has the highest pushes/day average with 662 pushes/day
July 2014 has the highest average of "pushes-per-hour" with 23.51 pushes/hour
June 4th, 2014 had the highest number of pushes in one day with 662 
 

Read more...

Scaling mobile testing on AWS

>> Thursday, August 07, 2014

Running tests for Android at Mozilla has typically meant running on reference devices.  Physical devices that run jobs on our continuous integration farm via test harnesses.  However, this leads to the same problem that we have for other tests that run on bare metal.  We can't scale up our capacity without going buying new devices, racking them, configuring them for the network and updating our configurations.  In addition, reference cards, rack mounted or not, are rather delicate creatures and have higher retry rates (tests fail due to infrastructure issues and need to be rerun) than those running on emulators (tests run on an Android emulator in a VM on bare metal or cloud)

Do Android's Dream of Electric Sheep?  ©Bill McIntyre, Creative Commons by-nc-sa 2.0
Recently, we started running Android 2.3 tests on emulators in AWS.  This works well for unit tests (correctness tests).  It's not really appropriate for performance tests, but that's another story.  This impetus behind this change was so we could decommission Tegras, the reference devices we used for running Android 2.2 tests. 

We run many Linux based tests, including Android emulators on AWS spot instances.  Spot instances are AWS excess capacity that you can bid on.  If someone outbids the price you have paid for your spot instance, you instance can be terminated.  But that's okay because we retry jobs if they fail for infrastructure reasons.  The overall percentage of spot instances that are terminated is quite small.  The huge advantage to using spot instances is price.  They are much cheaper than on-demand instances which has allowed us to increase our capacity while continuing to reduce our AWS bill

We have a wide variety of unit tests that run on emulators for mobile on AWS.  We encountered an issue where some of the tests wouldn't run on the default instance type (m1.medium), that we use for our spot instances.   Given the number of jobs we run, we want to run on the cheapest AWS instance type that where the tests will complete successfully.  At the time we first tested it, we couldn't find an instance type where certain CPU/memory intensive tests would run.  So when I first enabled Android 2.3 tests on emulators, I separated the tests so that some would run on AWS spot instances and the ones that needed a more powerful machine would run on our inhouse Linux capacity.  But this change consumed all of the capacity of that pool and we had very high number of pending jobs in that pool.  This meant that people had to wait a long time for their test results.  Not good.

To reduce the pending counts, we needed to buy some more in house Linux capacity or try to run a selected subset of the tests that need more resources or find a new AWS instance type where they would complete successfully.  Geoff from the ATeam ran the tests on the c3.xlarge instance type he had tried before and now it seemed to work.  In his earlier work the tests did not complete successfully on this instance type.  We are unsure as to the reasons why.  One of the things about working with AWS is that we don't have a window into the bugs that they fix at their end.  So this particular instance type didn't work before, but it does now.

The next steps for me were to create a new AMI (Amazon machine image) that would serve as as the "golden" version for instances that would be created in this pool.  Previously, we used Puppet to configure our AWS test machines but now just regenerate the AMI every night via cron and this is the version that's instantiated.  The AMI was a copy of the existing Ubuntu64 image that we have but it was configured to run on the c3.xlarge instance type instead of m1.medium. This was a bit tricky because I had to exclude regions where the c3.xlarge instance type was not available.  For redundancy (to still have capacity if an entire region goes down) and cost (some regions are cheaper than others), we run instances in multiple AWS regions

Once I had the new AMI up that would serve as the template for our new slave class, I created a slave with the AMI and verified running the tests we planned to migrate on my staging server.  I also enabled two new Linux64 buildbot masters in AWS to service these new slaves, one in us-east-1 and one in us-west-2.  When enabling a new pool of test machines, it's always good to look at the load on the current buildbot masters and see if additional masters are needed so the current masters aren't overwhelmed with too many slaves attached.

After the tests were all green, I modified our configs to run this subset of tests on a branch (ash), enabled the slave platform in Puppet and added a pool of devices to this slave platform in our production configs.  After the reconfig deployed these changes into production, I landed a regular expression to watch_pending.cfg to so that new tst-emulator64-spot pool of machines would be allocated to the subset of tests and branch I enabled them on. The watch_pending.py script watches the number of pending jobs that on AWS and creates instances as required.  We also have scripts to terminate or stop idle instances when we don't get them.  Why pay for machines when you don't need them now?  After the tests ran successfully on ash, I enabled running the tests on the other relevant branches.

Royal Border Bridge.  Also, release engineers love to see green builds and tests.  ©Jonathan Combe, Creative Commons by-nc-sa 2.0
The end result is that some Android 2.3 tests run on m1.medium or (tst-linux64-spot instances), such as mochitests.



And some Android 2.3 tests run on c3.xlarge or (tst-emulator64-spot instances), such as crashtests.

 

In enabling this slave class within our configs, we were also able to reuse it for some b2g tests which also faced the same problem where they needed a more powerful instance type for the tests to complete.

Lessons learned:
Use the minimum (cheapest) instance type required to complete your tests
As usual, test on a branch before full deployment
Scaling mobile tests doesn't mean more racks of reference cards

Future work:
Bug 1047467 c3.xlarge instance types are expensive, let's test running those tests on a range of instance types that are cheaper

Further reading:
AWS instance types 
Chris Atlee wrote about how we Now Use AWS Spot Instances for Tests
Taras Glek wrote How Mozilla Amazon EC2 Usage Got 15X Cheaper in 8 months
Rail Aliiev http://rail.merail.ca/posts/firefox-builds-are-way-cheaper-now.html 
Bug 980519 Experiment with other instance types for Android 2.3 jobs 
Bug 1024091 Address high pending count in in-house Linux64 test pool 
Bug 1028293 Increase Android 2.3 mochitest chunks, for aws 
Bug 1032268 Experiment with c3.xlarge for Android 2.3 jobs
Bug 1035863 Add two new Linux64 masters to accommodate new emulator slaves
Bug 1034055 Implement c3.xlarge slave class for Linux64 test spot instances
Bug 1031083 Buildbot changes to run selected b2g tests on c3.xlarge
Bug 1047467 c3.xlarge instance types are expensive, let's try running those tests on a range of instance types that are cheaper

Read more...

2014 USENIX Release Engineering Summit CFP now open

>> Monday, July 28, 2014

The CFP for the 2014 Release Engineering summit (Western edition) is now open.  The deadline for submissions is September 5, 2014 and speakers will be notified by September 19, 2014.  The program will be announced in late September.  This one day summit on all things release engineering will be held in concert with LISA, in Seattle on November 10, 2014. 

Seattle skyline © Howard Ignatius, https://flic.kr/p/6tQ3H Creative Commons by-nc-sa 2.0


From the CFP


"Suggestions for topics include (but are not limited to):
  • Best practices for release engineering
  • Practical information on specific aspects of release engineering (e.g., source code management, dependency management, packaging, unit tests, deployment)
  • Future challenges and opportunities in release engineering
  • Solutions for scalable end-to-end release processes
  • Scaling infrastructure and tools for high-volume continuous integration farms
  • War and horror stories
  • Metrics
  • Specific problems and solutions for specific markets (mobile, financial, cloud)
URES '14 West is looking for relevant and engaging speakers and workshop facilitators for our event on November 10, 2014, in Seattle, WA. URES brings together people from all areas of release engineering—release engineers, developers, managers, site reliability engineers, and others—to identify and help propose solutions for the most difficult problems in release engineering today."

War and horror stories. I like to see that in a CFP.  Describing how you overcame problems with  infrastructure and tooling to ship software are the best kinds of stories.  They make people laugh. Maybe cry as they realize they are currently living in that situation.  Good times.  Also, I think talks around scaling high volume continuous integration farms will be interesting.  Scaling issues are a lot of fun and expose many issues you don't see when you're only running a few builds a day. 

If you have any questions surrounding the CFP, I'm happy to help as I'm on the program committee.   (my irc nick is kmoir (#releng) as is my email id at mozilla.com)

Read more...

Reminder: Release Engineering Special Issue submission deadline is August 1, 2014

>> Friday, July 18, 2014

Just a friendly reminder that the deadline for the Release Engineering Special Issue is August 1, 2014.  If you have any questions about the submission process or a topic that's you'd like to write about, the guest editors, including myself, are happy to help you!

Read more...

Mozilla pushes - June 2014

Here's June 2014's  analysis of the pushes to our Mozilla development trees. You can load the data as an HTML page or as a json file

Trends
This was another record breaking month with a total of 12534 pushes.  As a note of interest, this is is over double the number of pushes we had in June 2013. So big kudos to everyone who helped us scale our infrastructure and tooling.  (Actually we had 6,433 pushes in April 2013 which would make this less than half because June 2013 was a bit of a dip.  But still impressive :-)

Highlights
  • 12534 pushes
    • new record
  • 418 pushes/day (average)
    • new record
  • Highest number of pushes/day: 662 pushes on June 5, 2014
    • new record
  • Highest 23.17 pushes/hour (average)
    • new record

General Remarks
The introduction of Gaia-try in April has been very popular and comprised around 30% of pushes in June compared to 29% last month.
The Try branch itself consisted of around 38% of pushes.
The three integration repositories (fx-team, mozilla-inbound and b2g-inbound) account around 21% of all the pushes, compared to 22% in the previous month.

Records
June 2014 was the month with most pushes (12534 pushes)
June 2014 has the highest pushes/day average with
418 pushes/day
June 2014 has the highest average of "pushes-per-hour" is
23.17 pushes/hour
June 4th, 2014 had the highest number of pushes in one day with
662 pushes





Read more...

This week in Mozilla Releng - July 4, 2014

>> Friday, July 04, 2014

This is a special double issue of this week in releng. I was so busy in the last week that I didn't get a chance to post this last week.  Despite the fireworks for Canada Day and Independence Day,  Mozilla release engineering managed to close some bugs. 

Major highlights:

  • Armen, although he works on Ateam now, made blobber uploads discoverable and blogged about it.  Blobber is a server and client side set of tools that allow Releng's test infrastructure to upload files without requiring to deploy ssh keys on them. 
  • Callek and Coop, who served on buildduty during the past two weeks worked to address capacity issues with our test and build infrastructure.  We hit a record of 88,000 jobs yesterday which led to high pending counts.
  • Kim is trying to address the backlog of Android 2.3 test jobs  by moving more test jobs to AWS from our inhouse hardware now that Geoff on the Ateam has found a suitable image.
  • Rail switched jacuzzi EBS from magnetic to SSD.  Jacuzzis are similar pools of build machines  and switching their EBS storage from magnetic to SSD in AWS will improve build times.
 Completed work (resolution is 'FIXED'):
In progress work (unresolved and not assigned to nobody):

Read more...

  © Blogger template Simple n' Sweet by Ourblogtemplates.com 2009

Back to TOP