Mozilla pushes - July 2014

>> Friday, August 08, 2014

Here's the July 2014 monthly analysis of the pushes to our Mozilla development trees. You can load the data as an HTML page or as a json file.
 
Trends
Like every month for the past while, we had a new record number of pushes. That said, given that July is one day longer than June, the numbers are actually quite similar to June's.

Highlights

  • 12,755 pushes
    • new record
  • 411 pushes/day (average)
  • Highest number of pushes/day: 625 pushes on July 3, 2014
  • Highest pushes/hour (average): 23.51
    • new record

General remarks
Try continues to receive around 38% of all the pushes. Gaia-Try is in second place with around 31% of pushes. The three integration repositories (fx-team, mozilla-inbound and b2g-inbound) account for around 22% of all the pushes.

Records 
July 2014 was the month with the most pushes (12,755 pushes)
June 2014 has the highest pushes/day average with 418 pushes/day
July 2014 has the highest average of "pushes-per-hour" with 23.51 pushes/hour
June 4th, 2014 had the highest number of pushes in one day with 662 pushes
 


Scaling mobile testing on AWS

>> Thursday, August 07, 2014

Running tests for Android at Mozilla has typically meant running on reference devices: physical devices that run jobs on our continuous integration farm via test harnesses. However, this leads to the same problem we have with other tests that run on bare metal: we can't scale up our capacity without buying new devices, racking them, configuring them for the network and updating our configurations. In addition, reference cards, rack mounted or not, are rather delicate creatures and have higher retry rates (tests fail due to infrastructure issues and need to be rerun) than tests running on emulators (an Android emulator in a VM on bare metal or in the cloud).

Do Androids Dream of Electric Sheep? ©Bill McIntyre, Creative Commons by-nc-sa 2.0
Recently, we started running Android 2.3 tests on emulators in AWS. This works well for unit tests (correctness tests); it's not really appropriate for performance tests, but that's another story. The impetus behind this change was to let us decommission the Tegras, the reference devices we used for running Android 2.2 tests.

We run many Linux-based tests, including Android emulator tests, on AWS spot instances. Spot instances are AWS excess capacity that you can bid on. If someone outbids the price you have bid for your spot instance, your instance can be terminated, but that's okay because we retry jobs that fail for infrastructure reasons, and the overall percentage of spot instances that get terminated is quite small. The huge advantage of spot instances is price: they are much cheaper than on-demand instances, which has allowed us to increase our capacity while continuing to reduce our AWS bill.
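As a rough illustration of that price gap, here is a minimal sketch of how you could pull recent spot prices for an instance type. It is not part of our tooling; the use of boto3, the instance type, the region and the one-hour window are all assumptions made for the example.

    import datetime
    import boto3

    # Query recent spot prices for one example instance type in one example
    # region. Instance type, region and time window are illustrative only.
    ec2 = boto3.client("ec2", region_name="us-east-1")

    history = ec2.describe_spot_price_history(
        InstanceTypes=["m1.medium"],
        ProductDescriptions=["Linux/UNIX"],
        StartTime=datetime.datetime.utcnow() - datetime.timedelta(hours=1),
    )

    for entry in history["SpotPriceHistory"]:
        print(entry["AvailabilityZone"], entry["InstanceType"], entry["SpotPrice"])

Comparing those prices against the published on-demand rate for the same instance type is what makes the savings obvious.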

We run a wide variety of unit tests on emulators for mobile on AWS. We encountered an issue where some of the tests wouldn't run on the default instance type (m1.medium) that we use for our spot instances. Given the number of jobs we run, we want to run on the cheapest AWS instance type where the tests will complete successfully. At the time we first tested it, we couldn't find an instance type where certain CPU/memory-intensive tests would run. So when I first enabled Android 2.3 tests on emulators, I split the tests so that some would run on AWS spot instances and the ones that needed a more powerful machine would run on our in-house Linux capacity. But this change consumed all of the capacity of that pool and we had a very high number of pending jobs in it, which meant that people had to wait a long time for their test results. Not good.

To reduce the pending counts, we needed to either buy more in-house Linux capacity or find an AWS instance type where the subset of tests that needs more resources would complete successfully. Geoff from the ATeam re-ran the tests on the c3.xlarge instance type he had tried before, and this time they worked. In his earlier work the tests did not complete successfully on this instance type, and we are unsure why. One of the things about working with AWS is that we don't have a window into the bugs they fix on their end: this particular instance type didn't work before, but it does now.

The next step for me was to create a new AMI (Amazon Machine Image) that would serve as the "golden" version for instances created in this pool. Previously we used Puppet to configure our AWS test machines, but now we just regenerate the AMI every night via cron, and that is the version that gets instantiated. The AMI was a copy of our existing Ubuntu64 image, but configured to run on the c3.xlarge instance type instead of m1.medium. This was a bit tricky because I had to exclude the regions where the c3.xlarge instance type was not available. For redundancy (to still have capacity if an entire region goes down) and cost (some regions are cheaper than others), we run instances in multiple AWS regions.
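For illustration only, here is a minimal sketch of how one could determine which regions actually offer a given instance type before deciding where the AMI and its pool should live. It is not our actual AMI-generation tooling, and the use of boto3 and the choice of instance type are assumptions for the example.

    import boto3

    INSTANCE_TYPE = "c3.xlarge"  # example instance type to check

    # List the regions visible to the account, then ask each region whether it
    # offers the instance type. Regions with no offering would be excluded
    # from the pool's configuration.
    ec2 = boto3.client("ec2", region_name="us-east-1")
    regions = [r["RegionName"] for r in ec2.describe_regions()["Regions"]]

    available = []
    for region in regions:
        regional = boto3.client("ec2", region_name=region)
        offerings = regional.describe_instance_type_offerings(
            LocationType="region",
            Filters=[{"Name": "instance-type", "Values": [INSTANCE_TYPE]}],
        )
        if offerings["InstanceTypeOfferings"]:
            available.append(region)

    print("Regions offering %s: %s" % (INSTANCE_TYPE, ", ".join(sorted(available))))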

Once the new AMI that would serve as the template for our new slave class was up, I created a slave from it and verified on my staging server that the tests we planned to migrate ran successfully. I also enabled two new Linux64 buildbot masters in AWS to service these new slaves, one in us-east-1 and one in us-west-2. When enabling a new pool of test machines, it's always good to look at the load on the current buildbot masters and see whether additional masters are needed so the existing ones aren't overwhelmed with too many slaves attached.

After the tests were all green, I modified our configs to run this subset of tests on a branch (ash), enabled the slave platform in Puppet and added a pool of devices to this slave platform in our production configs. After the reconfig deployed these changes into production, I landed a regular expression in watch_pending.cfg so that the new tst-emulator64-spot pool of machines would be allocated to the subset of tests and the branch I had enabled them on. The watch_pending.py script watches the number of pending jobs and creates AWS instances as required. We also have scripts to terminate or stop idle instances when we don't need them; why pay for machines you don't need right now? After the tests ran successfully on ash, I enabled them on the other relevant branches.
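To give a flavour of that mechanism, here is a heavily simplified sketch. It is not the real watch_pending.py; the pool name, AMI id, bid price, threshold and boto3 usage are hypothetical stand-ins. It checks a (stubbed) pending-job count for a pool and requests spot instances when the backlog grows.

    import boto3

    PENDING_THRESHOLD = 50  # hypothetical backlog size that triggers new capacity
    POOLS = {
        # pool name -> (AMI id, instance type); the values are placeholders
        "tst-emulator64-spot": ("ami-12345678", "c3.xlarge"),
    }

    def get_pending_count(pool):
        """Stub: the real system reads pending job counts from the scheduler's
        database; here we just return a placeholder value."""
        return 0

    def request_spot_capacity(ami_id, instance_type, count):
        """Ask AWS for spot instances to work through the backlog."""
        ec2 = boto3.client("ec2", region_name="us-east-1")
        ec2.request_spot_instances(
            SpotPrice="0.10",  # hypothetical bid
            InstanceCount=count,
            LaunchSpecification={"ImageId": ami_id, "InstanceType": instance_type},
        )

    def check_pools():
        # Intended to run periodically, e.g. from cron or a polling loop.
        for pool, (ami_id, instance_type) in POOLS.items():
            pending = get_pending_count(pool)
            if pending > PENDING_THRESHOLD:
                request_spot_capacity(ami_id, instance_type, count=max(1, pending // 10))

    if __name__ == "__main__":
        check_pools()

A companion script would do the reverse: find instances that have been idle for a while and stop or terminate them so we are not paying for capacity we no longer need.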

Royal Border Bridge.  Also, release engineers love to see green builds and tests.  ©Jonathan Combe, Creative Commons by-nc-sa 2.0
The end result is that some Android 2.3 tests, such as mochitests, run on m1.medium (tst-linux64-spot) instances.



And some Android 2.3 tests, such as crashtests, run on c3.xlarge (tst-emulator64-spot) instances.

 

In enabling this slave class within our configs, we were also able to reuse it for some b2g tests that faced the same problem: they needed a more powerful instance type for the tests to complete.

Lessons learned:
Use the minimum (cheapest) instance type required to complete your tests
As usual, test on a branch before full deployment
Scaling mobile tests doesn't mean more racks of reference cards

Future work:
Bug 1047467 c3.xlarge instance types are expensive, let's test running those tests on a range of instance types that are cheaper

Further reading:
AWS instance types 
Chris Atlee wrote about how we Now Use AWS Spot Instances for Tests
Taras Glek wrote How Mozilla Amazon EC2 Usage Got 15X Cheaper in 8 months
Rail Aliiev http://rail.merail.ca/posts/firefox-builds-are-way-cheaper-now.html 
Bug 980519 Experiment with other instance types for Android 2.3 jobs 
Bug 1024091 Address high pending count in in-house Linux64 test pool 
Bug 1028293 Increase Android 2.3 mochitest chunks, for aws 
Bug 1032268 Experiment with c3.xlarge for Android 2.3 jobs
Bug 1035863 Add two new Linux64 masters to accommodate new emulator slaves
Bug 1034055 Implement c3.xlarge slave class for Linux64 test spot instances
Bug 1031083 Buildbot changes to run selected b2g tests on c3.xlarge
Bug 1047467 c3.xlarge instance types are expensive, let's try running those tests on a range of instance types that are cheaper


2014 USENIX Release Engineering Summit CFP now open

>> Monday, July 28, 2014

The CFP for the 2014 Release Engineering Summit (Western edition) is now open. The deadline for submissions is September 5, 2014, and speakers will be notified by September 19, 2014. The program will be announced in late September. This one-day summit on all things release engineering will be held in concert with LISA, in Seattle, on November 10, 2014.

Seattle skyline © Howard Ignatius, https://flic.kr/p/6tQ3H Creative Commons by-nc-sa 2.0


From the CFP


"Suggestions for topics include (but are not limited to):
  • Best practices for release engineering
  • Practical information on specific aspects of release engineering (e.g., source code management, dependency management, packaging, unit tests, deployment)
  • Future challenges and opportunities in release engineering
  • Solutions for scalable end-to-end release processes
  • Scaling infrastructure and tools for high-volume continuous integration farms
  • War and horror stories
  • Metrics
  • Specific problems and solutions for specific markets (mobile, financial, cloud)
URES '14 West is looking for relevant and engaging speakers and workshop facilitators for our event on November 10, 2014, in Seattle, WA. URES brings together people from all areas of release engineering—release engineers, developers, managers, site reliability engineers, and others—to identify and help propose solutions for the most difficult problems in release engineering today."

War and horror stories. I like to see that in a CFP. Stories about how you overcame problems with infrastructure and tooling to ship software are the best kind. They make people laugh, or maybe cry as they realize they are currently living in that situation. Good times. Also, I think talks about scaling high-volume continuous integration farms will be interesting. Scaling is a lot of fun and exposes many issues you don't see when you're only running a few builds a day.

If you have any questions about the CFP, I'm happy to help as I'm on the program committee. (My irc nick is kmoir (#releng), as is my email id at mozilla.com.)


Reminder: Release Engineering Special Issue submission deadline is August 1, 2014

>> Friday, July 18, 2014

Just a friendly reminder that the deadline for the Release Engineering Special Issue is August 1, 2014. If you have any questions about the submission process or a topic you'd like to write about, the guest editors, myself included, are happy to help you!


Mozilla pushes - June 2014

Here's the June 2014 analysis of the pushes to our Mozilla development trees. You can load the data as an HTML page or as a json file.

Trends
This was another record-breaking month, with a total of 12,534 pushes. As a note of interest, this is over double the number of pushes we had in June 2013, so big kudos to everyone who helped us scale our infrastructure and tooling. (Actually, we had 6,433 pushes in April 2013, so measured against that month we're slightly under double; June 2013 was a bit of a dip. But still impressive :-)

Highlights
  • 12,534 pushes
    • new record
  • 418 pushes/day (average)
    • new record
  • Highest number of pushes/day: 662 pushes on June 5, 2014
    • new record
  • Highest pushes/hour (average): 23.17
    • new record

General Remarks
Gaia-try, introduced in April, has been very popular; it comprised around 30% of pushes in June, compared to 29% last month.
The Try branch itself accounted for around 38% of pushes.
The three integration repositories (fx-team, mozilla-inbound and b2g-inbound) accounted for around 21% of all the pushes, compared to 22% in the previous month.

Records
June 2014 was the month with the most pushes (12,534 pushes)
June 2014 has the highest pushes/day average with 418 pushes/day
June 2014 has the highest average of "pushes-per-hour" with 23.17 pushes/hour
June 4th, 2014 had the highest number of pushes in one day with 662 pushes






This week in Mozilla Releng - July 4, 2014

>> Friday, July 04, 2014

This is a special double issue of this week in releng; I was so busy last week that I didn't get a chance to post it then. Despite the fireworks for Canada Day and Independence Day, Mozilla release engineering managed to close some bugs.

Major highlights:

  • Armen, although he works on the Ateam now, made blobber uploads discoverable and blogged about it. Blobber is a set of server- and client-side tools that allows Releng's test infrastructure to upload files without requiring ssh keys to be deployed on the test machines.
  • Callek and Coop, who served on buildduty during the past two weeks, worked to address capacity issues with our test and build infrastructure. We hit a record of 88,000 jobs yesterday, which led to high pending counts.
  • Kim is trying to address the backlog of Android 2.3 test jobs by moving more of them from our in-house hardware to AWS, now that Geoff on the Ateam has found a suitable image.
  • Rail switched jacuzzi EBS storage from magnetic to SSD. Jacuzzis are pools of similar build machines, and switching their EBS storage from magnetic to SSD in AWS will improve build times.
 Completed work (resolution is 'FIXED'):
In progress work (unresolved and not assigned to nobody):


Introducing Mozilla Releng's summer interns

>> Friday, June 20, 2014

The Mozilla Release Engineering team recently welcomed three interns to our team for the summer.

Ian Connolly is a student at Trinity College in Dublin. This is his first term with Mozilla and he's working on preflight slave tasks and an example project for Releng API.
Andhad Jai Singh is a student at the Indian Institute of Technology Hyderabad. This is his second term working at Mozilla; he was a Google Summer of Code student with the Ateam last year. This term he's working on generating partial updates on request.
John Zeller is also a returning student and studies at Oregon State University. He previously had a work term with Mozilla releng and spent the past school term as a student worker implementing Mozilla Releng apps in Docker. This term he'll work on updating our ship-it application so that release automation updates it more frequently, letting us see the state of a release, as well as on integrating post-release tasks.

 

View from Mozilla San Francisco Office

Please drop by and say hello if you're in our San Francisco office, or say hello to them in #releng - their irc nicknames are ianconnolly, ffledgling and zeller respectively.

Welcome!


This week in Mozilla Releng - June 20, 2014

Ben is away for the next few Fridays, so I'll be covering this blog post for the next couple of weeks.

Major highlights:


Completed work (resolution is 'FIXED'):
In progress work (unresolved and not assigned to nobody):


Talking about speaking up

>> Monday, June 09, 2014

We all interpret life through the lens of our previous experiences. It's difficult to understand what each day is like for someone whose life has been fundamentally different from your own, because you simply haven't had those experiences. I don't understand what it's like to transition from male to female while involved in an open source community. I don't know the steps taken to become an astrophysicist. Or to arrive in a new country as an immigrant. I haven't struggled to survive on the streets as a homeless person, or as a person battered by domestic abuse. To understand the experiences of others, all we can do is listen and learn, with empathy.

There have been many news stories recently about women or other underrepresented groups in technology.   I won't repeat them because frankly, they're quite depressing.  They go something like this:
1.  Incident of harassment/sexism either online/at a company/in a community/at a conference
2.  People call out this behaviour online and ask the organization to apologize and take steps to prevent this in the future.
3.  People from underrepresented groups who speak up about behaviour are told that their feelings are not valid or they are overreacting.  Even worse, they are harassed online with hateful statements telling them they don't belong in tech or are threatened with sexual assault or other acts of violence.
4.  Company/community/conference apologizes and issues a written statement. Or not.
5. Goto 1


I watched an extraordinary talk the other day that provided a vivid perspective on the challenges women in technology face and what people can do to help. Brianna Wu, head of development at Giant Spacekat, a game development company, gave the talk "Nine ways to stop hurting and start helping women in tech" at AltConf last week. She is brutally honest about the problems that exist in our companies and communities, and about the steps forward to make things better.




She talks about how she is threatened and harassed online. She also discusses how random people threatening you on the internet is not just theoretical, but really frightening, because she knows it could result in actual physical violence. The same thing applies to street harassment.

Here's the thing about being a woman. I'm a physically strong person. I can run. But I'm keenly aware that men are almost always bigger than me and, by basic tenets of physiology, stronger than me. So if a man tried to physically attack me, chances are I'd lose that fight. So when someone threatens you, online or not, it is profoundly frightening because you fear for your physical safety. And to have that happen over and over again, as many women in our industry experience, apart from being terrifying, is exhausting and takes a huge emotional toll.

I was going to summarize the points she brings up in her talk but she speaks so powerfully that all I can do is encourage you to watch the talk.

One of her final points really drives home the need for change in our industry when she says to the audience "This is not a problem that women can solve on their own....If you talk to your male friends out there, you guys have a tremendous amount of power as peers.  To talk to them and say, look dude this isn't okay.  You can't do this, you can't talk this way.  You need to think about this behaviour. You guys need to make a difference in a way that I can't."  Because when she talks about this behaviour to men, it often goes in one ear and out the other. To be an ally in any sense of the word, you need to speak up.

THIS 1000x THIS.

Thank you, Brianna, for giving this talk. I hope that when others see it they will gain some insight into, and empathy for, the challenges that women and other underrepresented groups in the technology industry face. And that you will all speak up too.

Further reading
Ashe Dryden's The 101-Level Reader: Books to Help You Better Understand Your Biases and the Lived Experiences of People
Ashe Dryden Our most wicked problem

