Releng of the Nerds: Scaling mobile testing on AWS

Running tests for Android at Mozilla has typically meant running on reference devices: physical devices that run jobs on our continuous integration farm via test harnesses. However, this leads to the same problem that we have for other tests that run on bare metal. We can't scale up our capacity without buying new devices, racking them, configuring them for the network and updating our configurations. In addition, reference cards, rack mounted or not, are rather delicate creatures and have higher retry rates (tests fail due to infrastructure issues and need to be rerun) than tests running on emulators (tests run on an Android emulator in a VM on bare metal or cloud).

Do Android's Dream of Electric Sheep? ©Bill McIntyre, Creative Commons by-nc-sa 2.0

Recently, we started running Android 2.3 tests on emulators in AWS. This works well for unit tests (correctness tests). It's not really appropriate for performance tests, but that's another story. The impetus behind this change was to decommission Tegras, the reference devices we used for running Android 2.2 tests.

We run many Linux-based tests, including Android emulator tests, on AWS spot instances. Spot instances are AWS excess capacity that you can bid on. If someone outbids the price you have paid for your spot instance, your instance can be terminated. But that's okay because we retry jobs if they fail for infrastructure reasons. The overall percentage of spot instances that are terminated is quite small. The huge advantage to using spot instances is price. They are much cheaper than on-demand instances, which has allowed us to increase our capacity while continuing to reduce our AWS bill.
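The "retry on infrastructure failure" idea can be sketched in a few lines. This is only an illustration of the policy, not how our buildbot setup implements it: the `InfrastructureError` class, the `run_with_retries` helper and the `max_attempts` limit are all hypothetical names made up for this example.

```python
class InfrastructureError(Exception):
    """Hypothetical error for failures unrelated to the code under test,
    such as a spot instance being terminated mid-job."""


def run_with_retries(job, max_attempts=3):
    """Call job() and rerun it only when it fails for infrastructure reasons.

    Any other exception (for example a genuine test failure) is not caught
    here, so it propagates to the caller on the first attempt.
    """
    for attempt in range(1, max_attempts + 1):
        try:
            return job()
        except InfrastructureError:
            if attempt == max_attempts:
                # Out of attempts: surface the infrastructure failure.
                raise
```

The key design point is that only infrastructure failures are retried. A real test failure should be reported right away rather than hidden behind reruns.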

We have a wide variety of unit tests that run on emulators for mobile on AWS. We encountered an issue where some of the tests wouldn't run on the default instance type (m1.medium) that we use for our spot instances. Given the number of jobs we run, we want to run on the cheapest AWS instance type where the tests will complete successfully. At the time we first tested it, we couldn't find an instance type where certain CPU/memory intensive tests would run. So when I first enabled Android 2.3 tests on emulators, I separated the tests so that some would run on AWS spot instances and the ones that needed a more powerful machine would run on our in-house Linux capacity. But this change consumed all of the capacity of that pool and we had a very high number of pending jobs in that pool. This meant that people had to wait a long time for their test results. Not good.

To reduce the pending counts, we needed to buy some more in-house Linux capacity, run only a selected subset of the tests that need more resources, or find a new AWS instance type where they would complete successfully. Geoff from the ATeam reran the tests on the c3.xlarge instance type he had tried before, and this time they worked. In his earlier work the tests did not complete successfully on this instance type. We are unsure as to the reasons why. One of the things about working with AWS is that we don't have a window into the bugs that they fix at their end. So this particular instance type didn't work before, but it does now.

The next step for me was to create a new AMI (Amazon Machine Image) that would serve as the "golden" version for instances that would be created in this pool. Previously, we used Puppet to configure our AWS test machines, but now we just regenerate the AMI every night via cron and this is the version that's instantiated. The AMI was a copy of the existing Ubuntu64 image that we have, but it was configured to run on the c3.xlarge instance type instead of m1.medium. This was a bit tricky because I had to exclude regions where the c3.xlarge instance type was not available. For redundancy (to still have capacity if an entire region goes down) and cost (some regions are cheaper than others), we run instances in multiple AWS regions.
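The region exclusion boils down to a simple filter. Here is a minimal sketch, assuming we keep a table of which instance types each region offers. The table below is made up for illustration (example-region-1 is a placeholder, not a real AWS region), and `regions_supporting` is a hypothetical helper, not part of our actual tooling.

```python
# Illustrative availability table. The contents are assumptions for this
# example and do not reflect real AWS availability data.
AVAILABLE_TYPES = {
    "us-east-1": {"m1.medium", "c3.xlarge"},
    "us-west-2": {"m1.medium", "c3.xlarge"},
    "example-region-1": {"m1.medium"},  # placeholder region without c3.xlarge
}


def regions_supporting(instance_type, availability):
    """Return the sorted list of regions that offer the given instance type."""
    return sorted(
        region
        for region, types in availability.items()
        if instance_type in types
    )


if __name__ == "__main__":
    # Only regions offering c3.xlarge should get the new AMI.
    print(regions_supporting("c3.xlarge", AVAILABLE_TYPES))
```

In practice the availability data would have to come from AWS itself or from a maintained config file rather than a hard-coded dictionary.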

Once I had the new AMI up that would serve as the template for our new slave class, I created a slave with the AMI and verified on my staging server that the tests we planned to migrate ran successfully. I also enabled two new Linux64 buildbot masters in AWS to service these new slaves, one in us-east-1 and one in us-west-2. When enabling a new pool of test machines, it's always good to look at the load on the current buildbot masters and see if additional masters are needed so the current masters aren't overwhelmed with too many slaves attached.

After the tests were all green, I modified our configs to run this subset of tests on a branch (ash), enabled the slave platform in Puppet and added a pool of devices to this slave platform in our production configs. After the reconfig deployed these changes into production, I landed a regular expression in watch_pending.cfg so that the new tst-emulator64-spot pool of machines would be allocated to the subset of tests and the branch I enabled them on. The watch_pending.py script watches the number of pending jobs on AWS and creates instances as required. We also have scripts to terminate or stop idle instances when we don't need them. Why pay for machines when you don't need them now? After the tests ran successfully on ash, I enabled running the tests on the other relevant branches.
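To give a flavour of how a regular expression can steer pending jobs to a pool, here is a rough sketch. It is not the real watch_pending.py logic or the real watch_pending.cfg format. The builder names, the "heavy-suite" pattern, the default-spot-pool name and the per-pool limits are all invented for illustration. Only the tst-emulator64-spot pool name comes from our setup.

```python
import re
from collections import Counter

# Hypothetical (pattern, pool) pairs, checked in order. The first match wins,
# so the more specific pattern has to come first.
POOL_PATTERNS = [
    (re.compile(r"^Android 2\.3 Emulator ash .*heavy-suite"), "tst-emulator64-spot"),
    (re.compile(r"^Android 2\.3 Emulator "), "default-spot-pool"),
]


def pool_for(builder_name):
    """Return the pool for a pending job's builder name, or None if unmatched."""
    for pattern, pool in POOL_PATTERNS:
        if pattern.search(builder_name):
            return pool
    return None


def instances_needed(pending_builders, running_by_pool, max_by_pool):
    """Count pending jobs per pool and cap new instances at each pool's headroom."""
    pending = Counter()
    for name in pending_builders:
        pool = pool_for(name)
        if pool is not None:
            pending[pool] += 1
    needed = {}
    for pool, count in pending.items():
        headroom = max_by_pool.get(pool, 0) - running_by_pool.get(pool, 0)
        needed[pool] = max(0, min(count, headroom))
    return needed
```

Landing the pattern for a single branch first means only jobs from that branch trigger instances in the new pool, which keeps the rollout contained until the tests have proven themselves.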