Releng of the Nerds: September 2015

In September, Mozilla release engineering started experiencing high pending counts on our test pools, notably Windows, but also Linux (and consequently Android). High pending counts mean that there are thousands of jobs queued to run on machines that are already busy running other jobs. As a result, developers wait longer than ideal for their test results.

Usually, pending counts clear overnight because less code is pushed during the night (in North America), which invokes fewer builds and tests. However, as you can see from the graph above, the Windows test pending counts were flat last night. They did not clear up overnight. You will also note that try, which usually comprises 63% of our load, has the highest pending counts compared to other branches. This is because many people land on try before pushing to other branches, and tests aren't coalesced on try.

The work to determine the cause of high pending counts is always an interesting mystery.

Are end to end times for tests increasing?
Have more tests been enabled recently?
Are retries increasing? (Tests that run multiple times because the initial runs fail due to infrastructure issues)
Are jobs that are coalesced being backfilled and consuming capacity?
Are tests being chunked into smaller jobs that increase end to end time due to the added start up time?
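Several of these questions come down to comparing a few numbers per push over time: how many jobs ran, how many were retries, and how much compute time they used. The following is only a minimal sketch of that kind of aggregation, not the tooling we actually used. The record fields (`push_id`, `duration_seconds`, `is_retry`) and the sample values are hypothetical.

```python
from collections import defaultdict

def summarize_by_push(jobs):
    """Aggregate job count, retry count and compute seconds for each push."""
    summary = defaultdict(lambda: {"jobs": 0, "retries": 0, "compute_seconds": 0})
    for job in jobs:
        entry = summary[job["push_id"]]
        entry["jobs"] += 1
        entry["compute_seconds"] += job["duration_seconds"]
        if job["is_retry"]:
            entry["retries"] += 1
    return dict(summary)

# Made-up sample records, for illustration only (not real job data).
sample_jobs = [
    {"push_id": "a1", "duration_seconds": 1200, "is_retry": False},
    {"push_id": "a1", "duration_seconds": 900, "is_retry": True},
    {"push_id": "b2", "duration_seconds": 1500, "is_retry": False},
]

for push_id, stats in summarize_by_push(sample_jobs).items():
    print(push_id, stats)
```

Comparing these per-push totals between two date ranges would show whether the job count, the retry rate, or the compute time is the number that moved.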

Mystery by ©Stuart Richards, Creative Commons by-nc-sa 2.0

Joel Maher and I looked at the data for this last week and discovered what we believe to be the source of the problem. We have determined that since the end of August a number of new test jobs were enabled that increased the compute time per push on Windows by 13% or 2.5 hours per push. Most of these new test jobs are for e10s.

Increase in seconds that new jobs added to the total compute time per push. (Some existing jobs also reduced their compute time, for a total difference of about 2.5 more hours per push on Windows.)
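As a rough sanity check on those two figures, here is a back-of-the-envelope calculation. It assumes the 13% is measured relative to the per-push compute time before the new jobs were enabled. The post does not state that baseline, so the derived totals are approximations and the function name is only illustrative.

```python
# Figures quoted in the post.
ADDED_HOURS_PER_PUSH = 2.5
RELATIVE_INCREASE = 0.13  # assumed to be relative to the earlier per-push total

def implied_compute_hours(added_hours, relative_increase):
    """Return (before, after) compute hours per push implied by the two figures."""
    before = added_hours / relative_increase
    return before, before + added_hours

before, after = implied_compute_hours(ADDED_HOURS_PER_PUSH, RELATIVE_INCREASE)
print(f"before: {before:.1f} h per push, after: {after:.1f} h per push")
# prints "before: 19.2 h per push, after: 21.7 h per push"
```

Under that assumption, a single Windows push went from roughly 19 to roughly 22 hours of total machine time, which helps explain why a fixed pool of hardware falls behind.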

The e10s initiative is an important initiative for Mozilla to make Firefox performance and security even better. However, since new e10s and old tests will continue to run in parallel, we need to get creative about how to achieve acceptable wait times given the limitations of our current Windows test pools. (All of our Windows tests run on bare metal in our datacentre, not on Amazon.)

Release engineering is working to reduce these pending counts, given our current hardware constraints, with the following initiatives:

To reduce Linux pending counts:

Added 200 new instances to the tst-emulator64 pool (run Android test jobs on Linux emulators) (bug 1204756)
In the process of adding more Linux32 and Linux64 buildbot masters (bug 1205409), which will allow us to expand our capacity further

Ongoing work to reduce the Windows pending counts:

Disable Linux32 Talos tests and redeploy these machines as Windows test machines (bug 1204920 and bug 1208449)
Reduce the number of talos jobs by running SETA on talos (bug 1192994)
The developer productivity team is investigating whether tests that are not operating-system-specific, and that currently run on multiple Windows test platforms, can run on fewer platforms.

How can you help?

Please be considerate when invoking try pushes and select only the platforms that you actually need to test. Each try push for all platforms and all tests invokes over 800 jobs.

The mystery of high pending counts

>> Friday, September 25, 2015