Releng of the Nerds: Scaling Yosemite

We migrated most of our Mac OS X 10.8 (Mountain Lion) test machines to 10.10.2 (Yosemite) this quarter.

This project had two major constraints:
1) Use the existing hardware pool (~100 r5 mac minis)
2) Keep wait times sane¹. (The machines are constantly running tests most of the day due to the distributed nature of the Mozilla community and this had to continue during the migration.)

So basically upgrade all the machines without letting people notice what you're doing!

Yosemite Valley - Tunnel View Sunrise by ©jeffkrause, Creative Commons by-nc-sa 2.0

Why didn't we just buy more minis and add them to the existing pool of test machines?

We run performance tests and thus need to have all the machines running the same hardware within a pool so performance comparisons are valid. If we buy new hardware, we need to replace the entire pool at once. Machines with different hardware specifications = useless performance test comparisons.
We tried to purchase some used machines with the same hardware specs as our existing machines. However, we couldn't find a source for them. As Apple stops production of old mini hardware each time they announce a new one, they are difficult and expensive to source.

Apple Pi by ©apionid, Creative Commons by-nc-sa 2.0

Given that Yosemite was released last October, why we are only upgrading our test pool now? We wait until the population of users running a new platform² surpass those the old one before switching.

Mountain Lion -> Yosemite is an easy upgrade on your laptop. It's not as simple when you're updating production machines that run tests at scale.

The first step was to pull a few machines out of production and verify the Puppet configuration was working. In Puppet, you can specify commands to only run certain operating system versions. So we implemented several commands to accommodate changes for Yosemite. For instance, changing the default scrollbar behaviour, new services that interfere with test runs needed to be disabled, debug tests required new Apple security permissions configured etc.

Once the Puppet configuration was stable, I updated our configs so the people could run tests on Try and allocated a few machines to this pool. We opened bugs for tests that failed on Yosemite but passed on other platforms. This was a very iterative process. Run tests on try. Look at failures, file bugs, fix test manifests. Once we had to the opt (functional) tests in a green state on try, we could start the migration.

Migration strategy

Disable selected Mountain Lion machines from the production pool
Reimage as Yosemite, update DNS and let them puppetize
Land patches to disable Mountain Lion tests and enable corresponding Yosemite tests on selected branches
Enable Yosemite machines to take production jobs
Reconfig so the buildbot master enable new Yosemite builders and schedule jobs appropriately
Repeat this process in batches

Enable Yosemite opt and performance tests on trunk (gecko >= 39) (50 machines)
Enable Yosemite debug (25 more machines)
Enable Yosemite on mozilla-aurora (15 more machines)

We currently have 14 machines left on Mountain Lion for mozilla-beta and mozilla-release branches.

As a I mentioned earlier, the two constraints with this project were to use the existing hardware pool that constantly runs tests in production and keep the existing wait times sane. We encountered two major problems that impeded that goal:

Persistent and increasing numbers of DNS failures as we migrated more machines to Yosemite. The default Yosemite configuration broadcasts multicast messages via Bonjour. This is fine for a few Apple devices talking to each other in your house. This doesn't scale for 100 machines in a colo. I saw many DNS timeout messages in the system log. This manifested itself large numbers of performance tests failing because they couldn't resolve the name of the graphing server to upload their results. We disabled this multicast broadcast via Puppet and our tests turned green again.
Debug tests initially took 1.5-2x longer to complete on Yosemite than they did on Mountain Lion. If we deployed these tests in production, our wait times would increase because of the longer time required to complete each test run. This was fixed by bug 1138616 Don't track addref and release counts in the bloat log.

It's a compliment when people say things like "I didn't realize that you updated a platform" because it means the upgrade did not cause large scale fires for all to see. So it was a nice to hear that from one of my colleagues this week.

Thanks to philor, RyanVM and jmaher for opening bugs with respect to failing tests and greening them up. Thanks to coop for many code reviews. Thanks dividehex for reimaging all the machines in batches and to arr for her valiant attempts to source new-to-us minis!

References
¹Wait times represent the time from when a job is added to the scheduler database until it actually starts running. We usually try to keep this to under 15 minutes but this really varies on how many machines we have in the pool.
²We run tests for our products on a matrix of operating systems and operating system versions. The terminology for operating system x version in many release engineering shops is a platform. To add to this, the list of platform we support varies across branches. For instance, if we're going to deprecate a platform, we'll let this change ride the trains to release.

Further reading
Bug 1121175: [Tracking] Fix failing tests on Mac OSX 10.10
Bug 1121199: Green up 10.10 tests currently failing on try
Bug 1126493: rollout 10.10 tests in a way that doesn't impact wait times
Bug 1144206: investigate what is causing frequent talos failures on 10.10
Bug 1125998: Debug tests initially took 1.5-2x longer to complete on Yosemite

Why don't you just run these tests in the cloud?

The Apple EULA severely restricts virtualization on Mac hardware.
I don't know of any major cloud vendors that offer the Mac as a platform. Those that claim they do are actually renting racks of Macs on a dedicated per host basis. This does not have the inherent scaling and associated cost saving of cloud computing. In addition, the APIs to manage the machines at scale aren't there.
We manage ~350 Mac minis. We have more experience scaling Apple hardware than many vendors. Not many places run CI at Mozilla scale :-) Hopefully this will change and we'll be able to scale testing on Mac products like we do for Android and Linux in a cloud.

Releng of the Nerds

Scaling Yosemite

>> Friday, March 20, 2015

0 comments:

Post a Comment

Blog Archive