Learning Data Science and evidence based teaching methods

>> Friday, July 17, 2015

This spring, I took several online courses on the topic of data science.  I became interested in expanding my skills in this area because as release engineers, we deal with a lot of data.  I wanted to learn new tools to extract useful information from the distributed systems behemoth we manage.

This xckd reminded me of the challenges of managing our buildfarm somedays :-)

From http://xkcd.com/1546/

I took three courses from Coursera's Data Science track from John Hopkins University. As with previous coursera classes I took, all the course material is online (lecture videos and notes).  There are quizzes and assignments that are due each week.  Each course below was about four weeks long.

The Data Scientist's Toolbox - This course was pretty easy. Basically a introduction to the questions that data scientists deal as well a primer on installing R, RStudio (IDE for R), and using GitHub.
R Programming - Introduction to R.  Most of the quizzes and examples used publicly available data for the programming exercises.  I found I had to do a lot of reading in the R API docs or on stackoverflow to finish the assignments.  The lectures didn't provide a lot of the material needed to complete the assignments.  Lots of techniques to learn how to subset data using R which I found quite interesting, reminded me a lot of querying databases with SQL to conduct analysis.
Getting and Cleaning Data - More advanced techniques using R.  Using publicly available data sources to clean different data sources in different formats, XML, excel spreadsheets, comma or tab delimited. Given this data, we had to answer many questions and conduct specific analysis by writing  R programs.  The assignments were pretty challenging and took a long time. Again, the course material didn't really cover all the material you needed to do the assignments so a lot of additional reading was required.

There are six more courses in the Data Science track that I'll start tackling again in the fall that cover subjects such as reproducible research, statistical inference and machine learning.   My next coursera  class is Introduction to Systems Engineering which I'll start in a couple of weeks.  I've really become interested in learning more about this subject after reading Thinking in Systems.

The other course I took this spring was the Software Carpentry Instructor training course.   The Software Carpentry Foundation teachers researchers basic software skills.  For instance, if you are a biologist analyzing large data sets it would be useful to learn how to use R, Python, and version control to store the code you wrote to share with others.  These are not skills that many scientists acquire in their formal university training, and learning them allows them to work more productively.  The instructor course was excellent, thanks Greg Wilson for your work teaching us.

We read two books for this course:
Building a Better Teacher: An interesting overview of how teacher is taught in different countries and how to make it more effective. Most important: Have more opportunities for other teachers to observe your classroom and provide feedback which I found analogous to how code review makes us better software developers.
How Learning Works: Seven Research-Based Principles for Smart Teaching: A book summarizing the research in disciplines such as education, cognitive science and psychology on the effective techniques for teaching students new material.  How assessing student's prior knowledge can help you better design your lessons, how to to ask questions to determine what material students are failing to grasp, how to understand student's motivation for learning and more.  Really interesting research.

For the instructor course, we met every couple of weeks online where Greg would conduct a short discussion on some of the topics on a conference call and we would discuss via etherpad interactively. We would then meet in smaller groups later in the week to conduct practice teaching exercises.  We also submitted example lessons to the course repo on GitHub. The final project for the course was to conduct a short lesson to a group of instructors that gave feedback, and submit a pull request to update an existing lesson with a fix.  Then we are ready to sign up to teach a Software Carpentry course!

In conclusion, data science is a great skill to have if you are managing large distributed systems.  Also, using evidence based teaching methods to help others learn is the way to go!

Other fun data science examples include
Tracking down the Villains: Outlier Detection at Netflix - detecting rogue servers with machine learning
Finding Shoe Stores in 100k Merchants: Using Data to Group All Things - finding out what Shopify merchants sell shoes using Apache Spark and more
Looking Through Camera Lenses: The Application of Computer Vision at Etsy


Test job reduction by the numbers

>> Tuesday, June 16, 2015

In an earlier post,  I wrote how we had reduced the amount of test jobs that run on two branches to allow us to scale our infrastructure more effectively.  We run the tests that historically identify regressions more often.  The ones that don't, we skip on every Nth push.  We now have data on how this reduced the number of jobs we run since we began implementation in April.

We run SETA on two branches (mozilla-inbound and fx-team) and on 18 types of builds.  Collectively, these two branches represent about 20% of pushes each month.  Implementing SETA allowed us to move  from ~400 -> ~240 jobs per push on these two branches1 We run the tests identified as not reporting regressions on every 10th commit or 90 minutes since the last test was scheduled.  We run the critical tests on every commit.2

Reduction in number of jobs per push on mozilla-inbound as SETA scheduling is rolled out

A graph for the fx-team branch shows a similar trend. It was a staged rollout starting in early April, as I enabled platforms and as the SETA data became available. The dip in early June reflects where I enabled SETA for Android 4.3.

This data will continue to be updated in our scheduling configuration as it evolves and is updated by the code that Joel and Vaibhav wrote to analyze regressions. The analysis identifies that there were

Jobs to ignore: 440
Jobs to run: 114
Total number of jobs: 554

which is significant.  Our buildbot configurations are updated the latest SETA data with every reconfig, which occurs usually occurs every couple of days.

The platforms configured to run fewer tests for both opt and debug are

        MacOSX (10.6, 10.10)
        Windows (XP, 7, 8)
        Ubuntu 12.04 for linux32, linux64 and ASAN x64
        Android 2.3 armv7 API 9
        Android 4.3 armv7 API 11+

Additional info
1Tests may have been disabled/added at the same time,  this is not taken into account
2There still some scheduling issues to be fixed see bug 1174870  and bug 1174746 for further details


Mozilla pushes - May 2015

>> Friday, June 12, 2015

Here's May 2015's monthly analysis of the pushes to our Mozilla development trees. You can load the data as an HTML page or as a json file.


The number of pushes decreased from those recorded in the previous month (8894) with a total of 8363. 


  • 8363 pushes
  • 270 pushes/day (average)
  • Highest number of pushes/day: 445 pushes on May 21, 2015
  • 16.03 pushes/hour (highest average)

General Remarks
  • Try has around 62% of all the pushes now
  • The three integration repositories (fx-team, mozilla-inbound and b2g-inbound) account around 27% of all the pushes.

  • August 2014 was the month with most pushes (13090  pushes)
  • August 2014 had the highest pushes/day average with 422 pushes/day
  • July 2014 had the highest average of "pushes-per-hour" with 23.51 pushes/hour
  • October 8, 2014 had the highest number of pushes in one day with 715 pushes 


Mozilla pushes - April 2015

>> Friday, May 01, 2015

Here's April 2015's  monthly analysis of the pushes to our Mozilla development trees. You can load the data as an HTML page or as a json file.  

The number of pushes decreased from those recorded in the previous month with a total of 8894.  This is due to the fact that gaia-try is managed by taskcluster and thus these jobs don't appear in the buildbot scheduling databases anymore which this report tracks.


  • 8894 pushes
  • 296 pushes/day (average)
  • Highest number of pushes/day: 528 pushes on Apr 1, 2015
  • 17.87 pushes/hour (highest average)

General Remarks

  • Try has around 58% of all the pushes now that we no longer track gaia-try
  • The three integration repositories (fx-team, mozilla-inbound and b2g-inbound) account around 28% of all the pushes.


  • August 2014 was the month with most pushes (13090  pushes)
  • August 2014 had the highest pushes/day average with 422 pushes/day
  • July 2014 had the highest average of "pushes-per-hour" with 23.51 pushes/hour
  • October 8, 2014 had the highest number of pushes in one day with 715 pushes  

I've changed the graphs to only track 2015 data.  Last month they were tracking 2014 data as well but it looked crowded so I updated them.  Here's a graph showing the number of pushes over the last few years for comparison.


Releng 2015 program now available

>> Tuesday, April 28, 2015

Releng 2015 will take place in concert with ICSE in Florence, Italy on May 19, 2015. The program is now available. Register here!

via romana in firenze by ©pinomoscato, Creative Commons by-nc-sa 2.0


Less testing, same great Firefox taste!

Running a large continuous integration farm forces you to deal with many dynamic inputs coupled with capacity constraints. The number of pushes increase.  People add more tests.  We build and test on a new platform.  If the number of machines available remains static, the computing time associated with a single push will increase.  You can scale this for platforms that you build and test in the cloud (for us - Linux and Android on emulators), but this costs more money.  Adding hardware for other platforms such as Mac and Windows in data centres is also costly and time consuming.

Do we really need to run every test on every commit? If not, which tests should be run?  How often do they need to be run in order to catch regressions in a timely manner (i.e. able to bisect where the regression occurred)

Several months ago, jmaher and vaibhav1994, wrote code to analyze the test data and determine the minimum number of tests required to run to identify regressions.  They named their software SETA (search for extraneous test automation). They used historical data to determine the minimum set of tests that needed to be run to catch historical regressions.  Previously, we coalesced tests on a number of platforms to mitigate too many jobs being queued for too few machines.  However, this was not the best way to proceed because it reduced the number of times we ran all tests, not just less useful ones.  SETA allows us to run a subset of tests on every commit that historically have caught regressions.  We still run all the test suites, but at a specified interval. 

SETI – The Search for Extraterrestrial Intelligence by ©encouragement, Creative Commons by-nc-sa 2.0
In the last few weeks, I've implemented SETA scheduling in our our buildbot configs to use the data that the analysis that Vaibhav and Joel  implemented.  Currently, it's implemented on mozilla-inbound and fx-team branches which in aggregate represent around 19.6% (March 2015 data) of total pushes to the trees.  The platforms configured to run fewer tests for both opt and debug are
  • MacOSX (10.6, 10.10)
  • Windows (XP, 7, 8)
  • Ubuntu 12.04 for linux32, linux64 and ASAN x64
  • Android 2.3 armv7 API 9

As we gather more SETA data for newer platforms, such as Android 4.3, we can implement SETA scheduling for it as well and reduce our test load.  We continue to run the full suite of tests on all platforms other branches other than m-i and fx-team, such as mozilla-central, try, and the beta and release branches. If we did miss a regression by reducing the tests, it would appear on other branches mozilla-central. We will continue to update our configs to incorporate SETA data as it changes.

How does SETA scheduling work?
We specify the tests that we would like to run on a reduced schedule in our buildbot configs.  For instance, this specifies that we would like to run these debug tests on every 10th commit or if we reach a timeout of 5400 seconds between tests.


Previously, catlee had implemented a scheduling in buildbot that allowed us to coallesce jobs on a certain branch and platform using EveryNthScheduler.  However, as it was originally implemented, it didn't allow us to specify tests to skip, such as mochitest-3 debug on MacOSX 10.10 on mozilla-inbound.  It would only allow us to skip all the debug or opt tests for a certain platform and branch.

I modified misc.py to parse the configs and create a dictionary for each test specifying the interval at which the test should be skipped and the timeout interval.  If the tests has these parameters specified, it should be scheduled using the  EveryNthScheduler instead of the default scheduler.

There are still some quirks to work out but I think it is working out well so far. I'll have some graphs in a future post on how this reduced our test load. 

Further reading
Joel Maher: SETA – Search for Extraneous Test Automation


Mozilla pushes - March 2015

>> Wednesday, April 15, 2015

Here's March 2015's  monthly analysis of the pushes to our Mozilla development trees. You can load the data as an HTML page or as a json file.

The number of pushes increased from those recorded in the previous month with a total of 10943. 


  • 10943 pushes
  • 353 pushes/day (average)
  • Highest number of pushes/day: 579 pushes on Mar 11, 2015
  • 23.18 pushes/hour (highest average)

General Remarks
  • Try keeps on having around 49% of all the pushes
  • The three integration repositories (fx-team, mozilla-inbound and b2g-inbound) account around 26% of all the pushes.

  • August 2014 was the month with most pushes (13090  pushes)
  • August 2014 had the highest pushes/day average with 422 pushes/day
  • July 2014 had the highest average of "pushes-per-hour" with 23.51 pushes/hour
  • October 8, 2014 had the highest number of pushes in one day with 715 pushes 


Scaling Yosemite

>> Friday, March 20, 2015

We migrated most of our Mac OS X 10.8 (Mountain Lion) test machines to 10.10.2 (Yosemite) this quarter.

This project had two major constraints:
1) Use the existing hardware pool (~100 r5 mac minis)
2) Keep wait times sane1.  (The machines are constantly running tests most of the day due to the distributed nature of the Mozilla community and this had to continue during the migration.)

So basically upgrade all the machines without letting people notice what you're doing!

Yosemite Valley - Tunnel View Sunrise by ©jeffkrause, Creative Commons by-nc-sa 2.0

Why didn't we just buy more minis and add them to the existing pool of test machines?
  1. We run performance tests and thus need to have all the machines running the same hardware within a pool so performance comparisons are valid.  If we buy new hardware, we need to replace the entire pool at once.  Machines with different hardware specifications = useless performance test comparisons.
  2. We tried to purchase some used machines with the same hardware specs as our existing machines.  However, we couldn't find a source for them.  As Apple stops production of old mini hardware each time they announce a new one, they are difficult and expensive to source.
Apple Pi by ©apionid, Creative Commons by-nc-sa 2.0

Given that Yosemite was released last October, why we are only upgrading our test pool now?  We wait until the population of users running a new platform2 surpass those the old one before switching.

Mountain Lion -> Yosemite is an easy upgrade on your laptop.  It's not as simple when you're updating production machines that run tests at scale.

The first step was to pull a few machines out of production and verify the Puppet configuration was working.  In Puppet, you can specify commands to only run certain operating system versions. So we implemented several commands to accommodate changes for Yosemite. For instance, changing the default scrollbar behaviour, new services that interfere with test runs needed to be disabled, debug tests required new Apple security permissions configured etc.

Once the Puppet configuration was stable, I updated our configs so the people could run tests on Try and allocated a few machines to this pool. We opened bugs for tests that failed on Yosemite but passed on other platforms.  This was a very iterative process.  Run tests on try.  Look at failures, file bugs, fix test manifests. Once we had to the opt (functional) tests in a green state on try, we could start the migration.

Migration strategy
  • Disable selected Mountain Lion machines from the production pool
  • Reimage as Yosemite, update DNS and let them puppetize
  • Land patches to disable Mountain Lion tests and enable corresponding Yosemite tests on selected branches
  • Enable Yosemite machines to take production jobs
  • Reconfig so the buildbot master enable new Yosemite builders and schedule jobs appropriately
  • Repeat this process in batches
    • Enable Yosemite opt and performance tests on trunk (gecko >= 39) (50 machines)
    • Enable Yosemite debug (25 more machines)
    • Enable Yosemite on mozilla-aurora (15 more machines)
We currently have 14 machines left on Mountain Lion for mozilla-beta and mozilla-release branches.

As a I mentioned earlier, the two constraints with this project were to use the existing hardware pool that constantly runs tests in production and keep the existing wait times sane.  We encountered two major problems that impeded that goal:

It's a compliment when people say things like "I didn't realize that you updated a platform" because it means the upgrade did not cause large scale fires for all to see.  So it was a nice to hear that from one of my colleagues this week.

Thanks to philor, RyanVM and jmaher for opening bugs with respect to failing tests and greening them up.  Thanks to coop for many code reviews. Thanks dividehex for reimaging all the machines in batches and to arr for her valiant attempts to source new-to-us minis!

1Wait times represent the time from when a job is added to the scheduler database until it actually starts running. We usually try to keep this to under 15 minutes but this really varies on how many machines we have in the pool.
2We run tests for our products on a matrix of operating systems and operating system versions. The terminology for operating system x version in many release engineering shops is a platform.  To add to this, the list of platform we support varies across branches.  For instance, if we're going to deprecate a platform, we'll let this change ride the trains to release.

Further reading
Bug 1121175: [Tracking] Fix failing tests on Mac OSX 10.10 
Bug 1121199: Green up 10.10 tests currently failing on try 
Bug 1126493: rollout 10.10 tests in a way that doesn't impact wait times
Bug 1144206: investigate what is causing frequent talos failures on 10.10
Bug 1125998: Debug tests initially took 1.5-2x longer to complete on Yosemite

Why don't you just run these tests in the cloud?
  1. The Apple EULA severely restricts virtualization on Mac hardware. 
  2. I don't know of any major cloud vendors that offer the Mac as a platform.  Those that claim they do are actually renting racks of Macs on a dedicated per host basis.  This does not have the inherent scaling and associated cost saving of cloud computing.  In addition, the APIs to manage the machines at scale aren't there.
  3. We manage ~350 Mac minis.  We have more experience scaling Apple hardware than many vendors. Not many places run CI at Mozilla scale :-) Hopefully this will change and we'll be able to scale testing on Mac products like we do for Android and Linux in a cloud.


  © Blogger template Simple n' Sweet by Ourblogtemplates.com 2009

Back to TOP