Friday, July 17, 2015
This spring, I took several online courses on the topic of data science. I became interested in expanding my skills in this area because as release engineers, we deal with a lot of data. I wanted to learn new tools to extract useful information from the distributed systems behemoth we manage.
This xkcd reminded me of the challenges of managing our buildfarm some days :-)
I took three courses from Coursera's Data Science track from Johns Hopkins University. As with previous Coursera classes I took, all the course material is online (lecture videos and notes). There are quizzes and assignments that are due each week. Each course below was about four weeks long.
The Data Scientist's Toolbox - This course was pretty easy. Basically an introduction to the questions that data scientists deal with, as well as a primer on installing R and RStudio (an IDE for R) and using GitHub.
R Programming - Introduction to R. Most of the quizzes and programming exercises used publicly available data. I found I had to do a lot of reading in the R API docs or on Stack Overflow to finish the assignments, because the lectures didn't provide much of the material needed to complete them. There were lots of techniques for subsetting data in R, which I found quite interesting. They reminded me a lot of querying databases with SQL to conduct analysis.
Getting and Cleaning Data - More advanced techniques using R. We worked with publicly available data sources in a variety of formats (XML, Excel spreadsheets, comma- or tab-delimited files) and cleaned them up. Given this data, we had to answer many questions and conduct specific analyses by writing R programs. The assignments were pretty challenging and took a long time. Again, the lectures didn't really cover everything you needed to do the assignments, so a lot of additional reading was required.
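To make the comparison between subsetting and SQL from the R Programming course a bit more concrete, here is a minimal sketch. The courses used R; this version uses only Python's standard library so it runs anywhere, and the build records, column names, and the 60-minute threshold are all made up for illustration. In R, the subsetting half would look roughly like builds[builds$duration_min > 60, "builder"].

```python
import sqlite3

# Hypothetical build records (made-up data, not from the course or a real buildfarm).
builds = [
    {"builder": "linux64-opt", "platform": "linux", "duration_min": 42},
    {"builder": "win32-debug", "platform": "windows", "duration_min": 95},
    {"builder": "linux64-debug", "platform": "linux", "duration_min": 71},
    {"builder": "macosx64-opt", "platform": "mac", "duration_min": 58},
]

# Subsetting: keep the rows that match a condition, then pick one column.
slow_by_subset = [b["builder"] for b in builds if b["duration_min"] > 60]

# The same question asked as a SQL query against an in-memory SQLite database.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE builds (builder TEXT, platform TEXT, duration_min INTEGER)"
)
conn.executemany(
    "INSERT INTO builds VALUES (:builder, :platform, :duration_min)", builds
)
slow_by_sql = [
    row[0]
    for row in conn.execute(
        "SELECT builder FROM builds WHERE duration_min > 60 ORDER BY rowid"
    )
]
conn.close()

print(slow_by_subset)
print(slow_by_sql)
```

Both approaches filter rows with a condition and then select a column, which is why subsetting a data frame feels so much like writing a WHERE clause.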
There are six more courses in the Data Science track that I'll start tackling again in the fall. They cover subjects such as reproducible research, statistical inference and machine learning. My next Coursera class is Introduction to Systems Engineering, which I'll start in a couple of weeks. I've really become interested in learning more about this subject after reading Thinking in Systems.
The other course I took this spring was the Software Carpentry Instructor training course. The Software Carpentry Foundation teaches researchers basic software skills. For instance, if you are a biologist analyzing large data sets, it would be useful to learn how to use R and Python, and how to use version control to store the code you wrote and share it with others. These are not skills that many scientists acquire in their formal university training, and learning them allows them to work more productively. The instructor course was excellent. Thanks, Greg Wilson, for your work teaching us.
We read two books for this course:
Building a Better Teacher: An interesting overview of how teaching is taught in different countries and how to make it more effective. The most important lesson: create more opportunities for other teachers to observe your classroom and provide feedback, which I found analogous to how code review makes us better software developers.
How Learning Works: Seven Research-Based Principles for Smart Teaching: A book summarizing the research in disciplines such as education, cognitive science and psychology on effective techniques for teaching students new material. It covers how assessing students' prior knowledge can help you better design your lessons, how to ask questions to determine what material students are failing to grasp, how to understand students' motivation for learning, and more. Really interesting research.
For the instructor course, we met online every couple of weeks. Greg would lead a short discussion of some of the topics on a conference call, and we would discuss interactively via Etherpad. We would then meet in smaller groups later in the week to conduct practice teaching exercises. We also submitted example lessons to the course repo on GitHub. The final project for the course was to teach a short lesson to a group of instructors who gave feedback, and to submit a pull request updating an existing lesson with a fix. After that, we were ready to sign up to teach a Software Carpentry course!
In conclusion, data science is a great skill to have if you are managing large distributed systems. Also, using evidence-based teaching methods to help others learn is the way to go!
Other fun data science examples include:
Tracking down the Villains: Outlier Detection at Netflix - detecting rogue servers with machine learning
Finding Shoe Stores in 100k Merchants: Using Data to Group All Things - finding out which Shopify merchants sell shoes using Apache Spark and more
Looking Through Camera Lenses: The Application of Computer Vision at Etsy