Last week at Strata + Hadoop World 2012, we announced a new data science training and certification program. I am very excited to have been part of the team that put the program together, and I would like to answer some of the most frequently asked questions about the course and the certification that we will be offering.
The primary bottleneck on the success of Hadoop is the number of people who are capable of using it effectively to solve business problems. Addressing that bottleneck with training has always been a very large part of our mission here at Cloudera, and we are very fortunate to have one of the best training teams anywhere. So far, we have trained over 15,000 Hadoop developers and administrators, and our courses and certification exams are available all over the world.
Right now, one of the biggest barriers to the widespread adoption of Hadoop is the supply of data scientists, the peculiar blend of software engineer and statistician that is capable of turning data into awesome. We’ve started to see data science courses develop at universities like Columbia, the University of Washington, and UC Berkeley (taught by Cloudera co-founder Jeff Hammerbacher). While these courses provide excellent instruction to a new generation of data scientists, they necessarily reach only the students enrolled at those institutions, and the need for data science training is much broader and much more immediate.
Earlier this year, Jeff and I started working with Cloudera’s training team to distill our experiences at Facebook and Google into a course that would teach the fundamentals of data science: everything from the pragmatic application of machine learning and statistics to business problems to the data ingest and preparation that is so critical in our work. We hope that by sharing our experience and showing how we take advantage of Hadoop to solve problems, we can help address the shortage of data scientists.
First, we want the course to cover the lifecycle of a data science project, from data acquisition and preparation through model development, production deployment, and evaluation. If you want to sample from a grab bag of methods in machine learning and statistics, there are lots of courses to choose from; we wanted our course to teach students how to build data products.
Second, we want our students to understand that data science isn’t nearly as difficult as it is made out to be. It does involve some new tools and a different way of thinking about problems, but it doesn’t require any skills that can’t be taught to a motivated student and then improved upon with practice.
Third, we want data scientists to understand that they are force multipliers within an organization, and that everything they do should be oriented towards making everyone (decision makers, suppliers, and customers) more effective at using data to make decisions.
Recommender systems are an ideal way to learn about data science with Hadoop, if only because of how simply and clearly a recommendation engine can demonstrate the unreasonable effectiveness of data. But that isn’t the only reason we wanted to build the course around recommendation engines:
The course is appropriate for software engineers, statisticians, and business analysts who are familiar with basic Hadoop commands, Hive, and a scripting language like Python, Perl, or Ruby (the labs in the course use Python). There isn’t any Java programming in the course, but we do discuss and make use of Mahout’s command-line tools to create recommendations. We will also show you how to use R to visualize data and perform simple data analysis tasks.
Data science is an interdisciplinary field, which means that there will be parts of the course that will be more or less familiar to you depending on your background and experience. We also want to emphasize the importance of communication and teamwork in data science: there will be some labs where you guide other students, and others where you may need help from students who have more experience. This is very much by design; no single person is an expert at every aspect of data science, and learning how to work as part of a multidisciplinary team is crucial.
Certifying data scientists is difficult, as the ability to create data products is the real mark of a practicing data scientist.
Cloudera is going to do something new for our data science certification program: we will be combining a written exam that ensures students have a basic set of skills and knowledge with a hands-on exam that is designed to measure both technical ability and the capacity to develop creative approaches to building data products. You won’t be required to take the data science course in order to take the data science certification exam, but it will certainly help. We will be announcing more details about the certification exam process in January, after we’ve had our first cohort of students go through the data science course.
I’m so glad you asked. I have the pleasure of teaching the first course in the Bay Area myself, Nov. 14-16, 2012, and our training team will offer the second course in New York on Dec. 12-14, 2012. We will be teaching the course in additional locations based on demand; you can keep an eye on the schedule of public training courses here, and we’re always happy to do onsite training classes that are optimized for the needs of your team. We look forward to seeing you in class.