By Charles Zedlewski
Ever since Cloudera decided to contribute the code and resources for what would later become Apache Bigtop (incubating), we’ve been answering a very basic question: what exactly is Bigtop and why should you or anyone in the Apache (or Hadoop) community care? The earliest and the most succinct answer (the one used for the Apache Incubator proposal) simply stated that “Bigtop is a project for the development of packaging and tests of the Hadoop ecosystem”. That was a nice explanation of how Bigtop relates to the rest of the Apache Software Foundation’s (ASF) Hadoop ecosystem projects, yet it doesn’t really help you understand the aspirations of Bigtop.
Cloudera was the first company to create an open source distribution that included Apache Hadoop, releasing the first version (CDH1) back in March 2009. The initial goal of CDH was to make Apache Hadoop easier to adopt by providing packaging that let users install Hadoop on popular Linux operating systems without having to compile from source.
In mid-2010 Cloudera announced a major change in CDH that eventually came to recast what defined an Apache Hadoop based distribution. We observed that users were typically running not just Apache Hadoop but also a collection of other open source systems and components that were quickly becoming essential to have a fully functioning data management system. But in order to run such a system, organizations needed to do a great deal of work: assembling and integrating sometimes as many as a dozen different components. Each open source component had its own release schedule, dependencies, interfaces and standards for quality.
CDH3 was the first time a great many of these components were provided together as an integrated system. Since that time we’ve updated the distribution on a regular quarterly schedule and recently released a new major version (CDH4).
That notion of a Hadoop distribution has become the industry’s prevailing definition:
Today, all providers of Apache Hadoop distributions essentially follow this model and many in fact simply choose to redistribute CDH.
Building and supporting CDH taught us a great deal about what was required to repeatedly assemble a truly integrated, Apache Hadoop-based data management system. The build, testing and packaging cost was considerable, and we regularly observed that different projects made different design choices that made ongoing integration difficult. We also realized that more and more mission-critical workloads were running on CDH and that customer demand for stability, predictability and compatibility was increasing.
Apache Bigtop was part of our answer to these two different problems: initiate an Apache open source project focused on creating the testing and integration infrastructure of an Apache Hadoop-based distribution. With it we hoped that:
This would enable us to make progress faster, iterating quickly with new releases of all the projects included in the distribution without worrying about a high rate of change or compatibility breaking that would be difficult to inject into our stable, supported enterprise distribution.
It’s been nearly a year since Apache Bigtop was proposed to the Incubator, and we’ve been thrilled with the progress. There have been 4 releases so far, in keeping with the goal of delivering fixed-time, variable-scope “train” releases. The project started with a diverse range of contributors, and this diversity has broadened over time. We’ve seen new contributions from various corporate sponsors but, more importantly, from members of related communities. Apache Hama, for example, was added to Bigtop, and a member of that community was added to the project in the process. There’s been a similar investment to add Apache Giraph (incubating) to the project.
The overall rate of activity within Apache Bigtop is accelerating: more patches are contributed each month, and more individuals are joining the user and developer lists. This project comes at an important time in the evolution of the Hadoop stack. More than half a dozen new projects have recently emerged to extend the feature set of the Apache Hadoop stack, and Bigtop represents an opportunity to integrate more of them, more quickly, into a larger, more strategic data management system.
If you are:
Apache Bigtop (incubating) is still a very young project. We have some ambitious goals in mind, but we can’t possibly achieve them without your help. We need your feedback and we need your involvement. As always, patches are welcome.