Writing Hadoop queries doesn’t have to be hard, and neither does sharing data, according to Mortar Data, which just released an open source framework for Hadoop applications. The idea is that groups of people can more easily collaborate on building apps around giant data sets.
Programmers, rejoice! The startup world is out to make your life easier when it comes to writing Hadoop jobs or entire applications. The latest next big thing in this endeavor is Mortar Data, which is expanding on its cloud-based, Python-wrapped Hadoop service by releasing the open source Mortar framework for Hadoop applications.
The goal of the framework, says Mortar Data co-founder and CEO K Young, is not just to make application development easier, but also to make it easier to share cool datasets. Inspired by Rails, Mortar is a way to write jobs that process data with Hadoop using pipelines, similar in theory to Cascading. Developers can do all their work within a command line interface, including testing jobs before they run. Mortar supports programming in Python and Ruby, and still uses Apache Pig as the workflow language.
The real beauty, though, might be in how easy Mortar makes it to share datasets. Once people find cool public datasets from cities, governments or other sources, they can analyze them using the Mortar Data service and then share the code on GitHub. As long as the dataset is stored in Amazon Web Services (in either S3 or, now, in MongoDB on EC2 instances), anyone cloning the code from GitHub will automatically be able to connect to it.
“We’re trying to take lessons from app development … and bring that to working with data,” Young said. He wants Mortar to make working with Hadoop a repeatable, collaborative experience.
Of course, Young noted, this has utility beyond just weekend hackers working on public datasets. Corporate development teams that might have silos even within AWS can easily share their data and workflows with each other, too. As he did when I first spoke with him in April, Young acknowledged Mortar Data might have to extend its service and the new framework to work with on-premises infrastructure and data, but that’s still not in the works right now.
Apart from potentially moving into the data center, Young said the Mortar framework will likely expand beyond batch processing and MapReduce as alternative Hadoop use cases begin to emerge. “Who really knows what Hadoop is?” he asked, referencing YARN, Impala, graph databases and other alternative processing methods built atop Hadoop (some of which have frameworks of their own). For now, however, he thinks there’s still enough demand for batch processing with MapReduce to build a business around.
It’s a bit premature to begin making predictions about which Hadoop tools developers will flock to (we’ve covered platforms such as Infochimps and Continuuity before, and more certainly will emerge), but Young is confident Mortar can grow a big community around it, a la SpringSource in the Java application space. After all, Young said, developer uptake has been good even though the company has spent the better part of its existence thus far intentionally holding back on marketing and support in order to focus on engineering.
Mortar Data just became an advanced tier AWS partner, he noted, “which means we’re spending a crapload of money on Amazon.”
Feature image courtesy of Shutterstock user isak55.