The Twelve Days of Christmas (a Data Carol)

On the First day of Christmas, big data gave to me / One Hadoop ecosystem... but many components! Please sing along.

On the First day of Christmas, big data gave to me
One Hadoop ecosystem

Do I really need to explain Hadoop? I didn’t think so. For the ecosystem part, let’s sing on...

***

On the Second day of Christmas, big data gave to me
One Hadoop ecosystem
Two data infrastructures

Here we start. Hadoop or the RDBMS? Or is it Hadoop and the RDBMS? The two data storage infrastructures serve distinct purposes. Better to make them work together than to pit them against each other.

***

On the Third day of Christmas, big data gave to me
One Hadoop ecosystem
Two data infrastructures
Three processing frameworks

Hadoop’s first incarnation included MapReduce, which is still prevalent for batch processing but doesn’t address interactive uses. The advent of YARN has enabled competing frameworks such as Tez and Spark to provide the real-time answer.
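
For readers who have never seen the MapReduce model up close, here is a minimal sketch of the idea in plain Python. It is not Hadoop code and uses no Hadoop API. The function names (`map_phase`, `shuffle`, `reduce_phase`, `word_count`) are made up for illustration, and everything runs in a single process, whereas a real cluster distributes each phase across many machines.

```python
from collections import defaultdict
from itertools import chain


def map_phase(line):
    """Map step: emit a (word, 1) pair for every word in one input line."""
    return [(word.lower(), 1) for word in line.split()]


def shuffle(pairs):
    """Shuffle step: group all emitted values by key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Reduce step: collapse the values for one key into a single count."""
    return key, sum(values)


def word_count(lines):
    """Chain the three steps together (hypothetical helper, single process)."""
    mapped = chain.from_iterable(map_phase(line) for line in lines)
    grouped = shuffle(mapped)
    return dict(reduce_phase(key, values) for key, values in grouped.items())


carol = ["One Hadoop ecosystem", "Two data infrastructures", "Four data lakes"]
counts = word_count(carol)
print(counts)
```

Every step has to finish over the whole input before the next one starts, which is why this model suits batch jobs better than interactive queries.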

***

On the Fourth day of Christmas, big data gave to me
One Hadoop ecosystem
Two data infrastructures
Three processing frameworks
Four data lakes

Data lakes are all the rage nowadays: just pour data into the lake, and fish for insight. But the data lake architecture does not address the critical challenges of data governance, and as such cannot be put in everyone’s hands. (Four is only there for the sake of the song.)

***

On the Fifth day of Christmas, big data gave to me
One Hadoop ecosystem
Two data infrastructures
Three processing frameworks
Four data lakes
Five SQL-on-Hadoop

Built on top of the interactive/real-time frameworks, SQL-on-Hadoop layers started with Hive and now include Stinger, Impala, HAWQ and Drill, to name only the main ones. Today there is no clear winner. For Hadoop vendors, this is one of the differentiators between their stack and a plain-vanilla Hadoop implementation.

***

On the Sixth day of Christmas, big data gave to me
One Hadoop ecosystem
Two data infrastructures
Three processing frameworks
Four data lakes
Five SQL-on-Hadoop
Six BI tools

SQL-on-Hadoop enables “traditional” BI tools to run reports and queries against Hadoop. In addition, a number of BI technologies specifically target Hadoop and run natively inside it, taking advantage of the processing power and flexibility brought by YARN.

***

On the Seventh day of Christmas, big data gave to me
One Hadoop ecosystem
Two data infrastructures
Three processing frameworks
Four data lakes
Five SQL-on-Hadoop
Six BI tools
Seven data wrangling tools

Data wrangling, or data preparation, is the latest rage in the big data ecosystem. It aims to give non-data-scientists rich navigation and discovery inside their data, no matter how convoluted and unclean that data is. Paxata, Trifacta and Springbok represent only the first wave of tools, whose number will likely exceed seven very soon.

***

On the Eighth day of Christmas, big data gave to me
One Hadoop ecosystem
Two data infrastructures
Three processing frameworks
Four data lakes
Five SQL-on-Hadoop
Six BI tools
Seven data wrangling tools
Eight data scientists

You should actually consider yourself lucky if you can hire eight of them, let alone pay their exorbitant salaries. The data scientist has become the most sought-after profile in both born-digital companies and traditional businesses. It is jokingly said that a data scientist is a business analyst who lives in California, but this is not the full story. The data scientist masters the technology needed to explore and mine data, and possesses the business acumen to identify data-driven business opportunities.

***

On the Ninth day of Christmas, big data gave to me
One Hadoop ecosystem
Two data infrastructures
Three processing frameworks
Four data lakes
Five SQL-on-Hadoop
Six BI tools
Seven data wrangling tools
Eight data scientists 
Nine Hadoop vendors

Are there really nine Hadoop vendors? In its Wave report (The Forrester Wave: Big Data Hadoop Solutions, Q1 2014), Forrester lists Cloudera, Hortonworks, MapR, Pivotal, IBM, Amazon, Microsoft, Intel and Teradata. And that’s without counting Apache. Since that publication, Intel has invested in Cloudera and announced it would retire its own distribution. Microsoft and Teradata resell others’ distributions. Oracle (also a reseller) seems to be missing from the list. A real moving target...

***

On the Tenth day of Christmas, big data gave to me
One Hadoop ecosystem
Two data infrastructures
Three processing frameworks
Four data lakes
Five SQL-on-Hadoop
Six BI tools
Seven data wrangling tools
Eight data scientists
Nine Hadoop vendors
Ten data-driven applications

Ten is only there for the sake of the song. There are as many data-driven applications as there are data projects. Once data science has worked its magic and the data algorithms have been operationalized, monetization starts. The value of data is only realized when it is fed to applications, mobile apps, web sites...

***

On the Eleventh day of Christmas, big data gave to me
One Hadoop ecosystem
Two data infrastructures
Three processing frameworks
Four data lakes
Five SQL-on-Hadoop
Six BI tools
Seven data wrangling tools
Eight data scientists 
Nine Hadoop vendors
Ten data-driven applications
Eleven NoSQL engines

Like Hadoop distributions, NoSQL engines are difficult to count. However, Gartner identified eleven commercial vendors in a “Who's Who in NoSQL DBMSs” report published last year: MongoDB, Cloudant, Couchbase, MarkLogic, Neo, Objectivity, Aerospike, Basho, Oracle, Redis and DataStax.

***

On the Twelfth day of Christmas, big data gave to me
One Hadoop ecosystem
Two data infrastructures
Three processing frameworks
Four data lakes
Five SQL-on-Hadoop
Six BI tools
Seven data wrangling tools
Eight data scientists 
Nine Hadoop vendors
Ten data-driven applications
Eleven NoSQL engines
Twelve data-driven APIs

RESTful APIs are the unseen interfaces that feed the data-driven apps and provide the glue to the back-end systems, ensuring reliable, secure and agile connectivity to the data. They also open up monetization opportunities, enable open data, and support many more uses of big data.

This article is published as part of the IDG Contributor Network.