A statistic published earlier this year caught my eye. According to Gartner, through 2018, 70 percent of Hadoop deployments will fail to meet cost savings and revenue generation objectives due to skills and integration challenges.
Skills and integration challenges. With all these vendors claiming to make big data easy and smooth, where do the difficulties remain? Let’s look at the first, upstream part of a big data project.
Collecting data
The whole concept of big data, or total data, and how to collect it and get it to the data lake can sound scary, but it becomes less so if you break down the data collection problem into subsets.
- Data from traditional sources: Your transactional systems -- accounting, HR systems, and so on -- are already being used as data sources for analytics, and ETL processes are already in place to collect this data. You basically end up with two options: either duplicate these ETL processes, swapping the target from the enterprise data warehouse (EDW) to the data lake, or replicate your EDW into the data lake -- physically by copying the data, or virtually by embracing the virtual data lake architecture (a variation of the virtual data warehouse). Both options are sketched in the code after this list.
- Structured data from the Internet of things: The main complexity with sensor and other machine data is the volume and the throughput required for proper and timely ingestion. But this data is typically very standardized, and upstream data transformation requirements are not immense.
- Unstructured data: Collecting media files and textual data is one thing that big data platforms such as Hadoop actually make easy. Because their storage is schema-less, all that’s needed is to “dump” this data in the data lake -- and figure it out later.
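To make the first and third bullets concrete, here is a minimal PySpark sketch of both patterns: re-pointing an extract of a transactional table at the lake instead of the EDW, and dumping raw files into a landing zone to be interpreted later. The connection details, table names, and HDFS paths are hypothetical placeholders, not a prescription.

```python
# Minimal PySpark sketch of two collection patterns; all names/paths are hypothetical.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("collect-into-lake").getOrCreate()

# Pattern 1: replicate a transactional table into the data lake by
# re-pointing the extract at HDFS instead of the EDW.
orders = (spark.read.format("jdbc")
          .option("url", "jdbc:postgresql://erp-host:5432/erp")  # hypothetical source system
          .option("dbtable", "public.orders")
          .option("user", "etl_user")
          .option("password", "***")
          .load())
orders.write.mode("overwrite").parquet("hdfs:///lake/raw/erp/orders")

# Pattern 2: "dump" unstructured or semi-structured data as-is into a landing
# zone and defer any schema decisions -- schema-on-read figures it out later.
raw_logs = spark.read.text("hdfs:///lake/landing/web_logs/*")  # no schema imposed
raw_logs.write.mode("append").text("hdfs:///lake/raw/web_logs")
```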
Given the proper ETL tools and APIs/connectors, as well as the right throughput, big data collection isn’t the most difficult part of the big data equation.
Storing data
Big data platforms are polymorphic -- they can store all kinds of data, and this data can be represented and accessed through different prisms. From simple file storage to relaxed-consistency NoSQL databases to Third-Normal-Form and even Fifth-Normal-Form relational databases, from direct reads to columnar-style access to transactional SQL, there is an answer to every storage and data-access need.
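One way to picture the “different prisms” idea is the same set of files in the lake being read directly as raw data and, a moment later, queried relationally with SQL. The sketch below assumes the hypothetical paths and column names used earlier.

```python
# Same data, two access prisms; paths and columns are hypothetical.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("storage-prisms").getOrCreate()

# Prism 1: direct read of the raw files, schema inferred on read.
orders = spark.read.parquet("hdfs:///lake/raw/erp/orders")
orders.printSchema()

# Prism 2: the same bytes exposed as a relational table and queried with SQL.
orders.createOrReplaceTempView("orders")
top_customers = spark.sql("""
    SELECT customer_id, SUM(amount) AS total_spent
    FROM orders
    GROUP BY customer_id
    ORDER BY total_spent DESC
    LIMIT 10
""")
top_customers.show()
```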
Because of its fundamental design concepts, the platform is infinitely scalable. Provision it in the cloud, and it becomes elastic. Conceptually at least, storing big data is the easiest part of the big data equation.
Where it becomes tricky is how to make it work in reality. From the core Hadoop platform to the commercial distributions to the hybrid platforms offered by database vendors, there are many options, many price points, many different variations of the concept, and many skill levels required.
Using data
Once you have all this data in the data lake, how do you bring it all together? Transforming and reconciling data, ensuring consistency across sources, checking data quality -- this is the hard part of the big data story, and it is where the least automation and help are available.
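A hedged sketch of what that manual work looks like in practice: aligning keys, normalizing values, and applying basic quality checks before two sources can even be joined. The source paths, column names, and matching rules here are hypothetical; real reconciliation logic is rarely this simple.

```python
# Hand-written reconciliation across two sources; all names are hypothetical.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("reconcile-sources").getOrCreate()

crm = spark.read.parquet("hdfs:///lake/raw/crm/customers")         # from the EDW extract
clicks = spark.read.json("hdfs:///lake/raw/web_logs/clickstream")  # dumped semi-structured data

# Reconcile the join key: the CRM stores mixed-case e-mails, the logs do not.
crm_clean = (crm
             .withColumn("email", F.lower(F.trim(F.col("email"))))
             .dropDuplicates(["email"]))

clicks_clean = (clicks
                .filter(F.col("user_email").isNotNull())           # basic quality gate
                .withColumn("email", F.lower(F.col("user_email"))))

# Cross-source view: sessions per known customer -- none of this glue is automated for you.
sessions_per_customer = (clicks_clean
                         .join(crm_clean, on="email", how="inner")
                         .groupBy("customer_id")
                         .agg(F.countDistinct("session_id").alias("sessions")))
sessions_per_customer.show()
```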
If you need to build an application on top of a specific data source or to report on top of a consistent data set, many solutions exist that will automate the process and make it seamless.
But cross the boundaries between sources to explore and leverage heterogeneous data, and you are on your own. This is where vendors who claim to make big data easy should step in and help.