How to create a data lake for fun and profit

The idea of data lakes has been fermenting, and now real companies are using them for real analysis. Here's why you might want one -- and how to create it


In reality, the people you need are the people you always need: technically adept facilitators who pull the right people with the right knowledge into a room and work through problems. There is no unicorn; we are all the unicorn together.

Start with basic cases and use simple and familiar tools like Tableau (which can connect to Hive) to make nice charts, graphics, and reports demonstrating that, yes, you can do something useful with the data. Bring more stakeholders to the table and generate new ideas for how you can use the data. Advertise the system and its capabilities throughout the organization.

Consider security up front, as well as who can access what data. This will inform the structure of your directories and file locations on HDFS. Deploy Knox to enforce it, because by default HDFS trusts the client much as NFS does. The idealist says: "Oh, you have a project? Go to the data lake." The realist says: "Oh, you have a project? Get the right permissions in your data lake." At least you're not faced with a big, fat project where you need to provision a VM, get a feed from the relevant systems, create a schema to hold the data, and on and on.
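One way to reflect access control in your directory structure is to give each source system its own HDFS directory owned by the right group. A minimal sketch (the paths, users, and group names here are hypothetical, not a recommendation for your organization):

```shell
# Hypothetical layout: one directory per source system, access controlled by group
hdfs dfs -mkdir -p /data/raw/sales /data/raw/crm /data/curated
hdfs dfs -chown -R etl:sales-analysts /data/raw/sales
hdfs dfs -chmod -R 750 /data/raw/sales             # group can read; others cannot
hdfs dfs -setfacl -m group:bi-team:r-x /data/curated  # finer grain via HDFS ACLs
```

Permissions like these are what Knox (and, inside the cluster, Kerberos) actually enforce; without them, anyone who can reach the NameNode can read the data.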

Start with the core Hadoop platform. Don't get fancy at first. Don't launch a massive AI project that replaces the whole organization with your pipe dream of creating Skynet à la Hadoop. Start with bringing the data analytics to the people and making the data more accessible to them. Find a way to let people go fishing in the lake for what they want.

About relational data

Realistically, you can't dump everything in the data lake without messing with it first. As you work through your use cases, you may find the need to flatten some of your data, especially if it came from a relational source.

While Hadoop scales well, any chart of Hive join performance should give you pause. You may ask: How might I flatten this? Take the traditional example of Orders, Order_Items, and Product tables. Anything you do with this data -- except for summarizing orders, which isn't a likely case for analytics -- will join these tables. Why not join them in advance into one flat file?
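To make the idea concrete, here is a minimal sketch in Python, using SQLite as a stand-in for the relational source (the schema and data are hypothetical). It does the three-way join once, up front, producing the kind of flat, denormalized rows you would land in the lake:

```python
import sqlite3

# SQLite stands in for the relational source; the schema is hypothetical.
con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE product (product_id INTEGER PRIMARY KEY, name TEXT, price REAL);
CREATE TABLE orders (order_id INTEGER PRIMARY KEY, customer TEXT, order_date TEXT);
CREATE TABLE order_items (order_id INTEGER, product_id INTEGER, quantity INTEGER);

INSERT INTO product VALUES (1, 'widget', 9.99), (2, 'gadget', 24.50);
INSERT INTO orders VALUES (100, 'alice', '2014-01-15'), (101, 'bob', '2014-01-16');
INSERT INTO order_items VALUES (100, 1, 3), (100, 2, 1), (101, 1, 2);
""")

# Do the join once, in advance, instead of in every analytic query.
flat = con.execute("""
    SELECT o.order_id, o.customer, o.order_date,
           p.name AS product, oi.quantity, oi.quantity * p.price AS line_total
    FROM orders o
    JOIN order_items oi ON oi.order_id = o.order_id
    JOIN product p ON p.product_id = oi.product_id
""").fetchall()

for row in flat:
    print(row)  # each row is self-contained: no joins needed downstream
```

In the lake itself you would run the equivalent join once in Hive (or at ingest time) and write the result out as one wide table.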

Even if you summarize orders, filtering out duplicate rows is generally more efficient than joining many tables, at least up to a point. And even if summarizing were important, there is no reason not to keep two views of the data. I mean, what are you doing -- saving disk space? Cheap storage is part of the magic of Hadoop.
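Keeping two views costs little because the second is derived from the first. A small sketch (with hypothetical data) of rolling an already-flattened detail view up into an order-level summary, so both views can live side by side in the lake:

```python
from collections import defaultdict

# Hypothetical flattened detail rows: (order_id, customer, product, line_total)
flat_rows = [
    (100, "alice", "widget", 29.97),
    (100, "alice", "gadget", 24.50),
    (101, "bob",   "widget", 19.98),
]

# Derive the second, summarized view from the flat one -- no join required.
totals = defaultdict(float)
for order_id, customer, _product, line_total in flat_rows:
    totals[(order_id, customer)] += line_total

summary = sorted((oid, cust, round(total, 2))
                 for (oid, cust), total in totals.items())
print(summary)
```

The same pattern in Hive is just a GROUP BY over the flat table, materialized as its own table or view.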

Next, expand

Once people start using the data lake and every BI project starts with Hadoop, you can expand your capabilities, adding more external tools and demonstrating capabilities like machine learning and pattern finding with Mahout. Maybe you start streaming data in near real time and adding more processing capabilities with Spark. Maybe you materialize common views in HBase. But don't get derailed along the way. Lake security may have business unit implications, but you shouldn't have a lot of mini lakes (aka data ponds) that are separate and not equal.

If all this still seems a bit confusing, here's the quick and easy version:

  1. Identify a few use cases
  2. Build the lake
  3. Get data from lots of different sources into the lake
  4. Provide a variety of fishing poles and see who lands the biggest and best trout (or generates the most interesting data-backed factoid)

Granted, the more technical analysts will eventually do much of the work, and there is always a risk of misinterpretation. But getting the data in the hands of the people and letting them play with it is good for your lake and your business.

This article, "How to create a data lake for fun and profit," was originally published on Andrew Oliver's Strategic Developer blog.

Copyright © 2014 IDG Communications, Inc.
