How to create a data lake for fun and profit

The architecture of a data lake is simple; what matters more is how you draw and use the data


Most credit James Dixon of the open source BI vendor Pentaho with coining the phrase "data lake." Think of a data lake as an unstructured data warehouse, a place where you pull in all of your different sources into one large "pool" of data.

In contrast to a data mart, a data lake won't "wash" the data or try to structure it or limit the use cases. Sure, you should have some use cases in mind, but the architecture of a data lake is simple: a Hadoop Distributed File System (HDFS) with lots of directories and files on it.

Why would you want a data lake?
The answers are both technical and political. Usually, when you start up any new project that involves analyzing your company's data -- especially when the data is stored across functional areas -- you're in for trouble. For example, if the business unit that wants the data isn't part of the unit providing the data, what kind of priority do you think the unit providing the data will likely assign to the effort? How is it budgeted? Who does the integration and how much needs to be done? How do you structure the data and for what purposes?

Assuming you can sort all that out, when you're done, you have a system that can answer only a few preset questions. The next time you need more, you have a whole new project.

The data lake model turns all this on its head. Getting access to the data doesn't require an integration effort, because the data is already there. To start a new project, you merely request the appropriate role or group access (which in most corporate environments means changing Active Directory group assignments). Everything you need is already in the lake, and you can apply MapReduce, among other processing models, to start crunching it.

Unstructured? Really?
Well, that may be a bit overstated. It isn't that all the data is unstructured; it's that we won't try to perfect a schema through BDUF (big design up front). You don't know all of the use cases for your data, so how can you know the perfect structure?

Some data is unstructured or not structured by us for a given project, but much of it comes from source systems that structure it differently than we need. A better term for how we store data in the lake is "schema on read" rather than the traditional "schema on write" (or, in some companies, "schema designed months before your first write"). We'll structure the data to the questions rather than attempting to structure the questions to the problems.
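To make "schema on read" concrete, here is a minimal Python sketch. It is an illustration under stated assumptions, not how Hive or any other Hadoop tool implements it: the sample feed, the column positions, and the names RAW_ORDERS, REVENUE_SCHEMA, VOLUME_SCHEMA, and read_with_schema are all hypothetical. The point is that the raw lines are stored untouched, and each question brings its own small schema when it reads them.

```python
import csv
import io

# Raw feed kept exactly as it arrived from a hypothetical source system.
# Assumed column order: order id, order date, customer, amount.
RAW_ORDERS = """\
1001,2014-03-01,ACME,250.00
1002,2014-03-02,Globex,99.50
1003,2014-03-02,ACME,410.25
"""

# Two hypothetical schemas. Each names only the columns one question needs
# (column index, parser). Neither changes the stored data.
REVENUE_SCHEMA = {"customer": (2, str), "amount": (3, float)}
VOLUME_SCHEMA = {"order_date": (1, str)}


def read_with_schema(raw_text, schema):
    """Yield one dict per raw line, shaped by the schema chosen at read time."""
    for row in csv.reader(io.StringIO(raw_text)):
        yield {name: parse(row[idx]) for name, (idx, parse) in schema.items()}


def revenue_by_customer(raw_text):
    """Question 1: total order amount per customer."""
    totals = {}
    for rec in read_with_schema(raw_text, REVENUE_SCHEMA):
        totals[rec["customer"]] = totals.get(rec["customer"], 0.0) + rec["amount"]
    return totals


def orders_per_day(raw_text):
    """Question 2: number of orders per date, read from the same raw lines."""
    counts = {}
    for rec in read_with_schema(raw_text, VOLUME_SCHEMA):
        counts[rec["order_date"]] = counts.get(rec["order_date"], 0) + 1
    return counts
```

A third question next month means a third small schema, not a new integration project.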

How you go about constructing a lake
Remember how we talked about not planning for all use cases? Well, that's true, but it's hard to construct a lake without thinking about any use cases. You should have some in mind. Some may be existing ones, but generally, there is always something that your company wanted to do but couldn't get the data together to execute on. Sometimes you pick obvious, albeit theoretical cases based on your knowledge of the systems you have, the data they contain, and the possibilities for that data.

You'll need to learn some of the Hadoop stack, such as Sqoop, Oozie, and Flume -- and obtain feeds from your existing systems. Getting this process under way is the bulk of the grunt work; the rest ends up being more of an intellectual exercise.

Next, find a unicorn (aka data scientist), shoot the unicorn in the head because it's probably a shyster anyhow, and drink its blood Voldemort-style. Actually, you won't have to do that, because data scientists do not exist. Data scientists supposedly know advanced mathematics, artificial intelligence, and computer science, and they understand Hadoop -- as well as business and your business data in particular. In addition, they walk on water, bake gluten-free vegan bread that doesn't taste like sawdust, conjure good spirits, and sell you timeshares cheap.

In reality, the people you need are the people you always need: technically adept facilitators who pull the right people with the right knowledge into a room and work through problems. There is no unicorn; we are all the unicorn together.

Start with basic cases and use simple and familiar tools like Tableau (which can connect to Hive) to make nice charts, graphics, and reports demonstrating that, yes, you can do something useful with the data. Bring more stakeholders to the table and generate new ideas for how you can use the data. Advertise the system and its capabilities throughout the organization.

Consider security up front, as well as who can access what data. This will inform the structure of your directories and file locations on HDFS. Deploy Knox to enforce it, because by default HDFS trusts the client the same way that NFS does. The idealist says: "Oh, you have a project, go to the data lake." The realist says: "Oh, you have a project, get the right permissions in your data lake." At least you're not faced with a big, fat project where you need to provision a VM, get a feed from the relevant systems, create a schema to hold the data, and on and on.
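One way to think about how access rules shape the directory tree is to write the plan down as data before you create anything. The Python sketch below is only a planning aid under assumptions of my own: the /lake root, the business-unit/dataset layout, the group names, and the Zone class are hypothetical, and it does not talk to HDFS or Knox at all. It relies on one fact: HDFS directories carry POSIX-style owner, group, and mode bits.

```python
from dataclasses import dataclass
import posixpath

LAKE_ROOT = "/lake"  # hypothetical root directory for the lake


@dataclass(frozen=True)
class Zone:
    """One planned directory and the group that is meant to read it."""
    business_unit: str
    dataset: str
    group: str
    restricted: bool

    @property
    def path(self):
        # Assumed layout: /lake/<business unit>/<dataset>
        return posixpath.join(LAKE_ROOT, self.business_unit, self.dataset)

    @property
    def mode(self):
        # 750: owner full access, group may read and list, others nothing.
        # 755: everyone else may read and list as well.
        return "750" if self.restricted else "755"


def can_read(zone, user_groups):
    """Planning check: would a user in these groups be allowed to read?"""
    return (not zone.restricted) or zone.group in user_groups


invoices = Zone("finance", "invoices", "finance-analysts", restricted=True)
weblogs = Zone("marketing", "weblogs", "marketing-analysts", restricted=False)
```

Here invoices.path is /lake/finance/invoices with mode 750, so the realist's "get the right permissions" amounts to being added to finance-analysts, while weblogs stays open to everyone with lake access.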

Start with the core Hadoop platform. Don't get fancy at first. Don't launch a massive AI project that replaces the whole organization with your pipe dream of creating Skynet à la Hadoop. Start with bringing the data analytics to the people and making the data more accessible to them. Find a way to let people go fishing in the lake for what they want.

About relational data
Realistically, you can't dump everything in the data lake without messing with it first. As you work through your use cases, you may find the need to flatten some of your data, especially if it came from a relational source.

While Hadoop scales well, a look at any chart showing how Hive performs with joins should give you pause. You may ask: How might I flatten this? For example, take the traditional example of Orders, Order_Items, and Product tables. Anything you do with this data -- except for summarizing orders, which isn't a likely case for analytics -- will join these tables. Why not join them in advance into one flat file?

Even if you summarize orders, filtering out duplicate rows is generally more efficient than joining many tables, at least up to a point. Even if summarizing were important, there is no reason not to have two views of the data. I mean, what are you doing -- saving disk space? Cheap storage is part of the magic of Hadoop.
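The flattening idea can be sketched in a few lines of plain Python. This is a toy under stated assumptions: the three tables are tiny in-memory lists with made-up columns and values, and the functions flatten and summarize_orders are hypothetical names. On a real lake you would do the same thing at scale with Hive or a similar tool.

```python
# Hypothetical sample rows standing in for the three relational tables.
ORDERS = [
    {"order_id": 1, "customer": "ACME", "order_date": "2014-03-01"},
    {"order_id": 2, "customer": "Globex", "order_date": "2014-03-02"},
]
ORDER_ITEMS = [
    {"order_id": 1, "product_id": 10, "quantity": 2},
    {"order_id": 1, "product_id": 11, "quantity": 1},
    {"order_id": 2, "product_id": 10, "quantity": 5},
]
PRODUCTS = [
    {"product_id": 10, "name": "widget", "unit_price": 4.0},
    {"product_id": 11, "name": "gadget", "unit_price": 10.0},
]


def flatten(orders, order_items, products):
    """Join the three tables once, producing one wide row per order item."""
    orders_by_id = {o["order_id"]: o for o in orders}
    products_by_id = {p["product_id"]: p for p in products}
    flat = []
    for item in order_items:
        order = orders_by_id[item["order_id"]]
        product = products_by_id[item["product_id"]]
        flat.append({
            "order_id": order["order_id"],
            "customer": order["customer"],
            "order_date": order["order_date"],
            "product": product["name"],
            "quantity": item["quantity"],
            "line_total": item["quantity"] * product["unit_price"],
        })
    return flat


def summarize_orders(flat_rows):
    """Recover one row per order by filtering out the repeated order columns."""
    seen = {}
    for row in flat_rows:
        key = row["order_id"]
        seen.setdefault(key, (key, row["customer"], row["order_date"]))
    return list(seen.values())


FLAT = flatten(ORDERS, ORDER_ITEMS, PRODUCTS)
```

The join happens once, when the flat file is written. Every later question reads the wide rows directly, and the order summary falls out of a duplicate filter rather than another join.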

Next, expand
Once people start using the data lake and every BI project starts with Hadoop, you can expand your capabilities, adding more external tools and demonstrating capabilities like machine learning and pattern finding with Mahout. Maybe you start streaming data "real time" and adding more processing capabilities with Spark. Maybe you materialize common views in HBase. But don't get derailed along the way. Lake security may have business unit implications, but you shouldn't have a lot of mini lakes (aka data ponds) that are separate and not equal.

If all this still seems a bit confusing, here's the quick and easy version:

  1. Identify a few use cases
  2. Build the lake
  3. Get data from lots of different sources into the lake
  4. Provide a variety of fishing poles and see who lands the biggest and best trout (or generates the most interesting data-backed factoid)

Granted, the more technical analysts will eventually do much of the work, and there is always a risk of misinterpretation. But getting the data in the hands of the people and letting them play with it is good for your lake and your business.

Copyright © 2014 IDG Communications, Inc.
