Most credit James Dixon of the open source BI vendor Pentaho with coining the phrase "data lake." Think of a data lake as an unstructured data warehouse, a place where you pull in all of your different sources into one large "pool" of data.
In contrast to a data mart, a data lake won't "wash" the data or try to structure it or limit the use cases. Sure, you should have some use cases in mind, but the architecture of a data lake is simple: a Hadoop Distributed File System (HDFS) with lots of directories and files on it.
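To make "lots of directories and files" concrete, here is a minimal sketch in Python. It uses a local temporary directory as a stand-in for HDFS, and the `raw/<source>/<date>/` layout, the function names, and the sample feeds are all assumptions made for illustration, not a convention the lake requires. The one point it tries to show is that a feed is stored byte-for-byte, with no washing on the way in.

```python
import tempfile
from datetime import date
from pathlib import Path
from typing import List


def land_raw(lake_root: Path, source: str, day: date, filename: str, payload: bytes) -> Path:
    """Store a feed byte-for-byte, filed by source system and arrival date.

    The raw/<source>/<date>/ layout is a hypothetical convention for this sketch.
    """
    target_dir = lake_root / "raw" / source / day.isoformat()
    target_dir.mkdir(parents=True, exist_ok=True)
    target = target_dir / filename
    target.write_bytes(payload)  # no parsing, cleaning, or reshaping on the way in
    return target


def list_lake(lake_root: Path) -> List[str]:
    """Return every file in the lake as a path relative to the lake root."""
    return sorted(
        p.relative_to(lake_root).as_posix()
        for p in lake_root.rglob("*")
        if p.is_file()
    )


if __name__ == "__main__":
    # A temporary local directory stands in for HDFS here.
    with tempfile.TemporaryDirectory() as tmp:
        lake = Path(tmp)
        land_raw(lake, "crm", date(2014, 7, 1), "contacts.csv", b"id,name\n1,Ada\n")
        land_raw(lake, "weblogs", date(2014, 7, 1), "access.log", b'10.0.0.1 "GET /" 200\n')
        for path in list_lake(lake):
            print(path)
```

On a real cluster the same idea would go through HDFS tooling rather than `pathlib`; the sketch only illustrates the shape of the storage.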
Why would you want a data lake?
The answers are both technical and political. Usually, when you start up any new project that involves analyzing your company's data -- especially when the data is stored across functional areas -- you're in for trouble. For example, if the business unit that wants the data isn't part of the unit providing the data, what kind of priority do you think the unit providing the data will assign to the effort? How is it budgeted? Who does the integration and how much needs to be done? How do you structure the data and for what purposes?
Assuming you can sort all that out, when you're done, you have a system that can answer only a few preset questions. The next time you need more, you have a whole new project.
The data lake model turns all this on its head. Getting access to the data doesn't require an integration effort, because the data is already there. To start a new project, you merely request the appropriate role or group access (which in most corporate environments means changing Active Directory group assignments). No major integration effort is required; it's all there in the lake, and you can apply MapReduce, among other processing approaches, to start crunching it.
Well, calling the lake "unstructured" may be a bit overstated. It isn't that all the data is unstructured, more that we won't perfect a schema as a BDUF (big design up front). You don't know all of the use cases for your data, so how can you know the perfect structure?
Some data is unstructured or not structured by us for a given project, but much of it comes from source systems that structure it differently than we need. A better term for how we store data in the lake is "schema on read" rather than the traditional "schema on write" (or, in some companies, "schema designed months before your first write"). We'll structure the data to the questions rather than attempting to structure the questions to the data.
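Schema on read is easier to see than to describe, so here is a small sketch in plain Python. The sample order rows, the two schemas, and every function name are hypothetical, invented for illustration. The raw text stays exactly as the source system exported it, and each question brings its own column names and types at read time.

```python
import csv
import io
from typing import Callable, Dict, Iterator

# Raw feed exactly as a source system might export it (hypothetical sample data).
RAW_ORDERS = (
    "1001,2014-06-30,EMEA,249.90\n"
    "1002,2014-07-01,APAC,80.00\n"
    "1003,2014-07-01,EMEA,15.10\n"
)

Schema = Dict[str, Callable[[str], object]]

# Two questions, two schemas over the same stored bytes.
REVENUE_SCHEMA: Schema = {"order_id": int, "day": str, "region": str, "amount": float}
VOLUME_SCHEMA: Schema = {"order_id": str, "day": str}


def read_with_schema(raw: str, schema: Schema) -> Iterator[dict]:
    """Apply names and types to raw rows at read time; the stored text is untouched."""
    names = list(schema)
    for row in csv.reader(io.StringIO(raw)):
        # zip stops at the shorter input, so columns beyond the schema are ignored.
        yield {name: schema[name](value) for name, value in zip(names, row)}


def revenue_by_region(raw: str) -> Dict[str, float]:
    """Question 1: how much revenue came from each region?"""
    totals: Dict[str, float] = {}
    for rec in read_with_schema(raw, REVENUE_SCHEMA):
        totals[rec["region"]] = totals.get(rec["region"], 0.0) + rec["amount"]
    return totals


def orders_per_day(raw: str) -> Dict[str, int]:
    """Question 2: how many orders arrived each day?"""
    counts: Dict[str, int] = {}
    for rec in read_with_schema(raw, VOLUME_SCHEMA):
        counts[rec["day"]] = counts.get(rec["day"], 0) + 1
    return counts


if __name__ == "__main__":
    print(revenue_by_region(RAW_ORDERS))
    print(orders_per_day(RAW_ORDERS))
```

With schema on write, someone would have had to pick one of those structures before the first row landed. Here a third question next month means writing a third schema, not launching a new integration project.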
How you go about constructing a lake
Remember how we talked about not planning for all use cases? Well, that's true, but it's hard to construct a lake without thinking about any use cases. You should have some in mind. Some may be existing ones, but generally, there is always something that your company wanted to do but couldn't get the data together to execute on. Sometimes you pick obvious, albeit theoretical, cases based on your knowledge of the systems you have, the data they contain, and the possibilities for that data.
You'll need to learn some of the Hadoop stack, such as Sqoop (bulk transfers from relational databases), Oozie (workflow scheduling), and Flume (log and event collection), and obtain feeds from your existing systems. Getting this process under way is the bulk of the grunt work; the rest ends up being more of an intellectual exercise.
Next, find a unicorn (aka data scientist), shoot the unicorn in the head because it's probably a shyster anyhow, and drink its blood Voldemort-style. Actually, you won't have to do that, because data scientists do not exist. Data scientists supposedly know advanced mathematics, artificial intelligence, and computer science, and they understand Hadoop -- as well as business and your business data in particular. In addition, they walk on water, bake gluten-free vegan bread that doesn't taste like sawdust, conjure good spirits, and sell you timeshares cheap.