Yes, you can haz big data. However, you can haz it the right way or the wrong way. Here are the top 10 worst practices to avoid.
1. Choosing MongoDB as your big data platform. Why am I picking on MongoDB? I'm not, but for whatever reason, the NoSQL database most abused at this point is MongoDB. While MongoDB has an aggregation framework that tastes like MapReduce and even a (very poorly documented) Hadoop connector, its sweet spot is as an operational database, not an analytical system.
When your sentence begins, "We will use Mongo to analyze ...," stop right there and think about what you're doing. Sometimes you really mean "collect for later analysis," which might be OK, depending on what you're doing. However, if you really mean you're going to use MongoDB as some kind of sick data-warehousing technology, your project may be doomed at the start.
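To make the distinction concrete, here's a hedged sketch of the kind of operational roll-up MongoDB's aggregation framework is actually good at -- not a data-warehouse workload. The collection and field names (`orders`, `status`, `customer_id`, `total`) are hypothetical, and the plain-Python function just emulates what the pipeline computes so the intent is visible without a running server:

```python
# Hypothetical pipeline: the operational roll-up MongoDB's aggregation
# framework handles well. Collection/field names are assumptions.
pipeline = [
    {"$match": {"status": "shipped"}},
    {"$group": {"_id": "$customer_id", "spend": {"$sum": "$total"}}},
]
# Against a live server you would run: db.orders.aggregate(pipeline)

def aggregate(docs):
    """Plain-Python emulation of the $match/$group stages above."""
    totals = {}
    for doc in docs:
        if doc["status"] == "shipped":            # $match stage
            key = doc["customer_id"]              # $group _id
            totals[key] = totals.get(key, 0) + doc["total"]  # $sum
    return totals

orders = [
    {"customer_id": "a", "status": "shipped", "total": 20},
    {"customer_id": "a", "status": "shipped", "total": 5},
    {"customer_id": "b", "status": "returned", "total": 9},
]
print(aggregate(orders))  # {'a': 25}
```

A quick per-customer sum over live operational data is Mongo's sweet spot; scanning years of history with dozens of such pipelines is where you want an actual analytical system.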
2. Using RDBMS schema as files. Yeah, you dumped each table from your RDBMS into a file. You plan to store that on HDFS. You plan to use Hive on it.
First off, you know Hive is slower than your RDBMS for anything normal, right? It's going to MapReduce even a simple select. Look at the "optimized" route for "table" joins. Next, let's look at row sizes -- whaddaya know, you have flat files measured in single-digit kilobytes. Hadoop does best on large sets of relatively flat data. I'm sure you can create an extract that's more denormalized.
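What does a denormalized extract look like? Here's a minimal sketch using SQLite as a stand-in for your RDBMS (the `customers`/`orders` schema is hypothetical): instead of dumping each normalized table to its own file, join them into one wide, flat, tab-delimited extract -- the shape Hadoop handles well:

```python
import sqlite3

# Hypothetical normalized schema standing in for your RDBMS.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT, region TEXT);
    CREATE TABLE orders (id INTEGER PRIMARY KEY, customer_id INTEGER,
                         item TEXT, total REAL);
    INSERT INTO customers VALUES (1, 'Acme', 'East'), (2, 'Globex', 'West');
    INSERT INTO orders VALUES (10, 1, 'widget', 19.99), (11, 2, 'widget', 5.0);
""")

# One wide row per order: repeat the customer columns in each row
# rather than preserving the RDBMS normalization on HDFS.
rows = conn.execute("""
    SELECT o.id, c.name, c.region, o.item, o.total
    FROM orders o JOIN customers c ON c.id = o.customer_id
    ORDER BY o.id
""").fetchall()

for row in rows:
    print("\t".join(str(col) for col in row))  # flat, tab-delimited extract
```

The join happens once, at extract time, in the database that's good at joins -- instead of over and over in MapReduce every time someone queries the Hive table.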
3. Creating data ponds. On your way to creating a data lake, you took a wrong turn and created a series of data ponds instead. Conway's law has struck again: you've let each business group create not only its own analysis of the data but its own mini-repository. That doesn't sound bad at first, but with different extracts and different ways of slicing and dicing the data, you end up with different views of the data. I don't mean flat versus cube -- I mean different answers to some of the same questions. Schema-on-read doesn't mean "don't plan at all"; it means "don't try to plan for every question you might ask."
Nonetheless, you should plan for the big picture. If you sell widgets, there is a good chance someone's going to want to see how many, to whom, and how often you sold widgets. Go ahead and get that in the common formats and do a little up-front design to make sure you don't end up with data ponds and puddles owned by each individual business group.
4. Failing to develop plausible use cases. The idea of the data lake is being sold by vendors as a substitute for real use cases. (It's also a way to escape the constraints of departmental funding.) The data-lake approach can be valid, but you should have actual use cases in mind. It isn't hard to come up with them in most midsize to large enterprises. Start by reviewing the last time someone said, "No, we can't, because the database can't handle it." Then move on to the "duh" cases. For instance, "business development" isn't supposed to be just a titular promotion for your top salesperson; it's supposed to mean something.
What about, say, using Mahout to find customer orders that are recurring outliers? In most companies, most customer orders resemble one another. But what about the orders that happen often enough yet don't match the common ones? They may be too small for salespeople to care about, but they may indicate a future line of business for your company (that is, actual business development). If you can't drum up at least a couple of good real-world uses for Hadoop, maybe you don't need it after all.
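The idea fits in a few lines. Mahout would do this at scale with clustering over real feature vectors; this is only a toy sketch, with orders reduced to made-up (item, size-band) signatures, to show what "recurs, but isn't the dominant pattern" means:

```python
from collections import Counter

# Toy data: orders reduced to hypothetical (item, quantity-band) signatures.
orders = (
    [("widget", "small")] * 50   # the dominant, everyday order
    + [("widget", "large")] * 4  # recurs, but looks nothing like the norm
    + [("gadget", "small")] * 1  # a one-off, probably noise
)

counts = Counter(orders)
top = counts.most_common(1)[0][1]  # size of the dominant pattern

# "Recurring outliers": seen more than once, but well under the top pattern.
recurring_outliers = [sig for sig, n in counts.items() if 1 < n < 0.2 * top]
print(recurring_outliers)  # [('widget', 'large')]
```

The one-off gets filtered as noise, the everyday order gets filtered as normal, and what's left -- the order that keeps showing up but doesn't fit -- is the candidate for actual business development.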
5. Thinking Hive is the be-all and end-all. You know SQL. You like SQL. You've been doing SQL for years. I get it, man, but maybe you can grow, too? Reach back a decade or three and remember the young kid who learned SQL and saw the worlds it opened up. Now imagine that kid learning something new alongside it.