Many of the articles I write are based on projects I'm currently engaged in. Recently, for example, I've found myself recruited in the war against the almighty SAN. You see, with big data projects involving Hadoop, when it's time to procure hardware, you have to do something that many IT organizations haven't done in years: Buy servers with local disks.
That's because the "locality" of resources is central to Hadoop's performance -- while a SAN, by definition, consolidates storage on its own network. Yet buying servers with local disks flies in the face of IT organizations' nearly decade-old practice of purchasing only diskless blades and virtualized storage. Tradition dies hard, which is why some of us have reluctantly said, "Yes, you can run Hadoop with a SAN," then added, under our breath, "... but you shouldn't."
I've done this myself, figuring we'd kick off the project and show how we could "optimize" to local disks later. Let me say this unequivocally: You absolutely should not use a SAN or NAS with Hadoop. To understand why this is such a terrible idea, you have to understand a little about how MapReduce and HDFS work.
First off, HDFS is a distributed file system. Think of it as RAID over the network. Say you had a 10GB file spread across 10 servers, each of whose local disks could burst 1GBps, and assume 10Gbps Ethernet throughout. If you read the whole file from one server, you'd get back 1GBps, and the read would take 10 seconds. But if each server held one-tenth of the file and all 10 read their pieces at once, you'd have the whole 10GB in roughly one second. In essence, that is what HDFS allows: an aggregate burst from your cluster far bigger than the burst you could get from any individual node.
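The arithmetic above can be sketched in a few lines. The figures are the article's illustrative numbers, not benchmarks from any real cluster:

```python
# Back-of-the-envelope comparison: one server reading a whole file
# versus ten servers each reading their local one-tenth slice.

FILE_GB = 10       # total file size, per the example above
NODES = 10         # servers, each holding an equal slice on local disk
DISK_GBPS = 1.0    # burst read rate per local disk, GB/s

# Reading the entire file from a single server:
sequential_seconds = FILE_GB / DISK_GBPS              # 10 GB / 1 GB/s

# Reading all slices in parallel, one slice per server:
slice_gb = FILE_GB / NODES                            # 1 GB each
parallel_seconds = slice_gb / DISK_GBPS               # 1 GB / 1 GB/s

print(f"one server:  {sequential_seconds:.0f} s")     # 10 s
print(f"ten servers: {parallel_seconds:.0f} s")       # 1 s
```

Route all of those reads through a SAN instead, and every node's slice has to cross the same storage network -- the aggregate burst collapses back toward the single-pipe case.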
Second, the principal idea behind MapReduce is that the problem is broken into pieces and sent to each node, where partial answers are calculated in parallel, then sent back and combined (reduced). If every one of those pieces has to cross extra network hops, with added latency and multiple nodes contending for the same shared storage, you defeat the "high performance" reason for choosing Hadoop in the first place.
Sure, you can make it not so bad by tacking each server to a different vPath and so on, but you're still defiling your sports car with cheapo ethanol econo-gas, an automatic transmission, and $40 tires. It'll work, but why didn't you just buy a sensible file server or RDBMS and go home?
Unfortunately, core IT doesn't like special cases -- so mark my words, conventional thinking about where storage should go presents a key opportunity for EMC and other vendors. Look for appliances that shove bits of Hadoop down to the hardware layer in a hybrid SAN/server setup to come out in the coming year or two. For now, however, stick to your guns and keep your HDFS distributed to local high-performance disks.
This article, "Never, ever do this to Hadoop," was originally published at InfoWorld.com. Keep up on the latest news in application development and read more of Andrew Oliver's Strategic Developer blog at InfoWorld.com. For the latest business technology news, follow InfoWorld.com on Twitter.