Consider, for instance, a fairly typical non-Hadoop enterprise website that handles commercial transactions. According to Sarah Sproehnle, Director of Educational Services for Cloudera, the logs from one Cloudera customer's popular site would undergo a nightly ETL (extract, transform, and load) run that could take up to three hours before depositing the data in a data warehouse; a stored procedure would then kick off, and after another two hours the cleansed data would finally reside in the warehouse. The final data set, though, was only a fifth of its original size, meaning that whatever value the rest of the original data held was simply lost.
After Hadoop was integrated into this organization, the time and effort involved dropped dramatically. Instead of undergoing an ETL operation, the log data from the Web servers was sent, in its entirety, straight into HDFS within Hadoop. From there, the same cleansing procedure was performed on the log data, only now using MapReduce jobs. Once cleaned, the data was then sent to the data warehouse. The operation was much faster, thanks to the removal of the ETL step and the speed of the MapReduce processing. And all of the data was still being held within Hadoop, ready for any additional questions the site's operators might come up with later.
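To make that cleansing step concrete, here is a minimal sketch of what such a MapReduce pass might look like. It assumes a simple space-delimited access-log format; the class name and field layout are illustrative, not the customer's actual job.

```java
import java.io.IOException;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// Hypothetical map-only cleansing pass: each raw web-server log line is parsed,
// malformed records are dropped, and only the fields the warehouse needs are kept.
public class LogCleanseMapper
        extends Mapper<LongWritable, Text, NullWritable, Text> {

    @Override
    protected void map(LongWritable offset, Text line, Context context)
            throws IOException, InterruptedException {
        // Assume a space-delimited format: ip timestamp url status ...
        String[] fields = line.toString().split(" ");
        if (fields.length < 4) {
            return; // skip malformed lines rather than failing the whole job
        }
        // Emit only the columns the downstream warehouse load expects
        String cleaned = fields[0] + "\t" + fields[1] + "\t" + fields[2] + "\t" + fields[3];
        context.write(NullWritable.get(), new Text(cleaned));
    }
}
```

Because the raw logs remain in HDFS after the job runs, a different question later only requires writing another job against the same data, not re-extracting it from the source systems.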
This is a critical point to understand about Hadoop: It should never be thought of as a replacement for your existing infrastructure, but rather as a tool to augment your data management and storage capabilities. Using tools like Apache Sqoop, which can move data between relational databases and Hadoop in both directions, or Apache Flume, which can stream system and application logs into Hadoop in near real time, you can connect your existing systems with Hadoop and have your data processed no matter the size. All you need to do is add nodes to Hadoop to handle the storage and the processing.
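Sqoop and Flume each have their own configuration and command-line workflows, but existing systems can also land data in HDFS directly through Hadoop's own FileSystem Java API. The sketch below shows that simpler route; the NameNode URI and file paths are assumptions made up for the example.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

// Illustrative sketch: copy a day's raw web-server log from local disk into HDFS
// using the standard Hadoop FileSystem API.
public class RawLogLoader {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        conf.set("fs.defaultFS", "hdfs://namenode:8020"); // assumed NameNode address

        FileSystem fs = FileSystem.get(conf);
        Path local = new Path("/var/log/httpd/access_log");          // example source file
        Path remote = new Path("/data/raw/weblogs/access_log");      // example HDFS target

        // copyFromLocalFile(delSrc=false, overwrite=true, src, dst)
        fs.copyFromLocalFile(false, true, local, remote);
        fs.close();
    }
}
```

For recurring feeds, a Flume agent or scheduled Sqoop import would typically replace a one-off copy like this, but the principle is the same: the existing system keeps doing its job, and Hadoop simply receives a full copy of the data.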
Required hardware and costs
So how much hardware are we talking?
Estimates for the hardware needed for Hadoop vary a bit, depending on whom you ask. Cloudera's list is detailed and specific about what a typical Hadoop slave node should include:
- Midrange processor
- 4GB to 32GB of memory
- 1 GbE network connection to each node, with a 10 GbE top-of-rack switch
- A dedicated switching infrastructure to avoid Hadoop saturating the network
- 4 to 12 drives per machine, non-RAID
Hortonworks, another Hadoop distributor, has similar specs, though it is a little vaguer on the networking side, given the varying workloads any given organization may run on its Hadoop cluster.
"As a rule of thumb, watch the ratio of network-to-computer cost and aim for network cost being somewhere around 20 percent of your total cost. Network costs should include your complete network, core switches, rack switches, any network cards needed, etc.," wrote Hortonworks CTO Eric Baldeschwieler.
For its part, Cloudera estimates hardware costs of anywhere from $3,000 to $7,000 per node, depending on the configuration you settle on.
Sproehnle also outlined a fairly easy-to-follow rule of thumb for planning your Hadoop capacity. Because Hadoop scales linearly, every node you add increases your storage and processing power by a fixed increment, which makes planning straightforward.
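As a rough illustration of how that linear scaling and the per-node cost estimate play out, here is a back-of-the-envelope sketch. The node count, drive size, and replication factor are assumptions chosen only for the example, not figures from Cloudera or Hortonworks.

```java
// Back-of-the-envelope capacity and cost estimate for a small Hadoop cluster.
public class ClusterEstimate {
    public static void main(String[] args) {
        int nodes = 10;               // slave nodes in the cluster (assumed)
        int drivesPerNode = 8;        // within Cloudera's 4-to-12-drive guideline
        double driveTerabytes = 2.0;  // assumed drive size
        int replicationFactor = 3;    // HDFS default replication

        double rawTb = nodes * drivesPerNode * driveTerabytes;
        double usableTb = rawTb / replicationFactor;

        // Cloudera's per-node hardware estimate of $3,000 to $7,000
        double lowCost = nodes * 3000.0;
        double highCost = nodes * 7000.0;

        System.out.printf("Raw capacity:    %.0f TB%n", rawTb);
        System.out.printf("Usable capacity: %.1f TB (replication factor %d)%n",
                usableTb, replicationFactor);
        System.out.printf("Hardware cost:   $%,.0f to $%,.0f%n", lowCost, highCost);
    }
}
```

Doubling the node count in this sketch doubles both the usable capacity and the processing power, which is exactly the linear behavior that makes the planning exercise so simple.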