If your data is growing by 1TB a month, for instance, here's how to plan: Hadoop replicates data three times by default, so you will need 3TB of raw storage to accommodate each new terabyte. Allow a little extra space for data processing operations (Sproehnle estimates 30 percent overhead), and the actual need comes to roughly 4TB that month. If each of your nodes is a machine with four 1TB drives, that works out to one new node per month.
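To put that math in one place, here's a quick back-of-the-envelope sketch in Python. The constants mirror the example above (Hadoop's default 3x replication, Sproehnle's 30 percent overhead estimate, and a hypothetical node with four 1TB drives); the function name is illustrative, not part of any Hadoop tooling:

```python
import math

REPLICATION_FACTOR = 3      # Hadoop's default replication factor
PROCESSING_OVERHEAD = 0.30  # Sproehnle's estimate for processing scratch space
DRIVES_PER_NODE = 4         # example node: four drives...
DRIVE_SIZE_TB = 1           # ...of 1TB each (4TB per node)

def nodes_needed_per_month(new_data_tb):
    """Estimate how many new nodes a month's data growth requires."""
    raw_tb = new_data_tb * REPLICATION_FACTOR        # 1TB becomes 3TB replicated
    total_tb = raw_tb * (1 + PROCESSING_OVERHEAD)    # 3TB becomes ~3.9TB, call it 4TB
    node_capacity_tb = DRIVES_PER_NODE * DRIVE_SIZE_TB
    return math.ceil(total_tb / node_capacity_tb)

print(nodes_needed_per_month(1))  # -> 1 new node per month
```

Swap in your own growth rate and node specs to get a rough monthly provisioning target.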
The nice thing is that new nodes are put to use as soon as they are connected, so processing and storage capacity scale roughly linearly with the number of nodes in the cluster.
Installing and managing Hadoop nodes is not exactly trivial, but there are many tools out there that can help. Cloudera Manager, Apache Ambari (which Hortonworks uses for its management system), and the MapR Control System are all capable Hadoop cluster managers. If you are using a "pure" Apache Hadoop solution, you can also look at Platform Symphony MapReduce, StackIQ Rocks + Big Data, and Zettaset Data Platform as third-party Hadoop management systems.
This is just the tip of the iceberg, of course, when it comes to deploying a Hadoop solution for your organization. Perhaps the biggest take-away is understanding that Hadoop is not meant to replace your current data infrastructure, only augment it.
Once this important distinction is made, it becomes easier to start thinking about how Hadoop can help your organization without ripping out the guts of your existing data processes.