The fabric of your storage
It all comes down to the fabric involved and the performance of the network. For organizations frequently analyzing big data, a separate infrastructure may be warranted, simply because as the number of compute nodes in a cluster grows, so does the communication overhead. Typically, a multinode compute cluster using HDFS will create a great deal of traffic across the network backbone while processing big data. That occurs because Hadoop spreads the data (along with the compute resources) across the member servers of the cluster.
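To make that distribution concrete, here is a minimal sketch using the standard Hadoop Java client that asks the NameNode where the blocks of a single file actually live; the NameNode URI and the file path are hypothetical placeholders, not details from this article.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.BlockLocation;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class BlockLocality {
    public static void main(String[] args) throws Exception {
        // Assumed cluster address; replace with your own NameNode URI.
        Configuration conf = new Configuration();
        conf.set("fs.defaultFS", "hdfs://namenode:8020");

        FileSystem fs = FileSystem.get(conf);
        Path file = new Path("/data/sample.log");  // hypothetical file

        // HDFS splits each file into blocks; each block is stored (and
        // replicated) on the local disks of different datanodes.
        FileStatus status = fs.getFileStatus(file);
        BlockLocation[] blocks = fs.getFileBlockLocations(status, 0, status.getLen());
        for (BlockLocation block : blocks) {
            System.out.println("offset " + block.getOffset()
                    + " stored on " + String.join(", ", block.getHosts()));
        }
        fs.close();
    }
}
```

Because Hadoop schedules map tasks close to those block replicas, most reads stay on local disks; it is the shuffle phase and any non-local reads that generate the backbone traffic described above.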
In most cases, server-based local storage is not the picture of efficiency, which is why many organizations turn to SANs that use a high-speed fabric to maximize throughput. However, the SAN approach might not lend itself well to big data implementations -- especially those using Hadoop -- because a SAN centralizes the data on its spindles, which means each compute server must cross the fabric to the same SAN to retrieve data that Hadoop would normally distribute across the cluster.
Nevertheless, when comparing local server storage to SAN-based storage for Hadoop, local storage wins in two very important ways: cost and overall performance. Simply put, raw disks without RAID placed in each compute node will collectively outperform a SAN when processing requests under HDFS. However, there is a downside to server-based disks, and it comes in the form of scalability.
The question becomes how to add more capacity when the servers rely on local storage. Typically, there are two ways to handle that dilemma: add more servers with their own local storage, or increase the capacity of the existing member servers. Both options require purchasing and provisioning hardware, which can introduce downtime and may force a redesign of the architecture in place. Nonetheless, either approach should prove significantly cheaper than adding capacity to a SAN, and that is a notable benefit.
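As one illustration of when that decision point arrives, the sketch below uses the Hadoop Java client to read the cluster's aggregate capacity; the NameNode URI and the 80 percent threshold are assumptions for illustration, not prescriptions from this article.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.FsStatus;

public class CapacityCheck {
    public static void main(String[] args) throws Exception {
        // Assumed NameNode address; adjust for your cluster.
        Configuration conf = new Configuration();
        conf.set("fs.defaultFS", "hdfs://namenode:8020");

        FileSystem fs = FileSystem.get(conf);
        FsStatus status = fs.getStatus();  // aggregate DFS capacity across all datanodes

        double usedPct = 100.0 * status.getUsed() / status.getCapacity();
        System.out.printf("capacity: %d GB, used: %.1f%%%n",
                status.getCapacity() / (1024L * 1024 * 1024), usedPct);

        // Hypothetical threshold: flag the cluster for expansion
        // (more datanodes, or more disks per datanode) at 80% full.
        if (usedPct > 80.0) {
            System.out.println("Consider adding datanodes or local disks.");
        }
        fs.close();
    }
}
```

In practice, operations teams watch the same figures through the NameNode web interface or their monitoring tools; the point is that a local-storage cluster grows by adding datanodes or disks rather than by expanding a central array.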
However, there are other options for storage when it comes to Hadoop. For example, several leading storage vendors are building storage appliances specifically designed for Hadoop and big data analytics. That list includes EMC, which is now offering Hadoop solutions such as the Greenplum HD Data Computing Appliance. Oracle is looking to take it one step further with the Exadata series of appliances, which offer compute power as well as high-speed storage.
Finally, another option exists for those looking to leverage big data, and that comes in the form of the cloud. Companies such as Cloudera, Microsoft, Amazon and many others are offering cloud-based big data solutions, which provide processing power, storage and support.
Making a decision about a big-data storage solution comes down to how much space is needed, how frequently analytics will be performed and what type of data is to be processed. Those factors, as well as security, budget and processing time, should all be considered before investing in big data.
It is probably safe to say that a pilot project is a good starting point, and commodity hardware makes that pilot a low-cost investment.
Frank J. Ohlhorst is a New York-based technology journalist and IT business consultant.