Disk storage is a lot like closet space -- you can never have enough. Nowhere is this truer than in the world of big data. The very name -- "big data" -- implies more data than a typical storage platform can handle. So where exactly does this leave the ever-vigilant CIO? With a multitude of decisions to make and very little information to go by.
However, wading through the storage options for big data does not have to be an impossible journey. It all comes down to combining some basic understanding of the challenge with a little common sense and a sprinkle of budgetary constraint.
What makes big data a big deal
First of all, it is important to understand how big data differs from other forms of data and how the associated technologies (mostly analytics applications) work with it. In itself, big data is a generic term that simply means there is too much data to deal with using standard storage technologies. However, there is much more to it than that -- big data can consist of terabytes (or even petabytes) of information that combines structured data (databases, logs, SQL and so on) with unstructured data (social media posts, sensors, multimedia). What's more, much of that data can lack indexes or other organizational structures, and may consist of many different file types.
That circumstance greatly complicates dealing with big data. The lack of consistency rules out standard processing and storage techniques, while the operational overhead and sheer volume make the data difficult to process efficiently with the standard server-and-SAN approach. In other words, big data requires something different: its own platform, and that is where Hadoop comes into the picture.
Hadoop is an open source project that offers a way to build a platform that consists of commodity hardware (servers and internal server storage) formed into a cluster that can process big data requests in parallel. On the storage side, the key component of the project is the Hadoop Distributed File System (HDFS), which has the capability to store very large files across multiple members in a cluster. HDFS works by creating multiple replicas of data blocks and distributing them across compute nodes throughout a cluster, which facilitates reliable, extremely rapid computations.
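To make the replication idea concrete, here is a minimal Python sketch of how a large file might be split into blocks and copied to several nodes. It is an illustration only, not HDFS code: the function names are hypothetical, the 128 MB block size and replication factor of 3 are commonly cited HDFS defaults used here as assumptions, and the round-robin placement is a deliberate simplification of HDFS's real rack-aware policy.

```python
import math

BLOCK_SIZE_MB = 128  # assumption: a commonly cited HDFS default block size
REPLICATION = 3      # assumption: a commonly cited HDFS default replication factor


def split_into_blocks(file_size_mb, block_size_mb=BLOCK_SIZE_MB):
    """Return how many fixed-size blocks are needed to hold the file."""
    return math.ceil(file_size_mb / block_size_mb)


def place_replicas(num_blocks, nodes, replication=REPLICATION):
    """Map each block index to `replication` distinct nodes.

    Simplified round-robin placement (hypothetical); real HDFS also
    weighs rack topology and free space when choosing nodes.
    """
    if replication > len(nodes):
        raise ValueError("need at least as many nodes as replicas")
    placement = {}
    for block in range(num_blocks):
        # Start at a different node for each block, then take the next
        # `replication` nodes, wrapping around the end of the list.
        placement[block] = [
            nodes[(block + i) % len(nodes)] for i in range(replication)
        ]
    return placement


if __name__ == "__main__":
    cluster = ["node1", "node2", "node3", "node4"]
    blocks = split_into_blocks(1000)  # a 1,000 MB file
    for block, holders in place_replicas(blocks, cluster).items():
        print(f"block {block}: {holders}")
```

Because every block lives on three different machines in this sketch, the loss of any single node leaves at least two copies of each block available, and several nodes can work on different blocks of the same file at once.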
All things considered so far, it would seem that the easiest way to build an adequate storage platform for big data would be to purchase a set of commodity servers and equip each with a few terabyte-level drives and then let Hadoop do the rest. For a few smaller enterprises, it may be just as simple as that. However, once processing performance, algorithm complexity and data mining enter the picture, a commodity approach may not be sufficient to guarantee success.
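Even the simple commodity approach deserves some back-of-the-envelope arithmetic, because replication means raw disk space is not usable space. The sketch below is a rough estimate under stated assumptions: the function name is hypothetical, the replication factor of 3 is a commonly cited HDFS default, and the 25 percent reserve for the operating system and temporary job data is an illustrative figure rather than a recommendation.

```python
def usable_capacity_tb(nodes, drives_per_node, drive_tb,
                       replication=3, reserve_fraction=0.25):
    """Estimate how many terabytes of unique data a cluster can hold.

    replication      -- copies kept of every block (assumed default: 3)
    reserve_fraction -- share of raw disk set aside for the OS and
                        temporary data (illustrative assumption)
    """
    raw_tb = nodes * drives_per_node * drive_tb
    # Remove the reserved share, then divide by the number of copies,
    # since each unique block consumes `replication` times its size.
    return raw_tb * (1 - reserve_fraction) / replication


if __name__ == "__main__":
    # Ten servers, each with four 2 TB drives: 80 TB raw.
    print(usable_capacity_tb(nodes=10, drives_per_node=4, drive_tb=2))
```

Under these assumptions, 80 TB of raw disk works out to roughly 20 TB of unique data, which is worth keeping in mind when budgeting for "a few terabyte-level drives" per server.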