Storage technology has evolved and matured to the point where it has started to approach commodity status in many data centers. Nevertheless, today's enterprises are faced with evolving needs that can strain storage technologies -- a case in point is the push for big data analytics, an initiative that brings business intelligence (BI) capabilities to large data sets.
However, the big data analytics process demands capabilities that are usually beyond typical storage paradigms -- simply put, traditional storage technologies, such as SANs, NAS and others, cannot natively deal with the terabytes and petabytes of unstructured information that come with the big data challenge. Success with big data analytics demands something more -- a new way to deal with large volumes of data -- in other words, a new storage platform ideology.
Let's hear it for Hadoop
Enter Hadoop, an open source project that offers a platform to work with big data. Although Hadoop has been around for some time, more and more businesses are just now starting to leverage its capabilities.
The Hadoop platform is designed to solve problems caused by massive amounts of data, especially data that contain a mixture of complex, unstructured and structured information, which does not lend itself well to being placed in tables. Hadoop works well in situations that require support for analytics that are deep and computationally intensive, like clustering and targeting.
So what exactly does Hadoop mean for IT professionals seeking to leverage big data? The simple answer is that Hadoop solves the most common problem associated with big data: efficiently storing and accessing large amounts of data.
The intrinsic design of Hadoop allows it to run as a platform that is able to work across a large number of machines that don't share any memory or disks. With that in mind, it becomes easy to see how Hadoop offers additional value -- network managers can simply buy a number of commodity servers, place them in a rack, and run the Hadoop software on each one.
What's more, Hadoop helps to remove much of the management overhead associated with large data sets. Operationally, as an organization's data is being loaded into a Hadoop platform, the software breaks down the data into manageable pieces, which are then automatically spread across different servers. The distributed nature of the data means there is no single place to go to access the data. Hadoop keeps track of where the data resides, and further protects that information by storing multiple copies of each piece on different servers. Resiliency is enhanced, because if a server goes offline or fails, the data can be automatically replicated from a known good copy.
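To make the split, spread and replicate cycle concrete, here is a minimal Python sketch of the idea. It is a toy model rather than Hadoop's actual code or API: the `ToyCluster` class, the four-byte block size, the round-robin placement and the replication factor of three are all assumptions chosen for illustration, and real clusters use far larger blocks and smarter placement rules.

```python
BLOCK_SIZE = 4    # bytes per block; tiny on purpose, an illustrative assumption
REPLICATION = 3   # copies kept of each block; an illustrative assumption


def split_into_blocks(data: bytes, block_size: int = BLOCK_SIZE) -> list:
    """Break the data into fixed-size pieces (the last one may be shorter)."""
    return [data[i:i + block_size] for i in range(0, len(data), block_size)]


class ToyCluster:
    """Hypothetical stand-in for a cluster of servers that share nothing."""

    def __init__(self, node_names):
        self.nodes = {name: {} for name in node_names}  # node -> {block_id: bytes}
        self.locations = {}                             # block_id -> set of nodes

    def load(self, data: bytes, replication: int = REPLICATION) -> int:
        """Split the data and place each block on `replication` distinct nodes."""
        names = sorted(self.nodes)
        copies = min(replication, len(names))
        blocks = split_into_blocks(data)
        for block_id, block in enumerate(blocks):
            # Simple round-robin placement; real placement policies differ.
            targets = [names[(block_id + k) % len(names)] for k in range(copies)]
            for target in targets:
                self.nodes[target][block_id] = block
            self.locations[block_id] = set(targets)
        return len(blocks)

    def fail_node(self, name: str, replication: int = REPLICATION) -> None:
        """Drop a node, then restore lost copies from a surviving good copy."""
        del self.nodes[name]
        for block_id, holders in self.locations.items():
            holders.discard(name)
            if not holders:
                continue  # every copy was lost; nothing left to copy from
            source = next(iter(holders))
            for candidate in sorted(self.nodes):
                if len(holders) >= replication:
                    break
                if candidate not in holders:
                    self.nodes[candidate][block_id] = self.nodes[source][block_id]
                    holders.add(candidate)

    def read(self) -> bytes:
        """Reassemble the data by fetching each block from any node holding it."""
        parts = []
        for block_id in sorted(self.locations):
            holder = next(iter(self.locations[block_id]))
            parts.append(self.nodes[holder][block_id])
        return b"".join(parts)


if __name__ == "__main__":
    cluster = ToyCluster(["node1", "node2", "node3", "node4"])
    cluster.load(b"big data, small blocks")
    cluster.fail_node("node2")   # simulate a server going offline
    print(cluster.read())        # the data is still readable from other copies
```

The `locations` table plays the role of the bookkeeping described above: it is the only place that knows which servers hold which piece, and it is what lets the toy cluster find a surviving copy after a node disappears.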
How Hadoop goes further
The Hadoop paradigm goes several steps further when it comes to working with data. Take, for example, the limitations associated with a traditional, centralized database system, which may consist of a large disk drive connected to a server-class system that features multiple processors. In that scenario, analytics is limited by the performance of the disk and, ultimately, the number of processors that can be brought to bear.