With a Hadoop deployment, every server in the cluster can participate in processing because Hadoop spreads both the work and the data across the cluster. An indexing job, for example, works by sending code to each of the servers in the cluster, and each server then operates on its own small slice of the data; the results are then delivered back as a unified whole. In Hadoop, that process is referred to as MapReduce: the work is mapped out to all of the servers, and their intermediate results are reduced into a single result set.
That process is what makes Hadoop so good at dealing with large amounts of data. Hadoop spreads the data out and can handle complex computational questions by harnessing all of the available cluster processors to work in parallel.
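To make that concrete, the canonical illustration is a word-count job, which follows the pattern of the standard Hadoop MapReduce tutorial: the map step runs on each node against its local slice of the input and emits intermediate key-value pairs, and the reduce step consolidates those pairs into one result per key. The sketch below is illustrative rather than part of any particular deployment; the input and output paths are supplied as command-line arguments.

import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

  // Map step: each node runs this over its local block of the input,
  // emitting a (word, 1) pair for every word it sees.
  public static class TokenizerMapper
      extends Mapper<Object, Text, Text, IntWritable> {

    private final static IntWritable one = new IntWritable(1);
    private final Text word = new Text();

    public void map(Object key, Text value, Context context)
        throws IOException, InterruptedException {
      StringTokenizer itr = new StringTokenizer(value.toString());
      while (itr.hasMoreTokens()) {
        word.set(itr.nextToken());
        context.write(word, one);
      }
    }
  }

  // Reduce step: the framework groups all counts for the same word,
  // and this reducer sums them into a single total per word.
  public static class IntSumReducer
      extends Reducer<Text, IntWritable, Text, IntWritable> {

    private final IntWritable result = new IntWritable();

    public void reduce(Text key, Iterable<IntWritable> values, Context context)
        throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable val : values) {
        sum += val.get();
      }
      result.set(sum);
      context.write(key, result);
    }
  }

  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    Job job = Job.getInstance(conf, "word count");
    job.setJarByClass(WordCount.class);
    job.setMapperClass(TokenizerMapper.class);
    job.setCombinerClass(IntSumReducer.class);
    job.setReducerClass(IntSumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));   // input directory in HDFS
    FileOutputFormat.setOutputPath(job, new Path(args[1])); // output directory in HDFS
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}

The same pattern scales from two nodes to thousands: the framework, not the application code, decides where each map task runs and how the reduce output is assembled.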
Understanding Hadoop and extract, transform, and load
However, venturing into the world of Hadoop is not a plug-and-play experience. There are prerequisites, hardware requirements and configuration chores that must be addressed to ensure success. The first step is understanding and defining the analytics process. Luckily, most IT leaders are familiar with business analytics (BA) and business intelligence (BI) processes and can relate to the most common process layer used -- the extract, transform and load (ETL) layer -- and the critical role it plays in building BA/BI solutions.
Big data analytics requires that organizations choose the data to analyze, consolidate it and then apply aggregation methods before the data can be subjected to the ETL process. What's more, that has to happen with large volumes of data, which can be structured or unstructured and can come from multiple sources, such as social networks, data logs, websites, mobile devices and sensors.
Hadoop accomplishes that by incorporating pragmatic processes and considerations, such as a fault-tolerant clustered architecture and the capability to move computing power closer to the data and perform parallel and/or batch processing of large data sets. It also provides an open ecosystem that supports enterprise architecture layers from data storage to analytics processes.
Not all enterprises require the capabilities that big data analytics has to offer. Those that do, however, must consider Hadoop's ability to meet the challenge. Hadoop cannot accomplish everything on its own -- enterprises will need to decide which additional components are required to build out a Hadoop project.
For example, a starter set of Hadoop components may consist of HDFS and HBase for data management, MapReduce and Oozie as a processing framework, Pig and Hive as development frameworks for developer productivity and open source Pentaho for BI.
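As a rough illustration of how the data-management layer is used from application code, the sketch below writes and reads a single row through the standard HBase Java client. The table name ("clickstream"), column family and row key are hypothetical, and it assumes the table already exists and that an hbase-site.xml pointing at the cluster is on the classpath.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

public class HBaseQuickCheck {
  public static void main(String[] args) throws Exception {
    // Reads hbase-site.xml from the classpath to locate the cluster.
    Configuration conf = HBaseConfiguration.create();

    try (Connection connection = ConnectionFactory.createConnection(conf);
         // Hypothetical table, assumed to have been created ahead of time.
         Table table = connection.getTable(TableName.valueOf("clickstream"))) {

      // Write one row: row key, column family "d", qualifier "url".
      Put put = new Put(Bytes.toBytes("user42#2012-06-01T12:00"));
      put.addColumn(Bytes.toBytes("d"), Bytes.toBytes("url"),
          Bytes.toBytes("/products/widget"));
      table.put(put);

      // Read the same row back.
      Get get = new Get(Bytes.toBytes("user42#2012-06-01T12:00"));
      Result result = table.get(get);
      byte[] url = result.getValue(Bytes.toBytes("d"), Bytes.toBytes("url"));
      System.out.println("url = " + Bytes.toString(url));
    }
  }
}

Tools such as Pig, Hive and Pentaho sit above this layer, so most analysts never write this kind of code directly, but it shows the low-level building block the rest of the stack relies on.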
From a hardware perspective, a pilot project does not require massive amounts of equipment. Requirements can be as modest as a pair of servers with multiple cores, 24 or more gigabytes of RAM and a dozen or so hard disk drives of two terabytes each, which should prove sufficient to get a pilot project off the ground.
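As a rough storage sanity check: a dozen 2 TB drives per server gives each node about 24 TB of raw disk, or roughly 48 TB across the pair. Because HDFS stores multiple copies of every block (typically two on a two-node pilot, three in production), usable capacity works out to about a half to a third of that raw figure before allowing for intermediate data and operating system overhead.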
However, be forewarned that effective management and implementation of Hadoop require some expertise and experience. If that expertise is not readily available, IT management should consider partnering with a service provider that can offer full support for the Hadoop project. That expertise proves especially important when it comes to security: Hadoop, HDFS and HBase offer very little in the way of integrated security, so the data still needs additional protection against compromise or theft.
All things considered, an in-house Hadoop project makes the most sense for a pilot test of big data analytics capabilities. After the pilot, a plethora of commercial or hosted solutions are available to those who want to tread further into the realm of big data analytics.
Frank J. Ohlhorst is a New York-based technology journalist and IT business consultant.