There are many proprietary and open source resources for these tools, often from startups but also from established cloud technology companies such as Amazon.com and Google -- in fact, use of the cloud helps solve the big data scalability issue, both for data storage and computational capability. However, big data does not necessarily have to be a "roll your own" type of deployment. Large vendors such as IBM and EMC offer tools for big data projects, though their costs can be high and hard to justify.
Hadoop: The core of most big data efforts
In the open source realm, the big name is Hadoop, a project administered by the Apache Software Foundation that consists of Google-derived technologies for building a platform to consolidate, combine, and understand data.
Technically, Hadoop consists of two key services: reliable data storage using the Hadoop Distributed File System (HDFS) and high-performance parallel data processing using a technique called MapReduce. Together, these services provide a foundation for fast, reliable analysis of both structured and complex data. In many cases, enterprises deploy Hadoop alongside their legacy IT systems, which lets them combine old and new data sets in powerful new ways. Hadoop also allows enterprises to easily explore complex data using custom analyses tailored to their information and questions.
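The division of labor described above -- map the data where it lives, shuffle the intermediate results by key, then reduce each group -- can be sketched in pure Python with a toy word-count job. The input "splits" and all names here are illustrative; a real cluster would read the splits from HDFS blocks and run the map tasks in parallel across nodes.

```python
from collections import defaultdict
from itertools import chain

# Illustrative input: on a cluster these lines would live in HDFS blocks
# spread across many nodes; here we fake two "splits" in memory.
splits = [
    ["big data needs big storage", "hadoop stores big data"],
    ["mapreduce processes big data in parallel"],
]

def map_phase(lines):
    """Map step: emit a (word, 1) pair for every word in every line."""
    for line in lines:
        for word in line.split():
            yield (word, 1)

# Each split is mapped independently -- on a cluster, in parallel on the
# nodes that hold the data.
mapped = chain.from_iterable(map_phase(split) for split in splits)

# Shuffle step: group all intermediate values by key, as the framework
# does between the map and reduce phases.
groups = defaultdict(list)
for key, value in mapped:
    groups[key].append(value)

# Reduce step: sum the counts for each word.
counts = {word: sum(values) for word, values in groups.items()}

print(counts["big"])  # "big" appears 4 times across both splits
```

The point of the structure is that the map and reduce steps never share state, which is what lets Hadoop scale them out across shared-nothing servers and rerun them on failure.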
Hadoop runs on a collection of commodity, shared-nothing servers. You can add or remove servers in a Hadoop cluster at will; the system detects and compensates for hardware or system problems on any server. Hadoop, in other words, is self-healing. It can deliver data -- and run large-scale, high-performance processing jobs -- in spite of system changes or failures.
Although Hadoop provides a platform for data storage and parallel processing, the real value comes from add-ons, cross-integration, and custom implementations of the technology. To that end, Hadoop offers subprojects, which add functionality and new capabilities to the platform:
- Hadoop Common: The common utilities that support the other Hadoop subprojects.
- Chukwa: A data collection system for managing large distributed systems.
- HBase: A scalable, distributed database that supports structured data storage for large tables.
- HDFS: A distributed file system that provides high throughput access to application data.
- Hive: A data warehouse infrastructure that provides data summarization and ad hoc querying.
- MapReduce: A software framework for distributed processing of large data sets on compute clusters.
- Pig: A high-level data-flow language and execution framework for parallel computation.
- ZooKeeper: A high-performance coordination service for distributed applications.
Most implementations of a Hadoop platform will include at least some of these subprojects, as they are often necessary for exploiting big data. For example, most organizations will choose HDFS as the primary distributed file system and HBase as the database, which can store billions of rows of data. And the use of MapReduce is almost a given, since its engine is what distributes a processing job across the cluster's servers.
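In practice, many MapReduce jobs are written as small scripts run through Hadoop Streaming, where the mapper and reducer simply read stdin and write tab-separated key/value lines. The sketch below simulates that pipeline locally, with `sorted()` standing in for the framework's shuffle phase; the input lines are illustrative, and on a real cluster each function would run as a separate process launched via the Streaming jar.

```python
from itertools import groupby

def mapper(lines):
    """Streaming-style map step: emit one 'word<TAB>1' record per word."""
    for line in lines:
        for word in line.split():
            yield f"{word}\t1"

def reducer(records):
    """Streaming-style reduce step: sum counts per key.

    Records arrive sorted by key, so consecutive records with the same
    key can be grouped and summed in a single pass.
    """
    for key, group in groupby(records, key=lambda r: r.split("\t")[0]):
        total = sum(int(r.split("\t")[1]) for r in group)
        yield f"{key}\t{total}"

# Local simulation: sort stands in for the framework's shuffle phase.
input_lines = [
    "hive queries hbase",
    "pig scripts drive mapreduce",
    "hive uses mapreduce",
]
output = list(reducer(sorted(mapper(input_lines))))
print(output)
```

Because the contract is just "lines in, lines out," the same mapper and reducer scripts work unchanged whether they are tested locally like this or handed to the cluster.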