With a billion users and a requirement to analyze more than 105 terabytes of data every 30 minutes, Facebook's appetite for crunching data has reached Godzilla-like proportions.
Much of that data -- including data related to 2.7 billion Likes and 2.5 billion content items shared per day -- is devoured and analyzed in order to optimize product features and Facebook advertising performance. Hadoop is the key tool Facebook uses, not simply for analysis, but as an engine to power many features of the Facebook site, including messaging. That multitude of monster workloads drove the company to launch its Prism project, which supports geographically distributed Hadoop data stores.
Thanks to such techniques as processing the results of A/B testing on Hadoop, Facebook can determine the efficacy of features or advertisements against each other for a very specific geography or demographic, including gender, age, interests, and so on. With positive results, the company can tweak features and improve targeting.
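The core of such an A/B comparison is simple once Hadoop has aggregated the raw logs. Here is a minimal Python sketch of the kind of per-segment comparison described above; all field names and records are illustrative, not Facebook's actual schema or pipeline:

```python
# Hypothetical sketch of an A/B comparison over Hadoop output:
# click-through rate per variant within one demographic slice.
# All field names and records are invented for illustration.
from collections import defaultdict

impressions = [
    # (variant, gender, age_bucket, clicked)
    ("A", "f", "18-24", True),
    ("A", "f", "18-24", False),
    ("B", "f", "18-24", True),
    ("B", "f", "18-24", True),
]

def ctr_by_variant(rows, gender, age_bucket):
    """Click-through rate per variant for one demographic segment."""
    shown = defaultdict(int)
    clicks = defaultdict(int)
    for variant, g, age, clicked in rows:
        if g == gender and age == age_bucket:
            shown[variant] += 1
            clicks[variant] += clicked  # bool counts as 0 or 1
    return {v: clicks[v] / shown[v] for v in shown}

print(ctr_by_variant(impressions, "f", "18-24"))  # {'A': 0.5, 'B': 1.0}
```

In practice the segmenting and counting would run as distributed jobs over billions of rows; the comparison step at the end is the same.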
Facebook's business analysts push the business in a variety of ways, and they rely heavily on Hive, an open source project Facebook created that has become the company's most widely used access layer for querying Hadoop with a subset of SQL. Because Hive speaks SQL, analysts can pair Hadoop with standard business intelligence tools. To make things even easier for business users, the company built HiPal, a homegrown, closed source graphical tool that talks to Hive and supports data discovery, query authoring, charting, and dashboard creation.
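Hive's job is to compile SQL-like queries into Hadoop MapReduce jobs. The sketch below pairs a hypothetical HiveQL query (table and column names invented) with a pure-Python map/reduce toy that computes the same aggregate, to show what the translation amounts to:

```python
# A hypothetical HiveQL query and a toy map/reduce equivalent.
# Table and column names are invented for illustration.
from itertools import groupby
from operator import itemgetter

hiveql = """
SELECT country, COUNT(*) AS likes
FROM like_events
GROUP BY country
"""

def map_phase(events):
    # Emit one (key, 1) pair per like event -- the "map" step.
    return [(e["country"], 1) for e in events]

def reduce_phase(pairs):
    # Sort by key (the "shuffle"), then sum counts per key -- the "reduce" step.
    pairs = sorted(pairs, key=itemgetter(0))
    return {k: sum(v for _, v in grp)
            for k, grp in groupby(pairs, key=itemgetter(0))}

events = [{"country": "US"}, {"country": "BR"}, {"country": "US"}]
print(reduce_phase(map_phase(events)))  # {'BR': 1, 'US': 2}
```

Writing the three-line query instead of the map and reduce functions is the productivity win Hive offers analysts.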
In terms of raw Hadoop capacity, Facebook has pushed up against the practical limit. The company recently declared itself the owner of what is likely the world's largest Hadoop cluster, weighing in at 100 petabytes. And yet, the company says, that's not big enough. Facebook's Prism project is its answer for reaching new heights in Hadoop capacity.
The problem is that Hadoop confines data to a single physical data center. Although Hadoop is a batch processing system, it is tightly coupled: it will not tolerate more than a few milliseconds of delay among servers in a cluster. Prism adds a logical abstraction layer so that a Hadoop cluster can span multiple data centers, effectively removing the limits on capacity.
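Prism's internals haven't been published, so the following is only a toy illustration of the idea of a logical layer in front of multiple physical clusters; every class, method, and path name here is invented:

```python
# Toy illustration (all names invented) of a logical abstraction layer:
# each file path is routed to whichever physical cluster owns its
# namespace, so callers see one cluster spanning several data centers.

class PhysicalCluster:
    """Stand-in for a single-data-center Hadoop cluster."""
    def __init__(self, name):
        self.name = name
        self.files = {}  # path -> data

class LogicalCluster:
    """Routes paths to physical clusters by namespace prefix."""
    def __init__(self, routing):
        self.routing = routing  # prefix -> PhysicalCluster

    def _cluster_for(self, path):
        for prefix, cluster in self.routing.items():
            if path.startswith(prefix):
                return cluster
        raise KeyError(f"no cluster owns {path}")

    def write(self, path, data):
        self._cluster_for(path).files[path] = data

    def read(self, path):
        return self._cluster_for(path).files[path]

east = PhysicalCluster("east-dc")
west = PhysicalCluster("west-dc")
fs = LogicalCluster({"/ads/": east, "/messages/": west})
fs.write("/ads/report.tsv", "ctr=0.5")
print(fs.read("/ads/report.tsv"))
```

The point of the sketch is only that clients address a single namespace while data physically lives in different data centers; how Prism actually partitions, replicates, and schedules work across sites is not public.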
This visualization, courtesy of the University of Nebraska, shows data transfers inside a Hadoop cluster. The streams going straight up represent transfers to and from the Internet, while the color coding shows processing activity -- the redder the segment, the more active it is.
Facebook says it will open-source Prism soon. For the business world, this could be as big a move as Hadoop's original open source release back in 2006, when Yahoo was its major backer. Although it's not clear just how practical Prism would be for all but the largest companies, the same questions were raised about Hadoop and NoSQL just a short time ago.
There's a certain urgency behind Facebook's technology development given the company's disappointing Wall Street performance, tempered only recently with good news about mobile revenue. Whether the company can continue to monetize mobile or generate sufficient revenue from fee-based services to supplement advertising are pressing questions. Whatever the model, Hadoop-driven analytics will be the company's big data technology of choice. And with new projects like Prism, limits that would have seemed unfathomable just a few years ago are falling aside.
This article, "Facebook pushes the limits of Hadoop," was originally published at InfoWorld.com. Read more of Andrew Lampitt's Think Big Data blog, and keep up on the latest developments in big data at InfoWorld.com.