Twitter's hundreds of millions of users share the latest news, ideas, and opinions with each other in real time, making it one of the Web's most popular communication platforms. All that activity renders Twitter a gold mine of metadata -- and has turned Twitter analytics team into a major Hadoop shop that analyzes petabytes of data from in excess of 175 million daily tweets, ad campaigns, and more. The goal of all this analysis is to improve Twitter service for both end-users and advertisers.
With Hadoop, developers define MapReduce jobs that allow a narrowly prescribed task -- typically an analytic task -- to be crunched for immense sets of data across a set of servers. A big challenge for Twitter was that, by default, programming for MapReduce requires special skills beyond Java chops. Developers need to rethink the way they code -- a common criticism of Hadoop.
Furthermore, the Twitter teams needed to be able to perform complex computations, machine learning, and linear algebra, none of which are necessarily intuitive to write in MapReduce for an average developer.
Requiring sophisticated statistical functions for any of their developers to easily implement, Twitter turned to Concurrent's Cascading framework, specifically designed with the creation of big data applications in mind.
Cascading provides a higher level of abstraction for Hadoop, allowing developers to create complex jobs quickly, easily, and in several different languages that run in the JVM, including Ruby, Scala, and more. In effect, this has shattered the skills barrier, enabling Twitter to use Hadoop more broadly.
Today, Cascading enables Twitter developers to create complex data processing workflows in their favorite programming languages while easily scaling to handle petabytes of data. Twitter has signed a contributor agreement with Concurrent so that Twitter's open source contributions are easily applied to the Cascading project.
Three teams dive deep
The revenue team helps advertisers determine which ads are most effective by analyzing the contents of ads, Twitter topics, and so on to help increase conversion rates. They wrote Scalding, the open source Scala API for Cascading, so that developers can write in Scala and run on Hadoop.
The publisher analytics team helps webmasters understand how Twitter users are engaging with information on brands, websites, and online influencers. They built and open-sourced Cascalog, a Clojure-based language that uses Cascading as the job execution engine.
The analytics team's mission is to understand Twitter user activity. They needed a way to make it easier to perform detailed and complex analysis on users following other users or users who follow the same people. They created PyCascading, a Python wrapper for Cascading to control the full data processing workflow from Python.
In all these cases, Cascading shields developers from the underlying complexity of writing, optimizing, and executing MapReduce jobs. It allows each team to deliver highly complex information and functionality needed by the business quickly and efficiently. The end result of all this amounts to important insights Twitter can use to continuously improve its wildly popular service.
This article, "Twitter's programmers speed Hadoop development," was originally published at InfoWorld.com. Read more of Andrew Lampitt's Think Big Data blog, and keep up on the latest developments in big data at InfoWorld.com For the latest business technology news, follow InfoWorld.com on Twitter.