Twitter's hundreds of millions of users share the latest news, ideas, and opinions with each other in real time, making it one of the Web's most popular communication platforms. All that activity renders Twitter a gold mine of metadata -- and has turned Twitter's analytics team into a major Hadoop shop, one that analyzes petabytes of data from more than 175 million daily tweets, ad campaigns, and more. The goal of all this analysis is to improve Twitter's service for both end users and advertisers.
With Hadoop, developers define MapReduce jobs that allow a narrowly prescribed task -- typically an analytic task -- to be run against immense data sets spread across a cluster of servers. A big challenge for Twitter was that, by default, programming for MapReduce requires special skills beyond Java chops. Developers need to rethink the way they code -- a common criticism of Hadoop.
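To illustrate why MapReduce demands that rethinking, here is a minimal sketch of the programming model itself -- a word count expressed as separate map, shuffle, and reduce phases. This is a hypothetical in-memory illustration of the model, not Hadoop's actual Java API; the function names are invented for clarity.

```python
from collections import defaultdict

def map_phase(line):
    # Map: emit a (key, value) pair for every word in one input line.
    return [(word, 1) for word in line.split()]

def shuffle(pairs):
    # Shuffle: group values by key, as the framework does between map and reduce.
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(key, values):
    # Reduce: sum the counts for one word.
    return key, sum(values)

lines = ["hadoop crunches data", "cascading abstracts hadoop"]
pairs = [pair for line in lines for pair in map_phase(line)]
counts = dict(reduce_phase(k, v) for k, v in shuffle(pairs).items())
# counts["hadoop"] == 2
```

Even for a task this simple, the developer must decompose the logic into framework-shaped phases -- the mental shift the criticism above refers to.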
Furthermore, the Twitter teams needed to perform complex computations, machine learning, and linear algebra, none of which is intuitive for the average developer to express as MapReduce jobs.
Needing sophisticated statistical functions that any of its developers could easily implement, Twitter turned to Concurrent's Cascading framework, which was designed specifically for building big data applications.
Cascading provides a higher level of abstraction over Hadoop, allowing developers to create complex jobs quickly and easily, in any of several languages that run on the JVM, including Ruby, Scala, and more. In effect, this has shattered the skills barrier, enabling Twitter to use Hadoop more broadly.
Today, Cascading enables Twitter developers to create complex data processing workflows in their favorite programming languages while easily scaling to petabytes of data. Twitter has signed a contributor agreement with Concurrent so that Twitter's open source contributions can be folded back into the Cascading project.