Twitter's programmers speed Hadoop development

Cascading simplifies Twitter's MapReduce operations in Scala, Clojure, and Jython

Twitter's hundreds of millions of users share the latest news, ideas, and opinions with each other in real time, making it one of the Web's most popular communication platforms. All that activity makes Twitter a gold mine of metadata -- and has turned Twitter's analytics team into a major Hadoop shop that analyzes petabytes of data from more than 175 million daily tweets, ad campaigns, and more. The goal of all this analysis is to improve Twitter's service for both end-users and advertisers.

With Hadoop, developers define MapReduce jobs that run a narrowly defined task -- typically an analytic one -- over immense data sets spread across a cluster of servers. A big challenge for Twitter was that, by default, programming for MapReduce requires special skills beyond Java chops. Developers need to rethink the way they code -- a common criticism of Hadoop.
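To make that shift in thinking concrete, here is a minimal sketch of the MapReduce model in plain Python. It is not Hadoop's Java API, and it runs in memory on one machine. The function names (`map_phase`, `shuffle`, `reduce_phase`) and the sample tweets are assumptions invented for illustration. The point is only that even a simple word count has to be broken into a map step, a grouping step, and a reduce step.

```python
from collections import defaultdict


def map_phase(records):
    """Emit a (word, 1) pair for every word in every record."""
    for record in records:
        for word in record.lower().split():
            yield word, 1


def shuffle(pairs):
    """Group values by key, as a framework would between map and reduce."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(grouped):
    """Collapse each key's list of values into a single count."""
    return {key: sum(values) for key, values in grouped.items()}


def word_count(records):
    """Chain the three phases; Hadoop would distribute each across servers."""
    return reduce_phase(shuffle(map_phase(records)))


# Hypothetical sample input, not real Twitter data.
tweets = ["hadoop is big", "big data is big"]
counts = word_count(tweets)
print(counts)
```

On a real cluster the map and reduce functions run on many machines and the framework handles the grouping in between, which is why ordinary sequential code cannot simply be dropped in.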

Furthermore, the Twitter teams needed to be able to perform complex computations, machine learning, and linear algebra, none of which the average developer finds intuitive to express in MapReduce.

Taming Hadoop

Because it needed sophisticated statistical functions that any of its developers could implement easily, Twitter turned to Concurrent's Cascading framework, which was designed specifically for building big data applications.

Cascading provides a higher level of abstraction for Hadoop, allowing developers to create complex jobs quickly, easily, and in several different languages that run on the JVM, including Ruby, Scala, and more. In effect, this has shattered the skills barrier, enabling Twitter to use Hadoop more broadly.
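As a rough illustration of what a higher level of abstraction means here, the sketch below wraps the map, group, and reduce steps behind a small fluent pipeline. The `Pipe` class and its methods are hypothetical and are not the Cascading API. This toy version runs eagerly in memory, whereas a real framework would plan the chained steps into one or more MapReduce jobs.

```python
from collections import defaultdict


class Pipe:
    """Hypothetical fluent pipeline for illustration; NOT the Cascading API."""

    def __init__(self, records):
        self._records = list(records)

    def flat_map(self, fn):
        # Plays the role of a map step: each record yields zero or more outputs.
        return Pipe(out for rec in self._records for out in fn(rec))

    def group_by(self, key_fn):
        # Plays the role of the shuffle: collect records that share a key.
        groups = defaultdict(list)
        for rec in self._records:
            groups[key_fn(rec)].append(rec)
        return Pipe(groups.items())

    def map_values(self, fn):
        # Plays the role of a reduce step: aggregate each group's records.
        return Pipe((key, fn(values)) for key, values in self._records)

    def to_dict(self):
        return dict(self._records)


# Hypothetical sample input, not real Twitter data.
tweets = ["hadoop is big", "big data is big"]
counts = (
    Pipe(tweets)
    .flat_map(lambda line: line.lower().split())  # split lines into words
    .group_by(lambda word: word)                  # group identical words
    .map_values(len)                              # count each group
    .to_dict()
)
print(counts)
```

The developer describes the data flow as a chain of operations and never writes a mapper or reducer class directly, which is the kind of skills barrier the article refers to.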

Today, Cascading enables Twitter developers to create complex data processing workflows in their favorite programming languages while easily scaling to handle petabytes of data. Twitter has signed a contributor agreement with Concurrent so that Twitter's open source contributions are easily applied to the Cascading project.

Three teams dive deep

Three Twitter teams use Cascading, each with a different programming language: the revenue team uses Scala, the publisher analytics team uses Clojure, and the analytics team uses Jython.

The revenue team helps advertisers determine which ads are most effective by analyzing the contents of ads, Twitter topics, and so on to help increase conversion rates. They wrote Scalding, the open source Scala API for Cascading, so that developers can write in Scala and run on Hadoop.

The publisher analytics team helps webmasters understand how Twitter users are engaging with information on brands, websites, and online influencers. They built and open-sourced Cascalog, a Clojure-based language that uses Cascading as the job execution engine.

The analytics team's mission is to understand Twitter user activity. The team needed an easier way to perform detailed, complex analysis of users who follow other users or who follow the same people. It created PyCascading, a Python wrapper for Cascading that lets developers control the full data processing workflow from Python.
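The kind of question described above, which users follow the same people, can be sketched in a few lines of plain Python. This is not PyCascading code. The `shared_follow_counts` function and the follower edges are assumptions made up for illustration, and a production job would run the same group-and-count logic over a distributed data set.

```python
from collections import defaultdict
from itertools import combinations

# Hypothetical (follower, followed) edges, not real Twitter data.
follows = [
    ("alice", "carol"),
    ("bob", "carol"),
    ("alice", "dave"),
    ("bob", "dave"),
    ("erin", "dave"),
]


def shared_follow_counts(edges):
    """Count, for each pair of users, how many accounts they both follow."""
    # Group step: collect the set of followers for each followed account.
    followers_of = defaultdict(set)
    for follower, followed in edges:
        followers_of[followed].add(follower)

    # Count step: every pair of followers of one account shares that account.
    shared = defaultdict(int)
    for followers in followers_of.values():
        for pair in combinations(sorted(followers), 2):
            shared[pair] += 1
    return dict(shared)


result = shared_follow_counts(follows)
print(result)
```

The grouping by followed account corresponds to a shuffle keyed on that account, and the pair counting corresponds to a second aggregation, so even this small analysis would take more than one MapReduce pass if written by hand.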

In all these cases, Cascading shields developers from the underlying complexity of writing, optimizing, and executing MapReduce jobs. It allows each team to deliver the highly complex information and functionality the business needs quickly and efficiently. The end result is a set of important insights that Twitter can use to continuously improve its wildly popular service.

This article, "Twitter's programmers speed Hadoop development," was originally published at InfoWorld.

Copyright © 2013 IDG Communications, Inc.