Do Twitter analysis the easy way with MongoDB

There's gold in all them tweets if you have Hadoop -- or not. For simple Twitter analysis, try MongoDB's aggregation framework

These days, Jonathan Freeman seems to write my posts more often than I do. While I was tied up at the MongoDB World event, Jonathan wrote this great post about a quick and easy way to analyze tweets using MongoDB. It's a great approach, using MongoDB features many developers may not be familiar with. I hope you find it useful. -- Andrew Oliver

With all the World Cup excitement, I found myself wondering what the Twitter-scape looked like. Who is tweeting? What are they tweeting about? Where are they? What language are they tweeting in?

Obviously, such questions can apply to any tweet-worthy event. Along with the idly curious like me, various types of businesses from tech startups to local restaurants might want to know: What's my most vocal demographic? What time of day are people tweeting about my service most often? How do people feel about the new website?

Collecting all this data and analyzing it might seem like a big investment, but with the right tools it becomes trivially easy. In this article, I show you how to analyze tweet data using MongoDB as both the data store and the analytics engine. MongoDB has powerful analytics tools and straightforward pluggability into Hadoop for when you have a question that needs a more generic tool. I'm using tweets about the World Cup to demonstrate, but the concepts are generic and can be easily applied to your own data set.

The aggregation framework

MongoDB has long had an implementation of MapReduce, but it's all done in JavaScript and not particularly fast. All the processing happens in V8 (Google's JavaScript engine), which is a black box to MongoDB, so it can't do any optimization on your MapReduce code.

The alternative is to use the MongoDB aggregation framework, which is a data-processing tool based on the concept of pipelines. You're given a handful of simple building blocks that expect one or more documents as input, perform an operation on that input, and output one or more documents. They can then be chained together to form arbitrarily complex queries. Because it's a part of the database and not running in a third-party virtual machine (which is what's happening with V8), it's much faster than the MongoDB MapReduce implementation.
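To make the pipeline idea concrete, here's a minimal sketch in plain JavaScript -- not MongoDB itself, just an in-memory illustration of how stages chain, where each stage takes an array of documents and hands a new array to the next stage. The sample tweet documents and their `lang` field are hypothetical:

```javascript
// Toy tweet documents standing in for a MongoDB collection
const tweets = [
  { text: 'GOAL!', lang: 'en' },
  { text: 'What a match', lang: 'en' },
  { text: 'Extra time!', lang: 'en' },
  { text: 'Gooool!', lang: 'pt' },
  { text: 'Que jogo', lang: 'pt' },
  { text: 'Unglaublich', lang: 'de' },
];

// $match: keep only documents that pass the predicate
const match = pred => docs => docs.filter(pred);

// $group: count documents per key (a simplified { $sum: 1 } accumulator)
const group = keyFn => docs => {
  const counts = {};
  for (const d of docs) counts[keyFn(d)] = (counts[keyFn(d)] || 0) + 1;
  return Object.entries(counts).map(([_id, count]) => ({ _id, count }));
};

// $sort and $limit behave like their MongoDB counterparts
const sortBy = cmp => docs => [...docs].sort(cmp);
const limit = n => docs => docs.slice(0, n);

// Chain the stages: the output of each is the input of the next
const pipeline = [
  match(d => d.lang !== 'de'),
  group(d => d.lang),
  sortBy((a, b) => b.count - a.count),
  limit(1),
];
const result = pipeline.reduce((docs, stage) => stage(docs), tweets);
console.log(result); // [ { _id: 'en', count: 3 } ]
```

In the real aggregation framework you'd express the same logic declaratively -- `{$match: ...}`, `{$group: ...}`, and so on -- and the database would run it internally, which is exactly why it can optimize the work in ways it can't for MapReduce.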

All the pipeline operators (the individual commands that get strung together) are well documented, so I don't go into all of the inner workings here. I do, however, give a quick description of the ones I'm using:

  • $match: Filters incoming documents using the full MongoDB query syntax
  • $sort: Sorts incoming documents based on a given field
  • $limit: Limits the output to a specified number of documents
  • $project: Reshapes documents -- add fields, remove fields, add subdocuments, and so on
  • $group: Aggregates documents on a given field (or fields). It also lets you take counts, sums, averages, and more of fields in the documents being grouped together.
  • $unwind: A handy tool that doesn't have a SQL equivalent. It's designed to operate on an array field of the input documents, creating a copy of the document for each element in the array and replacing the array with that single element. An example usually clarifies this one, starting with the input.

  {
    _id: 'myDoc1',
    tags: ['pie', 'chocolate']
  }

The operator would then be:

{$unwind: '$tags'}

Finally, the output:

  {
    _id: 'myDoc1',
    tags: 'pie'
  },
  {
    _id: 'myDoc1',
    tags: 'chocolate'
  }
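If $unwind still feels abstract, here is the same transformation written out in plain JavaScript -- a sketch of the behavior, not MongoDB's actual implementation:

```javascript
// Mimic $unwind: emit one copy of each document per array element,
// with the array field replaced by that single element.
function unwind(docs, field) {
  const out = [];
  for (const doc of docs) {
    for (const value of doc[field]) {
      out.push({ ...doc, [field]: value });
    }
  }
  return out;
}

const input = [{ _id: 'myDoc1', tags: ['pie', 'chocolate'] }];
const unwound = unwind(input, 'tags');
// unwound: [ { _id: 'myDoc1', tags: 'pie' },
//            { _id: 'myDoc1', tags: 'chocolate' } ]
```

This is why $unwind is so useful before a $group: each tag (or hashtag, or mention) becomes its own document, so it can be counted individually.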
