Do Twitter analysis the easy way with MongoDB

There's gold in all them tweets if you have Hadoop -- or not. For simple Twitter analysis, try MongoDB's aggregation framework

Now, if you were feeding these results to a service that consumes JSON, you'd probably want to rename the _id field to make clear what the data actually means. Instead of doing this in application-level code, you can add another stage to the pipeline:

The query:

db.allTweets.aggregate([
  { $group: {
    _id: '$lang',
    count: { $sum: 1 }
  }},

  { $match: {
    count: { $gt: 10000 }
  }},

  { $sort: {
    count: -1
  }},

  { $project: {
    language: '$_id',
    count: 1,
    _id: 0
  }}
]);

The result:

{ "count" : 516745, "language" : "en" }
{ "count" : 262056, "language" : "es" }
{ "count" : 55117, "language" : "pt" }
{ "count" : 36122, "language" : "ar" }
{ "count" : 30003, "language" : "fr" }
{ "count" : 24851, "language" : "ja" }
{ "count" : 17930, "language" : "in" }
{ "count" : 15876, "language" : "it" }

In the $project stage, I create a new field, language, that takes the value of the _id field, keep the count field, and drop the _id field. To drop most fields, you simply leave them out of $project. The one exception is _id: it is passed along by default, so you have to exclude it explicitly if you don't want it.
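
To make the effect of that last stage concrete, here is a minimal sketch in plain JavaScript that mimics what $project does to each grouped document. It is an illustration only, not how MongoDB implements the stage; the projectLanguage function and the two sample documents are made up for this example, with counts borrowed from the result above.

```javascript
// Plain JavaScript mimic of the $project stage; illustration only.
// projectLanguage is a hypothetical helper, not part of MongoDB.
function projectLanguage(doc) {
  return {
    count: doc.count,   // count: 1          -> keep the field as-is
    language: doc._id   // language: '$_id'  -> new field copied from _id
    // _id: 0 -> _id is simply left out of the output
  };
}

// Sample input shaped like the output of the $group stage.
var grouped = [
  { _id: 'en', count: 516745 },
  { _id: 'es', count: 262056 }
];

var projected = grouped.map(projectLanguage);
console.log(JSON.stringify(projected));
// prints [{"count":516745,"language":"en"},{"count":262056,"language":"es"}]
```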

Dealing with arrays
Let's change course and find the top five hashtags in use. In each document, the hashtags are stored as an array of subdocuments under entities.hashtags. To group on the hashtag text, you first have to pull the elements out of the array, which is the job of the $unwind stage.
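
If the array step is hard to picture, the following plain JavaScript sketch mimics what unwinding does to a single tweet: one output document per array element, each carrying a single hashtag where the array used to be. The unwindHashtags function and the sample tweet are hypothetical and keep only the fields this example needs; MongoDB does this work for you inside the pipeline.

```javascript
// Plain JavaScript mimic of $unwind on 'entities.hashtags'; illustration only.
// unwindHashtags and sampleTweet are hypothetical names for this sketch.
function unwindHashtags(tweet) {
  var tags = (tweet.entities && tweet.entities.hashtags) || [];
  // One copy of the tweet per hashtag; a missing or empty array yields nothing.
  return tags.map(function (tag) {
    return Object.assign({}, tweet, {
      entities: Object.assign({}, tweet.entities, { hashtags: tag })
    });
  });
}

var sampleTweet = {
  lang: 'en',
  entities: { hashtags: [{ text: 'WorldCup' }, { text: 'CRC' }] }
};

var unwound = unwindHashtags(sampleTweet);
unwound.forEach(function (doc) {
  // After unwinding, hashtags is a single subdocument, so .text is reachable.
  console.log(doc.lang, doc.entities.hashtags.text);
});
```

Once every document holds exactly one hashtag, grouping on the hashtag text works the same way as grouping on lang did earlier.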

The query:

db.allTweets.aggregate([
  { $unwind: '$entities.hashtags' },

  { $group: {
    _id: '$entities.hashtags.text',
    tagCount: { $sum: 1 }
  }},

  { $sort: {
    tagCount: -1
  }},

  { $limit: 5 }
]);

The result:

{ "_id" : "WorldCup", "tagCount" : 399139 }
{ "_id" : "Brasil2014", "tagCount" : 172419 }
{ "_id" : "worldcup", "tagCount" : 98049 }
{ "_id" : "CRC", "tagCount" : 70970 }
{ "_id" : "FRA", "tagCount" : 67226 }
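
Notice that WorldCup and worldcup show up as separate entries, because grouping on a string is case-sensitive. One possible way to merge them, offered as a sketch rather than as part of the original query, is to group on the lowercased tag text with the $toLower operator. The buildTopHashtagsPipeline helper is a hypothetical name for this example, and the aggregate call only runs inside the mongo shell, where db is defined.

```javascript
// Hypothetical variant of the hashtag pipeline: group on lowercased tag text
// so that case variants of the same hashtag are counted together.
function buildTopHashtagsPipeline(limit) {
  return [
    { $unwind: '$entities.hashtags' },
    { $group: {
      _id: { $toLower: '$entities.hashtags.text' },
      tagCount: { $sum: 1 }
    }},
    { $sort: { tagCount: -1 } },
    { $limit: limit }
  ];
}

var topHashtagsPipeline = buildTopHashtagsPipeline(5);

// Runs only inside the mongo shell, where `db` and `printjson` exist.
if (typeof db !== 'undefined') {
  db.allTweets.aggregate(topHashtagsPipeline).forEach(printjson);
}
```

Building the pipeline as a plain array first also makes it easy to inspect or reuse before sending it to the server.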

Because the framework is a pipeline, it's easy to string together simple building blocks into arbitrarily complex queries. What if you want to know who the top five tweeters are, what hashtags they use, and how many times they use them? All you'd have to do is combine these same building blocks to work the data into the shape you want. Instead of including that query here, I'll leave it as an open challenge for the adventurous. Tweet me a GitHub gist with an answer at @freethejazz.

Hopefully you've seen that it's easy to start getting answers to important questions using MongoDB and the aggregation framework. Of course, depending on your needs, the framework might not be the only solution. If you want to do sentiment analysis on the tweets, for example, you're not going to do it with MongoDB alone. You could instead plug into Hadoop, using the MongoDB connector for Hadoop, and do the heavy lifting on a Hadoop cluster. For many reporting tasks, however, the aggregation framework will do everything you need it to.

This article, "Do Twitter analysis the easy way with MongoDB," was originally published at InfoWorld.com. Keep up on the latest news in application development and read more of Andrew Oliver's Strategic Developer blog at InfoWorld.com. For the latest business technology news, follow InfoWorld.com on Twitter.
