Do Twitter analysis the easy way with MongoDB

There's gold in all them tweets if you have Hadoop -- or not. For simple Twitter analysis, try MongoDB's aggregation framework

Page 2 of 3

Asking the questions
To start, I put together an application in Node.js from open source libraries that pulls tweets from the streaming Twitter API and loads them into MongoDB. Looking for tweets containing "worldcup" and "fifa" (two of the most used hashtags for the subject), I pulled in about a million tweets over the course of five games. Each tweet looks something like this.

Now that you have a handle on the tool and a data set, it's time to formulate your questions in the query language.

To find out how many tweets were created in each language, you want to group the tweets by language and count how many tweets are in each group. This can be done in a single step with $group. (On each tweet document, there is a 'lang' field that contains an abbreviated language string.)

db.allTweets.aggregate([
  { $group: {
     _id: '$lang',
     count: {$sum: 1}
  }},
]);

The result:

{ "_id" : "am", "count" : 2 }
{ "_id" : "zh", "count" : 370 }
{ "_id" : "vi", "count" : 667 }
{ "_id" : "ur", "count" : 33 }
{ "_id" : "hy", "count" : 4 }
{ "_id" : "bg", "count" : 112 }
{ "_id" : "el", "count" : 210 }
{ "_id" : "gu", "count" : 4 }
{ "_id" : "ne", "count" : 63 }
...

This almost gives us what we want, but it's not in order, so let's add a step in the pipeline to sort on the count field in descending order:

db.allTweets.aggregate([
  { $group: {
     _id: '$lang',
     count: {$sum: 1}
   }},

   {$sort: {
    count: -1
}}
]);

The result:

{ "_id" : "en", "count" : 516745 }
{ "_id" : "es", "count" : 262056 }
{ "_id" : "pt", "count" : 55117 }
{ "_id" : "ar", "count" : 36122 }
{ "_id" : "fr", "count" : 30003 }
{ "_id" : "ja", "count" : 24851 }
{ "_id" : "in", "count" : 17930 }
{ "_id" : "it", "count" : 15876 }
{ "_id" : "ro", "count" : 9878 }
...

What if you want to see the top five languages used to tweet about World Cup games? Add another step to the pipeline:

The query:

db.allTweets.aggregate([
{ $group: {
_id: '$lang',
count: {$sum: 1}
}},

{$sort: {
count: -1
}},

{$limit: 5}
]);

The result:

{ "_id" : "en", "count" : 516745 }
{ "_id" : "es", "count" : 262056 }
{ "_id" : "pt", "count" : 55117 }
{ "_id" : "ar", "count" : 36122 }
{ "_id" : "fr", "count" : 30003 }

Notice that I'm $sum-ing 1, which adds one to count for each item in the group. I could also sum together values on the documents. If there was a timesViewed property and we wanted to see how many times tweets in a certain language were viewed, it would look like:

db.allTweets.aggregate([
{ $group: {
_id: '$lang',
totalViews: {$sum: '$timesViewed'}
}},
]);

What if you want all of the languages that have above a certain threshold of tweets -- 10,000, for example? Remove the $limit and add a $match to help you filter.

db.allTweets.aggregate([
{ $group: {
_id: '$lang',
count: {$sum: 1}
}},

{ $match: {
count: { $gt: 10000 }
}},

{ $sort: {
count: -1
}}
]);

The result:

{ "_id" : "en", "count" : 516745 }
{ "_id" : "es", "count" : 262056 }
{ "_id" : "pt", "count" : 55117 }
{ "_id" : "ar", "count" : 36122 }
{ "_id" : "fr", "count" : 30003 }
{ "_id" : "ja", "count" : 24851 }
{ "_id" : "in", "count" : 17930 }
{ "_id" : "it", "count" : 15876 }

| 1 2 3 Page 2
From CIO: 8 Free Online Courses to Grow Your Tech Skills
View Comments
Join the discussion
Be the first to comment on this article. Our Commenting Policies