If you've read anything about the phenomenon of Big Data, then you probably picture armies of servers chugging some form of the Hadoop ecosystem to crunch mountains of information streaming from internal corporate transactions, emails, instant messages, and Web logs or external social media interactions and public information.
Sounds expensive, right? Not necessarily. Thanks to cloud computing, freely available tools, and limitless free data, you can do major league Big Data analysis for a C-note.
At a GigaOm conference on Big Data last week, Pete Warden, who claims he lives on ramen noodles, described how he spent just $100 to scrape 500 million Web pages, including 220 million Facebook public profiles, using his own Web crawler and a 100-machine cluster running on Amazon EC2. He was able to analyze the information to match Twitter, LinkedIn, and Facebook accounts with the email accounts of users of his email tool.
Then, just for fun, he created interactive maps showing how various countries, U.S. states, and cities connect with each other over social media and what types of fan pages they frequent. For example, users in Idaho and Utah tend to connect with people inside their state and nearby states, while East and West Coast cities have lots of connections with each other. Los Angeles users tend to like Michael Jackson, Starbucks, and Megan Fox fan pages, while Eastern Idaho users are more into Glenn Beck and the Church of Jesus.
The result was legal action from Facebook, which cost him about 30 times more than he spent collecting and displaying the data. However, he was able to reach an agreement that didn't run him anything beyond the legal fees.
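For a rough sense of scale, the figures quoted above can be run through some back-of-the-envelope arithmetic. The sketch below uses only the numbers reported in this article ($100, 500 million pages, 220 million Facebook profiles, and legal costs of "about 30 times" the crawl). The constant and function names are illustrative assumptions, and the legal-fee figure is an approximation derived from that multiplier, not a number Warden reported.

```python
# Back-of-the-envelope arithmetic on the figures quoted in the article.
# All names here are illustrative; only the input numbers come from the text.

CRAWL_COST_USD = 100             # what Warden says he spent on EC2
PAGES_SCRAPED = 500_000_000      # total Web pages crawled
FACEBOOK_PROFILES = 220_000_000  # public Facebook profiles among those pages
LEGAL_MULTIPLIER = 30            # "about 30 times more" than the crawl cost


def pages_per_dollar(pages: int, cost_usd: float) -> float:
    """Return how many pages were crawled for each dollar spent."""
    return pages / cost_usd


def estimated_legal_fees(cost_usd: float, multiplier: float) -> float:
    """Approximate the legal bill implied by the article's multiplier."""
    return cost_usd * multiplier


def facebook_share(profiles: int, pages: int) -> float:
    """Return the fraction of crawled pages that were Facebook profiles."""
    return profiles / pages


if __name__ == "__main__":
    print(f"Pages per dollar: {pages_per_dollar(PAGES_SCRAPED, CRAWL_COST_USD):,.0f}")
    print(f"Implied legal fees: ${estimated_legal_fees(CRAWL_COST_USD, LEGAL_MULTIPLIER):,.0f}")
    print(f"Facebook share of crawl: {facebook_share(FACEBOOK_PROFILES, PAGES_SCRAPED):.0%}")
```

By these numbers, the crawl worked out to roughly 5 million pages per dollar, while the implied legal bill was on the order of $3,000.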
Warden has just announced the release of his Data Science Toolkit, a set of free tools and interfaces that enable you to analyze massive amounts of unstructured data. It includes OCR capabilities that can convert scanned images and PDFs to text files and tools for filtering geographic locations from news items and blogs. You can run it on a Hadoop cluster in an Amazon EC2 cloud or download the toolkit as a virtual machine.
Additionally, an eight-person company called Tap11 has a tool for tapping and analyzing 140 million tweets daily to help companies understand what users are saying about their brands and products. Client Goldman Sachs, for example, discovered a lot of talk about criminals and bonuses. Another startup, Bundle, offers a financial planning service that taps Citi customer transaction data so users can see how people in their own area, or in the area they're moving to, typically spend their money.
So it turns out Big Data doesn't necessarily cost big bucks, which means that the potential for hundreds of startups analyzing just about everything is enormous. It's a new frontier. Just watch out for lawyers guarding territory already staked out.
This article, "Big Data runs afoul of big lawyers," was originally published at InfoWorld.com.