September 22, 2009

Data mining service crawls billions of Web pages

Startup 80legs launches data mining service that leverages a 50,000-computer grid to search, crunch millions of Web pages in minutes

80legs has officially launched its service, which brings supercomputer-scale data mining of the Web to companies and even individuals.

The Houston, Texas-based startup leverages a grid of 50,000 servers to search and crunch millions of Web pages within minutes, CEO Shion Deysarkar told Computerworld on Monday ahead of the Demo Fall 09 conference in San Diego.


Target customers include market researchers looking to mine public opinion on a particular product or service, lawyers searching for copyright infringement and piracy, and online ad agencies looking to do competitive analysis of where rival firms are placing their ads, Deysarkar said.

But some individuals are even using the 80legs beta to research reviews and opinions on various wines. This involves 80legs' app, which uses some natural language processing technology and is more sophisticated than simple Google keyword searches, Deysarkar said.

Each search will cost $2 per million pages crawled, plus 3 cents per CPU-hour used. A search involving 1 million pages would be returned within 10-20 minutes, he said, but 80legs can search the entire Web if so desired.
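To make the pricing concrete, the quoted rates can be turned into a rough cost estimate. The sketch below uses only the two rates stated above; the function name and the CPU-hour figure in the example are illustrative assumptions, since 80legs did not say how many CPU-hours a typical job consumes.

```python
from decimal import Decimal

# Rates quoted by 80legs in the article.
PRICE_PER_MILLION_PAGES = Decimal("2.00")  # dollars per million pages crawled
PRICE_PER_CPU_HOUR = Decimal("0.03")       # dollars per CPU-hour used


def estimate_job_cost(pages_crawled, cpu_hours):
    """Estimate a job's cost in dollars (hypothetical helper, not an 80legs API)."""
    crawl_cost = PRICE_PER_MILLION_PAGES * Decimal(pages_crawled) / Decimal(1_000_000)
    compute_cost = PRICE_PER_CPU_HOUR * Decimal(str(cpu_hours))
    return crawl_cost + compute_cost


# Assumed workload: 1 million pages and 100 CPU-hours.
# The CPU-hour figure is a made-up input, not a number from 80legs.
print(estimate_job_cost(1_000_000, 100))
```

Under that assumed workload, the crawl would cost $2 and the processing $3, for a total of $5.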

Customers must fill out a job form and either select one of the semantic analysis or text extraction apps written by 80legs, or upload their own app, which must plug into either a Java or .Net application programming interface (API).

80legs doesn't own its own grid, but instead rents it from a fellow startup, Plura Processing, which shares the same venture capital firm, Creeris Ventures.

80legs originally planned to leverage Plura's grid to develop its own Webcrawling-based service, but later decided to "let other people develop their own services and ideas while we provide the crawling," Deysarkar said.

Deysarkar said Amazon Web Services (AWS) is 80legs' main competitor, though he claims companies that use AWS will face three disadvantages: 1) they will only be able to leverage a fraction of 80legs' 50,000-node grid; 2) they will have to go to the expense and trouble of writing their own Web crawling app; 3) they will pay more than twice as much in crawling and usage charges.

80legs plans to offer Perl and Python APIs in the future. And in two months, the company aims to release its own iPhone-like App Store for independent developers to sell apps to end users.

In contrast to Apple's App Store, developers will be able to set their own price and keep 100 percent of the revenue, Deysarkar said.

Computerworld is an InfoWorld affiliate.


©1994-2009 Infoworld, Inc.