Why we need data we can believe in

As we plunge headlong into the cloud era, we need reliable, API-accessible data sources to power a new generation of applications

You've probably heard the cliché a million times: Data is the most valuable asset of any business. Originally, that was meant to apply to a company's financial, customer, and product data.

But what about data outside the organization? Companies pay selectively, often at premium rates, for data directly relevant to their business -- D&B for financial data, Experian for credit information, and so on. But as with software, the market for data has opened up, with much of it available free of charge.


As InfoWorld's Paul Krill noted last month, developers are increasingly turning to API-accessible data sources for their applications. Last week InfoWorld posted "12 APIs every programmer should know about," which includes everything from a feed of real-time flight delays to the definitive repository of the U.S. government's social media accounts.

A number of aggregators pull together a wild mix of data sources. The Windows Azure Data Marketplace was an early mover and today offers 167 data sources, 82 of which are free. The Programmable Web, a decade-old Web directory recently bought by MuleSoft, lists thousands of APIs that return data, though many have fallen into disrepair. Several upstarts, such as the big data venture InfoChimps, aggregate thousands of data sets and APIs -- although, again, many are out of date or no longer available.

The data-as-a-service game is tough. A startup called Factual launched in 2007 with ambitions of becoming a clearinghouse for a huge range of data, but narrowed its sights in 2010 to delivering high-quality, location-based data. I recently interviewed Factual's founder and CEO, Gil Elbaz, who also co-founded Applied Semantics, the developers of AdSense, bought by Google for $102 million in 2003.

When I asked Elbaz about the technology behind Factual, it quickly emerged that most of it was devoted to ensuring data quality. He says, "You need to run cleaning algorithms against the raw data. The paradigm that we believe in is that you should always store and reprocess data from fundamentals. So we store the rawest form of data -- all the data we collect, either from the Web or from our partners. Any time any of our cleaning algorithms improves even slightly, we don't apply it to the database -- we apply it to the underlying raw data sources, which is why we have such large storage requirements."
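
To make the "store raw, reprocess from fundamentals" idea concrete, here is a minimal Python sketch. It is not Factual's code: the class RawFirstStore, the functions clean_v1 and clean_v2, and the sample record are all hypothetical, invented for illustration. It assumes only what Elbaz describes, namely that raw records are kept untouched and the cleaned view is rebuilt from them whenever a cleaning algorithm changes.

```python
from typing import Callable, Dict, List

RawRecord = Dict[str, str]
Cleaner = Callable[[RawRecord], RawRecord]


class RawFirstStore:
    """Hypothetical store: keeps raw records untouched, rebuilds cleaned views."""

    def __init__(self) -> None:
        # Append-only list of raw records; never modified in place.
        self._raw: List[RawRecord] = []

    def ingest(self, record: RawRecord) -> None:
        # Store a copy of the record exactly as it was collected.
        self._raw.append(dict(record))

    def rebuild(self, cleaner: Cleaner) -> List[RawRecord]:
        # Reprocess from the raw data, not from a previously cleaned table.
        return [cleaner(dict(r)) for r in self._raw]


def clean_v1(record: RawRecord) -> RawRecord:
    # First (hypothetical) cleaning pass: trim stray whitespace.
    return {key: value.strip() for key, value in record.items()}


def clean_v2(record: RawRecord) -> RawRecord:
    # Improved (hypothetical) pass: also normalize the city's capitalization.
    cleaned = clean_v1(record)
    cleaned["city"] = cleaned["city"].title()
    return cleaned


store = RawFirstStore()
store.ingest({"name": "  Joe's Diner ", "city": "los angeles"})

view_v1 = store.rebuild(clean_v1)  # cleaned view under the old algorithm
view_v2 = store.rebuild(clean_v2)  # rebuilt from raw under the new algorithm
```

The trade-off is the one Elbaz points to: keeping every raw record costs storage, but any improvement to a cleaner can be applied to everything ever collected, rather than being layered on top of data that an older algorithm has already altered.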
