When President Barack Obama signed the Open Data Executive Order in May 2013, many IT leaders applauded the White House's decision to release treasure troves of public data as part of an important government initiative for greater transparency.
However, what many didn't bargain for was the state in which they'd find these once-buried data sets. "A dog's breakfast," "a train wreck," "a massive hairball" -- those are a few of the terms IT leaders have used to describe the vast volumes of public data now being made available to the general public.
Yet the business opportunities are unprecedented -- open data offers bits and bytes of public information that are freely available for anyone to use to build new businesses, generate revenue, develop new products, conduct research or empower consumers. With the federal government as the single largest source of open data in the U.S., we now have unfettered access to information about everything from bus routes and pollution levels to SEC filings and weather patterns. Savvy businesses are using public data to predict consumer behavior, shape marketing campaigns, develop life-saving medications, evaluate home properties, even rate retirement plans.
In fact, recent research from the McKinsey Global Institute predicts that the use of open data could lead to the creation of $3 trillion or more in annual value across the global economy. "The vast potential of open data is not only to increase transparency and accountability of institutions, but also to help create real economic value both for companies and individuals," says Michael Chui, a partner at the McKinsey Global Institute, the business and economic research arm of McKinsey & Co.
Hoping to capitalize on this open data revolution, IT leaders are at the forefront of efforts to convert terabytes of data into new revenue streams. Forget about the open-source movement's clarion call for free software, greater collaboration and anti-establishment bootstrapping. Today's open data trend is driven by a desire for both greater government transparency and a fatter bottom line. And as more and more techies clamor for a seat at the table, they're finding that the era of open data represents a prime opportunity to prove that they're indispensable revenue-generators, not just server-room sages.
"We're at a tipping point," says Joel Gurin, senior adviser at New York University's Governance Lab (GovLab) and author of Open Data Now: The Secret to Hot Startups, Smart Investing, Savvy Marketing, and Fast Innovation. "This is the year open data goes from being a specialized expertise to becoming part of a CIO's tool kit. It's a very exciting time."
But unlocking open data's value remains a challenge. For one thing, much of today's open data flows from a whopping 10,000 federal information systems, many of which are based on outdated technologies. And because open data can be messy and riddled with inaccuracies, IT professionals struggle to achieve the data quality and accuracy levels required for making important business decisions. Then there are the data integration headaches and lack of in-house expertise that can easily hinder the transformation of open data into actionable business intelligence.
Yet for those IT leaders who manage to convert decades-old county records, public housing specs and precipitation patterns into a viable business plan, "the sky's the limit," says Gurin.
Gurin should know. As part of his work at the GovLab, he's compiling a who's who list of U.S. companies that are using government data to generate new business. Known as the Open Data 500, the list features a wide array of businesses -- from scrappy startups like Calcbench, which is turning SEC filings into financial insights for investors, to proven successes like The Climate Corporation. Recently purchased by Monsanto for about $1 billion, The Climate Corporation uses government weather data to revamp agricultural production practices.
Although their business models vary wildly, there's one thing the Open Data 500 companies have in common: IT departments that have figured out how to collect, cleanse, integrate and package reams of messy data for public consumption.
Take Zillow, for example. Zillow is an online real estate database that crunches housing data to provide homeowners and real estate professionals with estimated home values, foreclosure rates and the projected cost of renting vs. buying. Founded in 2005 by two former Microsoft executives, the Seattle-based outfit is now valued at more than $1 billion. But success didn't arrive overnight.
"Anyone who has worked with public record data knows that real estate data is among the noisiest you can get. It's a train wreck," says Stan Humphries, chief economist at Zillow, referring to the industry's lack of standard formatting. "It's our job to take that massive hairball and pull relevant facts out of it."
Today, Zillow has a team of 16 Ph.D.-wielding data analysts and engineers who use proprietary advanced analytics tools to synthesize everything from sales listings to census data into easy-to-digest reports. One of the biggest challenges for Zillow's IT department has been creating a system that integrates government data from more than 3,000 counties.
"There is no standard format, which is very frustrating," says Humphries. "We've tried and tried to push the government to come up with standard formats, but from [each individual] county's perspective, there's no reason to do it. So it's up to us to figure out 3,000 different ways to ingest data and make sense of it."
Complaints about the lack of standard protocols and formats are common among users of open data.
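To make the ingestion problem concrete, here is a minimal sketch of one common way to cope with source-specific formats: keep a small layout description per county and map every raw record onto one shared schema. This is an illustration only, not Zillow's system. The county names, column names, date formats and the `normalize_record` helper are all hypothetical.

```python
from datetime import datetime

# Hypothetical layouts: each county names its columns and formats dates differently.
COUNTY_LAYOUTS = {
    "county_a": {
        "fields": {"parcel_id": "PARCEL", "sale_price": "SALE_AMT", "sale_date": "SALE_DT"},
        "date_format": "%m/%d/%Y",
    },
    "county_b": {
        "fields": {"parcel_id": "apn", "sale_price": "price", "sale_date": "recorded"},
        "date_format": "%Y-%m-%d",
    },
}


def normalize_record(county, raw):
    """Map one raw county record onto a shared schema (illustrative only)."""
    layout = COUNTY_LAYOUTS[county]
    fields = layout["fields"]
    # Strip currency symbols and thousands separators before parsing the price.
    price_text = str(raw[fields["sale_price"]]).replace("$", "").replace(",", "").strip()
    sale_date = datetime.strptime(raw[fields["sale_date"]].strip(), layout["date_format"]).date()
    return {
        "county": county,
        "parcel_id": str(raw[fields["parcel_id"]]).strip(),
        "sale_price": float(price_text),
        "sale_date": sale_date,
    }


record = normalize_record(
    "county_a", {"PARCEL": " 12-34 ", "SALE_AMT": "$250,000", "SALE_DT": "03/15/2013"}
)
```

Under this approach, supporting a new county means adding a layout entry rather than writing new parsing code, although real records would likely need far more cleanup than this sketch attempts.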
"Data quality has always been and will continue to be a very important consideration," says Chui. "It's clear that one needs to understand the provenance of the data, the accuracy of the data, how often it's updated, its reliability -- these will continue to be important problems to tackle."
So far, Zillow's plan of attack using "sophisticated big data engineering" is working. In fact, a new set of algorithms has helped cut Zillow's median margin of error from 14 percent to 8 percent. And for 60 percent of all sales that occur, Zillow's estimated sales price is within 10 percent of the actual figure.
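For readers who want to see how accuracy figures of this kind can be computed, the sketch below takes paired estimates and actual sale prices and returns the median absolute percentage error along with the share of estimates that land within 10 percent of the sale price. It assumes that "median margin of error" means the median absolute percentage error. The `error_summary` function and the sample prices are hypothetical and are not drawn from Zillow.

```python
from statistics import median


def error_summary(estimates, sale_prices, tolerance=0.10):
    """Return (median absolute percent error, share of estimates within tolerance)."""
    if not estimates or len(estimates) != len(sale_prices):
        raise ValueError("need two non-empty lists of equal length")
    # Absolute error of each estimate, as a fraction of the actual sale price.
    errors = [abs(est - actual) / actual for est, actual in zip(estimates, sale_prices)]
    share_within = sum(1 for err in errors if err <= tolerance) / len(errors)
    return median(errors), share_within


# Made-up sample data for illustration.
median_error, share_within = error_summary(
    estimates=[216_000, 190_000, 260_000, 200_000],
    sale_prices=[200_000, 200_000, 200_000, 200_000],
)
```

The median is less sensitive than the mean to a handful of badly mispriced homes, which is one reason it is a common choice for summarizing valuation error.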
Tapping the richness of weather data
Although best known for the friendly forecasts on The Weather Channel and Weather.com, The Weather Company has spent two years branching out and fashioning itself into a provider of a powerful big data analytics platform. Today, the Atlanta-based company's WeatherFX data service ingests more than 20TB of data per day, including satellite pictures, radar imagery and more, from more than 800 public and private sources. By crunching terabytes of information into insights that impact the bottom line, WeatherFX is helping insurance companies, media conglomerates and airlines save money, drive revenue and satisfy customers.
For example, by mashing up hail data with policyholder addresses, insurers can alert homeowners to potential damage to their homes and cars. "By warning customers of pending dangers, insurers can encourage customers to protect their personal property, which lessens the impact of claims on insurers caused by bad weather," says Bryson Koehler, CIO at The Weather Company.
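A mashup like this is, at its simplest, a join between two data sets on a shared location key. The sketch below assumes hail reports and policyholder records that both carry a ZIP code and flags every policy in a ZIP code with hail at or above a size threshold. The field names, the threshold and the `policyholders_to_alert` function are hypothetical. A production system would more likely match on geocoded addresses and storm footprints than on ZIP codes.

```python
def policyholders_to_alert(hail_reports, policyholders, min_hail_inches=1.0):
    """Return policy IDs located in ZIP codes with hail at or above the threshold."""
    # Collect the ZIP codes where reported hail is large enough to matter.
    affected_zips = {
        report["zip"] for report in hail_reports if report["hail_inches"] >= min_hail_inches
    }
    # Keep the policyholders whose ZIP code appears in the affected set.
    return [holder["policy_id"] for holder in policyholders if holder["zip"] in affected_zips]


# Made-up sample data for illustration.
alerts = policyholders_to_alert(
    hail_reports=[{"zip": "30301", "hail_inches": 1.75}, {"zip": "30302", "hail_inches": 0.5}],
    policyholders=[{"policy_id": "P1", "zip": "30301"}, {"policy_id": "P2", "zip": "30302"}],
)
```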
Airlines also use weather data. They may, for instance, monitor storm patterns and reposition aircraft to avert scheduling delays. And retailers are discovering that keeping track of the weather can help them anticipate consumer demand and thereby boost sales -- they might, for example, stock their shelves with anti-frizz hair products when a heat wave is expected.
Still, packaging 800 sources of data, much of it open, requires heavy lifting on the part of The Weather Company's IT department. Koehler says the company had to assemble "an incredibly complex environment" to manage "a dog's breakfast" of documents. Nearly two years ago, The Weather Company rebuilt its entire consolidated platform, called SUN (Storage Utility Network), which is deployed on Riak NoSQL databases from Basho Technologies and runs across four availability zones in the Amazon Web Services cloud. Today, the renewable compute platform gathers 2.25 billion weather data points 15 times per hour.
Overseeing this new IT platform is a data science team composed of 220 meteorologists and hundreds of engineers, each with in-depth domain knowledge of atmospheric phenomena. "When you're ingesting data from 800 different sources, you need to have some level of expertise tied to each one," says Koehler. "Most Java developers aren't going to be able to tell you, in intricate detail, the difference between a 72 and a 42 on a dew-point scale and how that may or may not impact a business."
Yet for all the IT leaders spearheading today's open data revolution, many argue that it's time the U.S. government played a greater role in the collection, cleaning and sharing of data. In fact, open data services provider Socrata reports that 67.9 percent of the everyday citizens surveyed for its 2010 Open Government Data Benchmark Study said they believe that government data is the property of taxpayers and should be free to all citizens. Such sentiment has already prompted the U.S. government to launch new services through its Data.gov website, enabling visitors to easily access statistical information. But that hasn't stopped techies from drawing up a laundry list of open data demands for government officials.
"A standard structure, a standard set of identifiers, greater data cleanliness, releasing data in a database-friendly format, making it machine-readable, making sure we can use the data without restrictions -- these are all ways that government can improve the data they're supplying," says Ryan Alfred, president of BrightScope, a provider of financial information and investment research.
Alfred has every right to grumble. He spent "five long years" huddled in public disclosure rooms at the Department of Labor poring over paper-based retirement plans and auditors' reports before launching BrightScope. Today, his San Diego-based company sells Web-based software that provides corporate plan sponsors, asset managers and financial advisers with in-depth retirement plan ratings and investment analytics. Many of BrightScope's ratings are published online for free, while advisers and large enterprises fork over anywhere from $5,000 to $200,000 per year for highly sophisticated and customized prospecting tools.
A team of nearly 15 data analysts processes more than 60,000 retirement plans per year while integrating data captured from SEC filings, financial industry regulatory authorities, census data and public websites. Once collected, the data is then cleansed and linked together using proprietary mapping algorithms.
"There's a big difference between free open data and actionable intelligence," says Alfred. "Companies don't want to purchase data that requires data analysts and infrastructure and integration. So there's a ton of work on both the manual side and the engineering side to get the data into a format that can be reliably used by our clients." That's a complex process that, he says, could be simplified with a bit of help from the government.
The good news is that the government is taking steps to improve data quality. Calcbench has benefited from that effort. The New York-based startup has created a sophisticated engine that turns complex financial data such as earnings statements, cash flow statements and companies' balance sheets into a more readable format. To ensure accuracy, Calcbench uses proprietary artificial intelligence tools that sift through the data to detect errors such as misdated year-end reports. Finance professionals such as investors, auditors and industry researchers use the resulting repackaged and cleansed data to compare financial ratios, examine entire industries and review competitors' disclosure data.
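Calcbench's error-detection tools are proprietary, so the following is not its method. It is a simple rule-based sketch of one kind of check the paragraph describes: flagging an annual filing whose reported period end does not line up with the company's stated fiscal year end. The record layout, the seven-day tolerance and the `flag_misdated_year_ends` function are assumptions made for illustration.

```python
from datetime import date


def flag_misdated_year_ends(filings, tolerance_days=7):
    """Return IDs of filings whose period end is far from the fiscal year end.

    Each filing is a dict with "filing_id", "period_end" (a date) and
    "fiscal_year_end" as a (month, day) tuple. A Feb. 29 fiscal year end
    is not handled by this sketch.
    """
    flagged = []
    for filing in filings:
        month, day = filing["fiscal_year_end"]
        period_end = filing["period_end"]
        # Check adjacent years too, so a period ending Jan. 2 still
        # matches a Dec. 31 fiscal year end.
        gaps = [
            abs((period_end - date(period_end.year + offset, month, day)).days)
            for offset in (-1, 0, 1)
        ]
        if min(gaps) > tolerance_days:
            flagged.append(filing["filing_id"])
    return flagged


# Made-up sample data for illustration.
suspect = flag_misdated_year_ends([
    {"filing_id": "A", "fiscal_year_end": (12, 31), "period_end": date(2013, 12, 31)},
    {"filing_id": "B", "fiscal_year_end": (12, 31), "period_end": date(2013, 6, 30)},
])
```

The tolerance leaves room for companies whose fiscal years end on a particular weekday rather than a fixed calendar date. A check like this catches only month-and-day mismatches, not a filing tagged with the wrong year.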
"A lot of what we're charging for isn't the data," says Alex Rapp, Calcbench's CTO and co-founder. "It's how we structure it, how we store it for you and how we solve a tremendous number of business problems by making data comparable across different companies and industries. That's something the data doesn't do itself."
Nor is it a service that would have been possible without the U.S. government's insistence, starting in 2009, that the financial information in corporate SEC filings must be in XBRL (Extensible Business Reporting Language) -- a freely available and global standard format for exchanging business information. Whether you're downloading quarterly filings directly into spreadsheets or analyzing year-end statements using off-the-shelf software, XBRL converts fragmented financial data into a machine-readable format. And that makes life easier for Calcbench's techies.
But as government agencies inch toward cleaner, more structured data sets, a handful of startups are offering powerful new tools for dealing with older data. "These intermediaries are going to make open data more accessible to other types of companies because the reality is, open data is still pretty messy at the local, state and federal level," says Gurin.