When I was a young programmer at an investment bank, my desk was next to the department of “data integrity,” a small group with the thankless job of making sure that the databases held accurate records of stock transactions. The bank’s computers could process millions of transactions in seconds, but a mistyped key or a missing value could jam the entire assembly line for data.
When things were running smoothly, we would amuse ourselves with philosophical discussions about just what it meant for data to have integrity. At the time, the bank didn’t want insight or truth in its databases; it just wanted the books to balance and the system to hum along. It was almost as if data integrity were an afterthought.
That view has changed. Data integrity, or data quality as the current parlance goes, has become a hot topic in many IT departments. The CEO who used to be impressed by the Web site with forms for customers to fill out is now wondering why the data is such a mess. The marketing group wants real leads backed by real data, not a bit dump filled with inconsistencies and inaccuracies.
A number of software vendors are tackling the problem by offering tools and packages that treat data as more than a pile of bits: They are building sophisticated, logical frameworks for information and tossing around philosophical words such as “ontology” to describe their models for numbers and strings in the database fields. After all, the problems of data quality exist because bits can never be perfect reflections of the underlying information.
Scrubbing data clean
These systems often have a sophisticated gloss but are typically practical tools designed to help an IT shop remove the most glaring and expensive problems. So while the problems may be framed in elevated terms, the solutions generally take the form of plain old if-then-else statements. The systems scrub, or cleanse, the data by applying rules that standardize values and eliminate false duplicates. They might replace all instances of “Bob” with “Robert,” for example, or recognize that all old telephone numbers from Palo Alto, Calif., must now come with a 650 area code.
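A minimal Python sketch shows how plain such rules can be. The nickname table, the field names, and the assumption that an old Palo Alto number carries either a 415 area code or none at all are illustrative only; they are not taken from any vendor's product.

```python
import re

# Hypothetical nickname table; a real rule set would be far larger.
NICKNAMES = {"bob": "Robert", "bill": "William", "liz": "Elizabeth"}


def standardize_first_name(name: str) -> str:
    """Replace a known nickname with its formal form; otherwise only trim."""
    cleaned = name.strip()
    return NICKNAMES.get(cleaned.lower(), cleaned)


def fix_area_code(phone: str, city: str) -> str:
    """Apply the Palo Alto rule and return the number as (NNN) NNN-NNNN."""
    digits = re.sub(r"\D", "", phone)  # keep digits only
    in_palo_alto = city.strip().lower() == "palo alto"
    if in_palo_alto and len(digits) == 7:
        digits = "650" + digits  # assumed: no area code on file
    elif in_palo_alto and len(digits) == 10 and digits.startswith("415"):
        digits = "650" + digits[3:]  # assumed: old 415 area code on file
    if len(digits) == 10:
        return f"({digits[:3]}) {digits[3:6]}-{digits[6:]}"
    return phone  # anything unrecognized is left untouched


def scrub(record: dict) -> dict:
    """Return a cleansed copy of the record; the original is not modified."""
    cleaned = dict(record)
    cleaned["first_name"] = standardize_first_name(record["first_name"])
    cleaned["phone"] = fix_area_code(record["phone"], record["city"])
    return cleaned
```

Calling `scrub({"first_name": " Bob ", "phone": "415-555-0100", "city": "Palo Alto"})` returns a record with the first name `Robert` and the phone number `(650) 555-0100`. A number from any other city is only reformatted.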
One of the oldest and most common applications for data quality software is address “cleansing,” the process whereby a company takes a mailing list and ensures that all of the addresses are current, valid, and as complete as possible. Pitney Bowes Group 1 Software helped the U.S. Postal Service develop the technology for parsing and correcting addresses, and now Pitney Bowes is selling it for more general applications. The technology aggregates rules for understanding addresses into a modular application that can recognize errors, correct them, and add the most complete ZIP code. It can distinguish between the two identical abbreviations in “St. Paul’s St.” and understand that “Saint Pauls Street” is the same road.
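The “St.” problem can be illustrated with a toy heuristic in Python. This is not how Group 1's software works, which the article does not describe; the function name and the positional rule are assumptions, and the sketch expects only the street-name portion of an address (it would misread “12 Main St. Apt 4”).

```python
import re


def canonical_street(street: str) -> str:
    """Reduce a street name to a comparable form.

    Assumed heuristic: a trailing "st" is the street type, while "st"
    anywhere else in the name stands for "saint".
    """
    # Drop periods and apostrophes (straight or curly), then lowercase.
    cleaned = re.sub(r"[.'’]", "", street).lower()
    tokens = cleaned.split()
    expanded = []
    for i, token in enumerate(tokens):
        if token == "st":
            is_trailing = i == len(tokens) - 1 and i > 0
            token = "street" if is_trailing else "saint"
        expanded.append(token)
    return " ".join(expanded)
```

Under this rule, `canonical_street("St. Paul’s St.")` and `canonical_street("Saint Pauls Street")` both yield `saint pauls street`, so the two spellings compare as equal.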
After early success with cleaning up addresses, Group 1 is now working to open up its tools so that they can help other parts of the enterprise. Navin Sharma, director of product management, explains that one big opportunity is in straightening out customer records, consolidating them when necessary.
Group 1’s latest offering helps the sales force straighten out mistakes: When a new customer record arrives, Sharma explains, “We standardize it, we validate it and complete it. Is this customer already in the master data hub? Do I already have information? If so, I want to synchronize all of my systems with the latest information; otherwise, I want to add him as a new customer.”
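The flow Sharma describes (standardize, validate, complete, then either synchronize or add) can be sketched in Python as follows. The `MasterHub` class, the choice of e-mail address as the match key, and the field names are all hypothetical stand-ins; a real master data hub would use far richer matching.

```python
from dataclasses import dataclass, field


def standardize(raw: dict) -> dict:
    """Collapse stray whitespace and normalize the case of key fields."""
    record = {k: " ".join(str(v).split()) for k, v in raw.items() if v is not None}
    if "name" in record:
        record["name"] = record["name"].title()
    if "email" in record:
        record["email"] = record["email"].lower()
    return record


def validate(record: dict) -> None:
    """Reject records that lack the fields this sketch treats as required."""
    if not record.get("name") or "@" not in record.get("email", ""):
        raise ValueError(f"invalid customer record: {record!r}")


@dataclass
class MasterHub:
    # Hypothetical in-memory hub, keyed by e-mail address (an assumption).
    records: dict = field(default_factory=dict)

    def upsert(self, raw: dict) -> str:
        record = standardize(raw)
        validate(record)
        key = record["email"]
        if key in self.records:
            # Known customer: merge in non-empty values only, so newer
            # information completes the stored record without erasing it.
            self.records[key].update({k: v for k, v in record.items() if v})
            return "synchronized"
        self.records[key] = record
        return "added"
```

A first call with a new e-mail address returns `"added"`; a later call with the same address in different case or spacing returns `"synchronized"` and fills in any fields that were previously empty.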
Such cleansing processes can be complicated. Jeff Jonas, chief scientist at IBM’s Entity Analytics Solutions, says, “There are some risks if one overcleans the data — especially if trying to decide which incorrect values can be discarded — because you may end up dropping useful data.”
IBM avoids throwing out any data by treating each judgment about which values are “clean” as a best guess, not a permanent decision. Jonas explains: “Sometimes one learns something later that requires one to rethink an earlier decision, e.g., maybe the bad data turns out to be an essential data point like a person’s new nickname.”
Business makes the call
Gathering the input needed to decide what is correct, or clean, is getting easier, because many of the new products have simple user interfaces that enable everyone in the enterprise to pitch in, a process that takes the weight off the shoulders of the IT department. Karen Hsu, principal product manager for data quality at Informatica, says her company is working to open up its tools to people at all levels of the corporation.
“What we’ve heard from the customers is, ‘I’m constantly asked to look into why a customer name isn’t correct and that isn’t my expertise,’” Hsu says. “So we’ve let the business take on the responsibility. Those types of rules are things that the business can create and monitor on an ongoing basis. If there was a missing part, they would be notified by a dashboard rather than waiting for IT to do it.”
Informatica’s latest offering, like many in the space, includes a visual programming language for creating rules and workflows that cleanse data. It makes it easier for nonprogrammers to add rules and tweak the existing ones to cope with changing business conditions.
IBM has its own data quality solutions, WebSphere Product Center and Customer Center, which are designed to help customers create a single, correct version of the truth so that data can be used in a variety of applications without inconsistencies.
The structure and role of such tools are changing rapidly. The original tools were designed to work in the background to remove inaccuracies by parsing information, applying rules, and matching disparate sources. New versions from many vendors work within a service-oriented architecture and provide answers immediately, which allows developers to catch ambiguities or inaccuracies before they enter the database.
The vendors are also building dashboards that flag problems and let managers drill down into the data set to examine them. One of the biggest new applications for such tools is regulatory compliance. Software to ensure data quality can reduce workloads and prevent companies from inadvertently ignoring the law.
Kathleen Hondru, vice president of marketing at Innovative Systems, says her company is helping clients in banks and insurance companies scrutinize their client lists and look for matches against government watch lists. The company’s matching engine can screen against all of the possible variations on a name and associate all of the potential “aliases” with the original record.
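Alias-aware screening of this kind can be approximated with Python's standard library. The sketch below is not Innovative Systems' matching engine, whose internals the article does not describe; the watch-list entries, the 0.85 threshold, and the use of `difflib.SequenceMatcher` as the similarity measure are all assumptions made for illustration.

```python
import unicodedata
from difflib import SequenceMatcher

# Hypothetical watch list: each entry ID maps to its known aliases.
WATCH_LIST = {
    "W-001": ["Jonathan Q. Example", "Jon Example", "J. Q. Example"],
    "W-002": ["María Sample"],
}


def normalize(name: str) -> str:
    """Strip accents and punctuation, lowercase, and sort the name tokens."""
    decomposed = unicodedata.normalize("NFKD", name)
    no_accents = "".join(c for c in decomposed if not unicodedata.combining(c))
    spaced = "".join(c if c.isalnum() else " " for c in no_accents.lower())
    # Sorting makes "Example, Jon" and "Jon Example" compare as equal.
    return " ".join(sorted(spaced.split()))


def screen(client_name: str, watch_list: dict, threshold: float = 0.85) -> list:
    """Return (entry_id, score) pairs whose best alias meets the threshold."""
    candidate = normalize(client_name)
    hits = []
    for entry_id, aliases in watch_list.items():
        best = max(
            SequenceMatcher(None, candidate, normalize(alias)).ratio()
            for alias in aliases
        )
        if best >= threshold:
            hits.append((entry_id, round(best, 2)))
    return sorted(hits, key=lambda hit: -hit[1])
```

Because every alias is tied to one entry ID, a hit on “EXAMPLE, Jon” is reported against the original record `W-001` rather than against the alias itself. In practice the threshold would need tuning, since a low value floods reviewers with false positives and a high one misses variants.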
This application is a good example of how a number of tool vendors offer systems that do more sophisticated matching operations than can be easily accomplished with traditional relational databases. The tools preprocess the information and ensure that the matching is faster, simpler, and more consistent.
These applications of different kinds of computer science research show that the domain is just beginning to enter the mainstream of the IT world. In the past, IT managers talked about generating reports, but now they ask whether data cleansing can help them produce more accurate ones. The compliance officers who once asked for simple tracking and alarm bells are now wondering whether better tools can provide more comprehensive oversight.
The future of quality
Better tools for a variety of data quality applications are in the works. Theresa DeRycke is a so-called data therapist for CRMfusion, a company that specializes in data quality solutions for on-demand CRM, including its DemandTools offering for Salesforce. “Once the data is cleaned up, then you have to think about maintaining it,” she says. “I think the next hot topic is execution of the data — territory management. Now that we have all the data in, cleaned, and a way to keep it clean, how do we divvy it up?”
One company, Silver Creek Systems, is taking automation of data matching to the next level with semantic technology. Its DataLens solution separates such complex data as product information into content groups, standardizes it, and creates taxonomies in a manner that minimizes human intervention.
It’s important to note, however, that humans can never be taken out of the equation. Contradictory or incomplete data strewn around the enterprise in various databases and formats is the ugliest problem in IT. Reconciling and normalizing all that data is hard, tedious work. There’s no silver bullet, but new solutions are going a long way toward enabling enterprises to create a single version of the truth without driving IT insane.