When I was a young programmer at an investment bank, my desk was next to the department of “data integrity,” a small group with the thankless job of making sure that the databases held accurate records of stock transactions. The bank’s computers could process millions of transactions in seconds, but a mistyped key or a missing value could jam the entire assembly line for data.
When things were running smoothly, we would amuse ourselves with philosophical discussions about just what it meant for data to have integrity. At the time, the bank didn’t want insight or truth in its databases; it just wanted the books to balance and the system to hum along. It was almost as if data integrity were an afterthought.
That view has changed. Data integrity, or data quality as the current parlance goes, has become a hot topic in many IT departments. The CEO who used to be impressed by the Web site with forms for customers to fill out is now wondering why the data is such a mess. The marketing group wants real leads backed by real data, not a bit dump filled with inconsistencies and inaccuracies.
A number of software vendors are tackling the problem by offering tools and packages that treat data as more than a pile of bits: They are building sophisticated, logical frameworks for information and tossing around philosophical words such as "ontology" to describe their models for numbers and strings in the database fields. After all, the problems of data quality exist because bits can never be perfect reflections of the underlying information.
Scrubbing data clean
These systems often have a sophisticated gloss but are typically practical tools designed to help an IT shop remove the most glaring and expensive problems. So while the problems may be framed in elevated terms, the solutions generally take the form of plain old if-then-else statements. The systems scrub, or cleanse, the data by applying rules that remove all possibilities for false duplication. They might replace all instances of “Bob” with “Robert,” for example, or recognize that all old telephone numbers from Palo Alto, Calif., must now come with a 650 area code.
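To make the "plain old if-then-else" point concrete, here is a minimal sketch of rule-based scrubbing in Python. It is not taken from any vendor's product. The record layout, the `NICKNAMES` and `AREA_CODE_RULES` tables, and the `scrub_record` function are all hypothetical names chosen for illustration, and the 415-to-650 rule simply mirrors the Palo Alto example above.

```python
import re

# Hypothetical lookup tables; a real tool would ship far larger ones.
NICKNAMES = {"bob": "Robert", "bill": "William", "liz": "Elizabeth"}

# Illustrative rule: (city, state) -> (old area code, current area code).
AREA_CODE_RULES = {("palo alto", "CA"): ("415", "650")}


def scrub_record(record):
    """Return a cleaned copy of a customer record (a plain dict)."""
    cleaned = dict(record)

    # Rule 1: expand known nicknames to a canonical first name.
    first = cleaned.get("first_name", "").strip()
    cleaned["first_name"] = NICKNAMES.get(first.lower(), first)

    # Rule 2: keep only digits, then repair the area code by city.
    digits = re.sub(r"\D", "", cleaned.get("phone", ""))
    city = cleaned.get("city", "").strip().lower()
    state = cleaned.get("state", "").strip().upper()
    rule = AREA_CODE_RULES.get((city, state))
    if rule is not None:
        old_code, new_code = rule
        if len(digits) == 7:
            # No area code on file: add the current one.
            digits = new_code + digits
        elif len(digits) == 10 and digits[:3] == old_code:
            # Stale area code on file: swap it for the current one.
            digits = new_code + digits[3:]

    # Rule 3: store ten-digit numbers in one consistent format.
    if len(digits) == 10:
        cleaned["phone"] = f"{digits[:3]}-{digits[3:6]}-{digits[6:]}"

    return cleaned


a = scrub_record({"first_name": " Bob ", "phone": "(415) 555-0100",
                  "city": "Palo Alto", "state": "ca"})
b = scrub_record({"first_name": "Robert", "phone": "555 0100",
                  "city": "palo alto", "state": "CA"})
print(a["first_name"], a["phone"])
print(a["first_name"] == b["first_name"] and a["phone"] == b["phone"])
```

Once both records have passed through the same rules, a simple equality check on name and phone is enough to flag them as likely duplicates, which is the practical payoff of scrubbing.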
One of the oldest and most common applications for data quality software is address "cleansing," the process whereby a company takes a mailing list and ensures that all of the addresses are current, valid, and as complete as possible. Pitney Bowes Group 1 Software helped the U.S. Postal Service develop the technology for parsing and correcting addresses, and Pitney Bowes now sells it for more general applications. The technology aggregates rules for understanding addresses into a modular application that can recognize errors, correct them, and add the most complete ZIP code. It can distinguish between the two identical abbreviations in "St. Paul’s St." and understand that "Saint Pauls Street" is the same road.
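The "St." example can be sketched with a single positional rule. This is a toy illustration in Python, not the Pitney Bowes or Postal Service method, and `normalize_street` is a hypothetical name. It assumes that a trailing "St" is a street suffix and that any earlier "St" means "Saint," which is far cruder than what a production address parser would do.

```python
import re


def normalize_street(name):
    """Reduce a street name to a canonical uppercase form."""
    # Drop periods and apostrophes (straight or curly), then split on spaces.
    tokens = re.sub(r"['’.]", "", name).upper().split()
    result = []
    for i, token in enumerate(tokens):
        if token == "ST":
            # Assumed rule: a trailing "ST" is the suffix "STREET";
            # anywhere else it is read as "SAINT".
            is_suffix = i == len(tokens) - 1 and i > 0
            result.append("STREET" if is_suffix else "SAINT")
        else:
            result.append(token)
    return " ".join(result)


print(normalize_street("St. Paul’s St."))
print(normalize_street("Saint Pauls Street"))
```

Both calls yield the same canonical string, so the two spellings can be matched as one road. Real systems layer many such rules and check the result against a reference database of valid addresses.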
Peter Wayner is contributing editor of the InfoWorld Test Center.