InfoWorld

The perils of dirty data

How important is data cleansing and validation? Read these tales of horror, and beware


Jigsaw.com, an online contacts database geared toward sales professionals, takes a Wiki-style approach to data cleansing. Its 335,000 members earn points for uploading their own contacts to Jigsaw and for correcting others'. Every record must be complete, and if Jigsaw users enter information that's incorrect or old, they lose points. Members spend their points by buying contact information for people they want to reach.

Jigsaw CEO Jim Fowler says an Atlanta-based technology company recently asked his firm to compare its contacts databases to Jigsaw's and weed out the bad data.

“They had 40,000 records,” he says. “Only 65 percent of them were current and 100 percent were incomplete. We're finding that most of our corporate customers have sets of data so cruddy no one can match to them. Corporations spend millions on CRM, and it's amazing how bad that data is.”

The real value is not the data itself, but the ability to keep up with how quickly it changes.

“The power of Jigsaw is complete data and self-cleansing,” says Fowler. “If our self-correcting mechanisms don't work, we're just another crappy data company.”

5. The war on error
The difference between good data and bad can be as small as a single dot. Penny Quirk, principal consulting manager at Robbins-Gioia's Records and Information Optimization Practice Area, says she once consulted on a major data integration project where everything seemed to go fine. Six months later someone opened a data table and found rows of symbols but no data.

“It was a character coding error,” says Quirk. “They used ellipses in some fields, and wherever someone had entered two dots instead of three it triggered the whole line of data to go corrupt.”

The company had to painstakingly re-create the database from a backup, searching for the ellipses, then replacing them with the actual data.
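A pre-load scan can catch this kind of placeholder slip before it reaches a production table. What follows is only a minimal sketch in Python: the article does not name the database or encoding involved, so the row layout, the field names, and the assumption that "..." is the only legitimate placeholder are all hypothetical.

```python
import re

# Matches exactly two consecutive dots that are not part of a longer run.
# Assumption: "..." is the only legitimate placeholder in the data.
TWO_DOTS = re.compile(r"(?<!\.)\.\.(?!\.)")


def find_malformed_ellipses(rows):
    """Return (row_index, field_name) pairs for text fields holding a two-dot run."""
    problems = []
    for index, row in enumerate(rows):
        for field, value in row.items():
            # Only text fields can carry the placeholder; skip numbers and None.
            if isinstance(value, str) and TWO_DOTS.search(value):
                problems.append((index, field))
    return problems


# Hypothetical sample rows, for illustration only.
sample_rows = [
    {"name": "Acme", "note": "..."},    # well-formed placeholder
    {"name": "Beta..", "note": "ok"},   # two dots: flagged
    {"name": "Gamma", "note": "v1.2"},  # a single dot is fine
]

print(find_malformed_ellipses(sample_rows))
```

Flagged rows can then be held back for review instead of being loaded and discovered months later.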

More often than not, the problem is more than mere data entry errors or garbage in/garbage out. Most organizations fail to adequately plan when moving data between different operating systems or upgrading from older versions of SQL, says Quirk. They'll do it too quickly, using whatever resources are available now with the hope of cleaning it up later. (A bad idea, she adds.) Worse, their test environments and production environments may not match, or they may test using a small subset of data, only to have big problems arise later with the data they didn't test.

“Organizations making dramatic changes in technology without putting forth the necessary time and effort to manage the data reconciliation, integration, and conversions can become victims of bad data,” Quirk says. “As data is moved from one source to another, the number of chances for it to become bad is astronomical.”

Quirk's advice? Don't expect IT departments to validate your data set. Get the power users who work with the data to help plan and test the integration. Before you decide to consolidate, look at all your data fields and identify the applications that may be pulling data from them. When possible, test with all your data, not just a subset, because even the tiniest errors can send you and your data into a world of pain.
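Testing with all the data can be as simple as reconciling every record in the source against the migrated copy. The Python sketch below is one way to do that, not Quirk's method: the function name, the record layout, and the "id" key are assumptions made for illustration, and a real migration would read both sides from the actual systems.

```python
def reconcile(source_rows, target_rows, key):
    """Compare every source record with its migrated copy, matched on `key`."""
    source = {row[key]: row for row in source_rows}
    target = {row[key]: row for row in target_rows}
    shared = source.keys() & target.keys()
    return {
        # Records that never arrived in the target.
        "missing": sorted(source.keys() - target.keys()),
        # Records in the target that have no source counterpart.
        "unexpected": sorted(target.keys() - source.keys()),
        # Records present on both sides whose contents differ.
        "changed": sorted(k for k in shared if source[k] != target[k]),
    }


# Hypothetical before-and-after data, for illustration only.
before = [
    {"id": 1, "name": "Ann"},
    {"id": 2, "name": "Bob"},
    {"id": 3, "name": "Cy"},
]
after = [
    {"id": 1, "name": "Ann"},
    {"id": 2, "name": "B?b"},  # content damaged in transit
    {"id": 4, "name": "Dee"},  # appeared from nowhere
]

report = reconcile(before, after, key="id")
print(report)
```

Because the comparison covers every record, a problem confined to a handful of rows still shows up, which is exactly what a test on a small subset can miss.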

One final horror story illustrates just how big a small error can become.

Peter Teuten, president and CTO of Keane Business Risk Management Solutions, tells of a client that created an application to determine whether corrupt files were circulating in its systems. If the amount of corrupted data exceeded a certain threshold, the company would know to implement data protection processes.

The problem? They accidentally inverted the rule set for the data protection system; the more corrupt data it found, the better their systems appeared in the reports.

“The network was eventually infiltrated by a worm, which corrupted their files,” says Teuten. “They had to rebuild most of them from scratch, which cost them millions of dollars. All from a very simple configuration and management error -- two numbers were reversed.”
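An inverted rule like this is cheap to catch with a direction check at the two extremes: no corruption should never raise the alarm, and total corruption always should. The Python sketch below illustrates the idea under stated assumptions. The 5 percent threshold, the function names, and the shape of the rule are hypothetical, since the article does not describe the client's actual configuration.

```python
def needs_protection(corrupt_files, total_files, threshold=0.05):
    """Return True when the share of corrupt files exceeds the threshold.

    The 5 percent default is an assumed value, not one from the article.
    """
    if total_files <= 0:
        raise ValueError("total_files must be positive")
    return corrupt_files / total_files > threshold


def rule_points_the_right_way(rule):
    """Check a rule at both extremes: clean data and fully corrupt data."""
    return rule(0, 100) is False and rule(100, 100) is True


def inverted_rule(corrupt_files, total_files, threshold=0.05):
    """A deliberately reversed comparison, like the one in the story."""
    return corrupt_files / total_files < threshold


print(rule_points_the_right_way(needs_protection))
print(rule_points_the_right_way(inverted_rule))
```

Running a check like this whenever the rule set changes would flag a reversed comparison immediately, long before the reports start flattering a compromised system.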

If that doesn't scare you into approaching your next data management project with caution, nothing will.

Dan Tynan is contributing editor at InfoWorld.

Copyright © 2008. All rights reserved.