Make sure your data is housebroken

You'd never come home with a dirty, pest-infested, untrained puppy you picked up at random on the street. So why would you bring unclean, ungoverned data of dubious origin into your systems?

Chien errant - CC BY-SA 3.0

Credit: Lucignolobrescia

This data you are trying to bring in looks pretty nice, but where did you pick it up? Are you certain it's safe to import into our systems? Until you can tell me where the data came from and vouch for its cleanliness, you are leaving it outside the door! I don't want the data we have gone to great lengths to assess, cleanse and enrich to get corrupted by this new data puppy of yours! It certainly looks cute, but has it received the proper immunizations? Is it data warehouse-broken? And when was the last time it saw a data vet?

You really don't want to bring just any data, from just anywhere, into your systems. In the not-too-distant past, the equation was fairly straightforward. Most data was produced internally by your transactions, on your systems. Some data would be provided by trading partners or purchased from data providers, but the acquisition process would be properly designed. A contract with service level agreements would be drawn up, guaranteeing a proper level of quality and holding you harmless from infringement, privacy violations and other difficulties caused by improper data collection.

We now live in a digitalized world. More and more, all kinds of data are available for anyone to grab. Whether through calls to public APIs or via screen scraping, it is extremely easy to harvest all kinds of data. But you have no control over the origin of this data, its reliability or its accuracy.

All this readily available and easily harvestable data creates new challenges linked to governance:

  • Origin: data is not unusable just because its origin is unknown. There are certainly cases where it is of better quality than your own data. However, you have to assume that it may be bad until it has been proven otherwise. Just leave that data in the front yard, or in the mudroom, until you have confirmed that it meets your standards.
  • Reliability and accuracy: if you are going to base mission-critical business processes on this data, you need to confirm that it is fit for purpose. This can be done by checking samples, or by executing test runs of these processes and comparing the outcome with other predictions or with actuals.
  • Liability: the press is filled with examples of data theft and privacy violations linked to improper use of data. If you are bringing harvested data into your systems, ensure that this data was collected appropriately, with the proper levels of consent.
  • Control: simply put, if it's not your data, you have no power to control or improve how it is governed. You have to rely on a third party, or a set of third parties, to govern the data properly. Or you have to assume that the data is not governed, and use it as such.
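
For readers who like to see the mudroom in code, here is a minimal Python sketch of how the four checks above might be applied before a data set is let into the house. Everything in it is an assumption made for illustration: the `Dataset` fields, the `assess` function, the use of mean absolute error as the comparison with actuals, and the `max_error` threshold are hypothetical and would need to be replaced by your own standards. Unknown origin and missing governance are treated as warnings rather than blockers, since such data may still be usable once it has proven itself.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Dataset:
    """An external data set waiting in the mudroom (hypothetical structure)."""
    name: str
    source: Optional[str]          # where the data was picked up; None if unknown
    consent_documented: bool       # was it collected with the proper consent?
    governed_by: Optional[str]     # third party governing the data, if any
    predictions: List[float] = field(default_factory=list)  # test-run outcomes
    actuals: List[float] = field(default_factory=list)      # known real outcomes


def mean_absolute_error(predicted, actual):
    """Average absolute gap between test-run outcomes and actuals."""
    if not predicted or len(predicted) != len(actual):
        raise ValueError("need two non-empty samples of equal length")
    return sum(abs(p - a) for p, a in zip(predicted, actual)) / len(predicted)


def assess(ds, max_error=5.0):
    """Return (blockers, warnings); the data stays outside while blockers exist."""
    blockers, warnings = [], []
    # Origin: unknown origin is not disqualifying, but the data is suspect.
    if ds.source is None:
        warnings.append("origin unknown: treat as suspect until checks pass")
    # Control: with no governing party, the data must be used as ungoverned.
    if ds.governed_by is None:
        warnings.append("no known governing party: use as ungoverned data")
    # Liability: no evidence of consent means the data does not come in.
    if not ds.consent_documented:
        blockers.append("liability: no evidence of proper consent")
    # Reliability and accuracy: compare a test run with actuals.
    if not ds.predictions:
        blockers.append("reliability: no test run to compare with actuals")
    elif mean_absolute_error(ds.predictions, ds.actuals) > max_error:
        blockers.append("reliability: test run deviates too much from actuals")
    return blockers, warnings


# Usage: a scraped data set of unknown origin that behaves well in a test run.
puppy = Dataset(
    name="scraped_prices",
    source=None,
    consent_documented=True,
    governed_by=None,
    predictions=[100.0, 102.0, 98.0],
    actuals=[101.0, 100.0, 99.0],
)
blockers, warnings = assess(puppy)
print("admit" if not blockers else "keep outside", blockers, warnings)
```

The point of separating blockers from warnings is that the decision to admit the data and the caveats attached to using it are two different things, and both deserve to be recorded.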

So before you bring this cool data puppy home and let it become part of your household, make sure it's not going to cause too much trouble!

This article is published as part of the IDG Contributor Network.