Data migration is hard to do

Data munging and data migration come with big challenges -- and big rewards


We might strip the non-digit characters with a regex such as /[^0-9]+/, then split the remaining digits apart with a regex like /([0-9]{3})([0-9]{3})([0-9]{4})/, with the capture groups holding 212, 555, and 1212. We can now reformat the phone number however we like. We should also bail out if we encounter a number that cannot be a phone number because it has too many or too few digits.
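As a minimal sketch of that approach in Python (the article names no language, and `normalize_phone` is a hypothetical helper, assuming 10-digit US numbers):

```python
import re

def normalize_phone(raw):
    """Strip non-digit characters and reformat a 10-digit US phone
    number. Returns None when the digit count rules the number out."""
    digits = re.sub(r'[^0-9]+', '', raw)
    match = re.fullmatch(r'([0-9]{3})([0-9]{3})([0-9]{4})', digits)
    if match is None:  # too many or too few digits: bail out
        return None
    area, exchange, line = match.groups()
    return f'({area}) {exchange}-{line}'
```

With that in place, "212.555.1212", "(212) 555 1212", and "2125551212" all come out as "(212) 555-1212", while "555-1212" is rejected.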

The free-form free-for-all

It gets dicier as we move into more free-form fields. Addresses are particularly tricky since they can be formatted in a wide variety of ways. We also need to deal with the vagaries of street and city names. We need to make sure that we handle "Washington,DC," "Washington, DC," and "Washington DC" properly, along with oddities such as "Winston-Salem, NC," "King of Prussia, PA," "Scranton, Penn.," "N. Providence RI," "Houston, tx," and "O'Fallon, IL."

Those sorts of variations can trip up parsers if they're not accounted for because we can't strip special characters. Additionally, we can't count on a set number of spaces separating the city from the state, or on the state abbreviation being present or capitalized. Thus, we need to construct conditional logic to push each string through and make a best-effort determination of the actual city and state, perhaps even checked against a database consisting of every city and state in the United States. Depending on the result, we may still need to bail out on a record if no definitive determination can be made, or at least throw a flag that the record in question needs to be manually checked.
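A best-effort sketch of that conditional logic might look like the following Python. The function name and the tiny lookup tables are hypothetical stand-ins; a real job would check against a full database of US cities and states:

```python
import re

# Hypothetical lookup tables -- a real migration would load the full
# set of US state codes and a city/state database.
STATES = {'DC', 'NC', 'PA', 'RI', 'TX', 'IL'}
STATE_NAMES = {'PENN.': 'PA', 'PENNSYLVANIA': 'PA'}

def parse_city_state(raw):
    """Best-effort split of a free-form 'City, ST' string.
    Returns (city, state) or None to flag the record for manual review."""
    # Normalize comma spacing: "Washington,DC" -> "Washington, DC"
    text = re.sub(r'\s*,\s*', ', ', raw.strip())
    # Split on the last comma if one is present, else on the last space,
    # so multiword cities like "King of Prussia" stay intact
    if ', ' in text:
        city, _, state = text.rpartition(', ')
    else:
        city, _, state = text.rpartition(' ')
    state = state.upper()
    state = STATE_NAMES.get(state, state)  # "Penn." -> "PA"
    if city and state in STATES:
        return city, state
    return None  # no definitive determination: bail out or flag it
```

This handles "Washington,DC," "Houston, tx," "Scranton, Penn.," and "N. Providence RI" alike, and returns None for anything it can't confidently classify.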

We've only begun to scratch the surface. We have a significant amount of work invested in figuring out just the city, state, and phone number of each record. We need to rinse and repeat for every other field in the spreadsheet, depending on the content.

This mess is a direct result of unconstrained free-form data entry, and it plagues every company everywhere. It doesn't have to be Excel, either. It could be Access, a homegrown database, or any other application. Unless there are checks against the validity and formatting of data upon input, the data will likely be a mess. Of course, that's the point of building a proper database front end to handle data input: We can clean and groom the data on the way in, which greatly enhances the accuracy and usability of that data on an ongoing basis. That's one of the major benefits of using databases in the first place.
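An input-time check can be as small as a single gatekeeper function. This hypothetical example validates a ZIP code field before it ever reaches storage, so there is nothing to clean up later:

```python
import re

def validate_zip(value):
    """Accept 5-digit or ZIP+4 codes at data-entry time;
    reject everything else before it reaches the database."""
    return bool(re.fullmatch(r'[0-9]{5}(-[0-9]{4})?', value.strip()))
```

Rejecting "1940" or "ABC" at the form is trivial; undoing thousands of such entries after the fact is the work this article describes.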

However, we cannot discount the effort involved in postprocessing these types of data sets. All kinds of tools have been developed to ease this process, but they will not apply to every case. While they may work for a portion of the input data, the parts they miss may make the tools more trouble than they're worth.

Work of this nature is tedious and exceptionally detail oriented. It requires lots of manual data inspection, test runs, debugging, and forward thinking on the part of the developer working on the project. When all is said and done, the results are almost guaranteed to be worth the effort.

Working with clean data makes everything easier. Just don't underestimate how challenging the journey to clean data may be.

This story, "Data migration is hard to do," was originally published at InfoWorld.com. Read more of Paul Venezia's The Deep End blog at InfoWorld.com. For the latest business technology news, follow InfoWorld.com on Twitter.
