DATA PARSING Gets at the Roots of Your Legacy Data

 

Get at the roots of your data conversion problems by designing a custom data parsing algorithm. This is an iterative process which requires a knowledgeable programmer and a user who is well versed in the actual data, what is looks like, its anomalies and how they came about. The business decisions that are required depart from the typical standard applied to programming work because the data to be parsed is generally passed from one system to another only once. That being the case, some parsing will be manual. It being the least costly solution.

 

A successful data migration project involves transforming data from the format and data integrity rules (or lack of rules) of the source system into a different format and a different set of rules, those that makeup the data integrity of your new system.  The process of evaluating, standardizing and interpreting the source data so that it can be properly reformatted and stored in the target system is sometimes referred to as "parsing".  All data in the source system must be looked at before migrating it to the new system. Nothing should be taken for granted regardless of how well defined the data entry procedures are. Parsing will be required for many of those data elements.

 

Perhaps the most global data items requiring the use of a parsing algorithm are name and address information.  Name and address parsing is based on the concept that name and address information is comprised of numerous components that have common, identifiable characteristics.  Although the process is not infallible, a high degree of success can be achieved in parsing out names and addresses so they can be successfully reformatted for use in a system with different formatting requirements than the system which originally captured the data. One of the most common problems that parsing is less than perfect in resolving is inconsistent data entry.

 

A simplified overview of parsing name and address information:

A name and address block is made up of three main components-name lines, address lines and city/state/zip code lines.  Any of these may occur multiple times or may be absent.  Each has particular characteristics that can be identified and is made up of its own set of components. Alternatively, any two or all three may be combined into one data field.

 

The name lines will be first and are made up of a name prefix (i.e. Mr., Mrs., Ms, etc.) a first name, middle name, and last name and a name suffix, (i.e. Dr., DDS, etc.).  Compound names can be recognized by key words or characters such as "and" or "&".  Business names can be recognized by keywords such as "Company", "Inc." etc.  To be successful the name parser must take into account such things as misspellings, plural forms, hyphenated names and abbreviations.  To allow flexibility for the unique characteristics of a particular region or business, the identification of these components is built into tables.  The parser must also have options to deal with names stored in reverse order, i.e. "Smith, John" instead of "John Smith" and, in the case of last name first, must allow for various methods of denoting the last name termination, i.e. "," or "#" or ";".

 

The components of the city lines are city, state, zip and country.  City lines are generally recognizable by their position (last), the presence of a recognizable state name or abbreviation and the presence of a 5 or 9 digit number (zip code).  Allowances have to be made to deal with foreign countries, absence of zip codes, state name misspellings, and other data entry errors.

 

Address lines can generally be recognized by their position (between the name and city lines), and the presence of numeric values, key words and abbreviations (street, avenue, box, etc.).  The components of the address line are the most complicated and include items such as street number, street name, street directional, street type, etc.

 

Once the address parsing algorithm has properly identified the address components, the individual parts can then be reassembled in the form required by the target system.  Specific components can also be standardized, if desired, by using standard abbreviations and by correcting misspellings.  These options often accomplish a significant portion of "scrubbing" of data that is would otherwise have to be done manually.

 

The use of translation tables also play a big part in correcting data entry errors or inconsistencies and changing data items that are not incorrect but do not conform to the data integrity rules of the new system. An example might be changing abbreviated insurance company names where no specific rules were followed and the data does not easily conform to a programmable solution.

 

Summary:

This explanation has focused on the name and address component of data, however, the same general concepts apply to any other data element that requires transformation between systems.

 

If you outsource the data conversion it is important to know that extensive development of a good parsing tool only comes over time and many conversions.

 

In some cases the parsing tool can be used to locate data elements that have not been correlated in the data base but the correlation exists in transaction items or through some other set of conditions and data.

 

An example:

By extracting and correlating part numbers and their associated part names, we electronically built a translation table which we then used in the conversion. This could have been done manually but the example cited involved over a million transactions and several hundred part numbers and their associated part names. It may seem unreasonable that a system would be developed and put into production where something as simple as tying the part name and part number together in the data base, was actually missing. But the conversion was done in our office and we did have to build the translation table as described.

 

This is only one of many examples from my experience of nearly 2000 data conversions. I am sure the reader has at least one they know of and probably more, not counting the ones the "got away".

 

A well thought out and well planned data conversion which incorporates the use of good tools, a talented programmer and an expert user, can save many hours before go-live and even more hours after the system is in use.