DATA PARSING Gets
at the Roots of Your Legacy Data
Get at the roots of your data conversion
problems by designing a custom data parsing algorithm. This is an iterative process which requires a
knowledgeable programmer and a user who is well versed in the actual data, what is looks like, its anomalies and how they came
about. The business decisions
that are required depart from the typical standard applied to programming work
because the data to be parsed is
generally passed from one system to another only once. That being the case,
some parsing will be manual. It being the least costly
solution.
A successful data migration project involves
transforming data from the format and data integrity rules (or lack of rules)
of the source system into a different format and a different set of rules,
those that makeup the data integrity of your new system. The process of evaluating, standardizing and
interpreting the source data so that it can be properly reformatted and stored
in the target system is
sometimes referred to as "parsing".
All data in the source system must be looked at before migrating it to
the new system. Nothing should be taken for granted regardless of how well
defined the data entry procedures are. Parsing will be required for many of
those data elements.
Perhaps the most global data items requiring
the use of a parsing algorithm are name and address information. Name and address parsing is based on the concept that name and address
information is comprised of numerous components that have common,
identifiable characteristics. Although the process is not infallible, a high degree of success can
be achieved in parsing out names and addresses so they can be successfully
reformatted for use in a system with different formatting requirements than the
system which originally captured the data. One of the most common problems that
parsing is less than perfect in resolving is inconsistent data entry.
A simplified overview of parsing name and
address information:
A name and address block is made up of three main components-name lines,
address lines and city/state/zip code lines.
Any of these may occur multiple times or may be absent. Each has particular characteristics that can be identified and is made up of its own set of components.
Alternatively, any two or all three may be combined into one data field.
The name lines will be first and are made up
of a name prefix (i.e. Mr., Mrs., Ms, etc.) a first name, middle name, and last
name and a name suffix, (i.e. Dr., DDS, etc.).
Compound names can be recognized by key words or characters such as
"and" or "&".
Business names can be recognized by keywords such as
"Company", "Inc." etc.
To be successful the name parser must take into account such things as misspellings, plural forms, hyphenated names and
abbreviations. To allow flexibility for
the unique characteristics
of a particular region or business, the identification of these components is built into tables. The parser must also have options to deal
with names stored in reverse order, i.e. "Smith, John" instead of
"John Smith" and, in the case of last name first, must allow for
various methods of denoting the last name termination, i.e. "," or
"#" or ";".
The components of the city lines are city,
state, zip and country. City lines are
generally recognizable by their position (last), the presence of a recognizable
state name or abbreviation and the presence of a 5 or 9 digit number (zip code).
Allowances have to be made to deal with foreign countries, absence of
zip codes, state name misspellings,
and other data entry errors.
Address lines can generally be recognized by
their position (between the name and city lines), and the presence of numeric
values, key words and abbreviations (street, avenue, box, etc.). The components of the address line are the
most complicated and include items such as street number, street name, street
directional, street type, etc.
Once the address parsing algorithm has
properly identified the address components, the individual parts can then be
reassembled in the form required by the target system. Specific components can also be standardized,
if desired, by using standard abbreviations and by correcting
misspellings. These options often accomplish
a significant portion of "scrubbing" of data that is would otherwise
have to be done manually.
The use of translation tables also play a big
part in correcting data entry errors or inconsistencies and changing data items that are not
incorrect but do not conform to the data integrity rules of the new system. An
example might be changing abbreviated insurance company names where no specific
rules were followed and the data does not easily conform to a programmable
solution.
Summary:
This explanation has focused on the name and
address component of data, however, the same general concepts apply to any
other data element that requires transformation between systems.
If you outsource the data conversion it is
important to know that extensive development of a good parsing tool only comes
over time and many conversions.
In some cases the parsing tool can be used to locate data
elements that have not been correlated in the data base but the correlation exists
in transaction items or through some other set of conditions and data.
An example:
By extracting and correlating part numbers and their
associated part names, we electronically built a translation table which we then
used in the conversion. This could have been done
manually but the example cited involved over a million transactions and several
hundred part numbers and their associated part names. It may seem unreasonable
that a system would be developed and put into production where something as
simple as tying the part name and part number together in the data base, was
actually missing. But the conversion was done
in our office and we did have to build the translation table as described.
This is
only one of many examples from my experience of nearly 2000 data conversions. I
am sure the reader has at least one they know of and probably more, not
counting the ones the "got away".
A well thought out and well planned data conversion which
incorporates the use of good tools, a talented programmer and an expert user,
can save many hours before go-live and even more hours after the system is
in use.