
DATA PARSING ALGORITHM

WHITE PAPER

 

The data migration process often involves transforming data from the format and rules of the source system to a different set of rules and formats in the target system.  The process of evaluating, standardizing and interpreting the source data so that it can be properly reformatted and stored in the target system is referred to as "parsing".  Parsing is required for many data elements.

 

Perhaps the most universal data requiring the use of a parsing algorithm is name and address information.  The name and address parsing system developed by Technology Consultants is built on the concept that name and address information is composed of numerous components that have common, identifiable characteristics.  Although the process is not infallible, a high degree of success has been achieved in parsing out names and addresses so they can be reformatted for use in a system whose formatting requirements differ from those of the system that originally captured the data.  The following description is somewhat simplified but provides an overview of the concept.

 

A name and address block is made up of three main components: name lines, address lines and city lines.  Any of these may occur multiple times or may be absent.  Each has particular characteristics that can be identified and is made up of its own set of components.
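
As an illustration only, the first step of splitting a block into its three kinds of lines might be sketched as follows.  This is a minimal sketch, not the Technology Consultants implementation: the function name classify_lines, the keyword tables and the ZIP pattern are all assumptions made for the example, and the clues it uses (position, state abbreviations, zip codes, digits and street keywords) are the ones described in the paragraphs that follow.

```python
import re

# Hypothetical keyword tables; a real system would load these from
# configurable tables tuned to a region or business.
STATE_ABBREVIATIONS = {"CA", "NY", "TX", "WA", "IL"}
STREET_KEYWORDS = {"street", "st", "avenue", "ave", "road", "rd", "box", "blvd"}
ZIP_PATTERN = re.compile(r"\b\d{5}(?:-?\d{4})?\b")


def _words(line):
    """Return lower-case word tokens with trailing punctuation stripped."""
    return [w.strip(".,").lower() for w in line.split()]


def classify_lines(block):
    """Label each non-empty line of a block as 'name', 'address' or 'city'."""
    lines = [ln.strip() for ln in block.splitlines() if ln.strip()]
    labelled = []
    for index, line in enumerate(lines):
        words = _words(line)
        has_state = any(w.upper() in STATE_ABBREVIATIONS for w in words)
        has_zip = bool(ZIP_PATTERN.search(line))
        has_digit = any(ch.isdigit() for ch in line)
        has_street_word = any(w in STREET_KEYWORDS for w in words)
        is_last = index == len(lines) - 1

        if is_last and (has_state or has_zip):
            label = "city"       # last line with a state or zip code
        elif has_digit or has_street_word:
            label = "address"    # numeric values or street keywords
        else:
            label = "name"       # everything else is treated as a name line
        labelled.append((line, label))
    return labelled
```

This sketch only checks the final line for city characteristics and makes no allowance for misspellings or foreign addresses, both of which a production parser would have to handle.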

 

The name lines come first and are made up of a name prefix (e.g. Mr., Mrs., Ms.), a first name, a middle name, a last name and a name suffix (e.g. Jr., DDS).  Compound names can be recognized by key words or characters such as "and" or "&".  Business names can be recognized by keywords such as "Company", "Inc.", etc.  To be successful, the name parser must take into account such things as misspellings, plural forms and abbreviations.  To allow flexibility for the unique characteristics of a particular region or business, the identification of these components is built into tables.  The parser must also have options to deal with names stored in reverse order, e.g. "Smith, John" instead of "John Smith", and allow for various methods of denoting the last name termination, e.g. ",", "#" or ";".
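
A table-driven name parser along these lines could be sketched as below.  The function parse_name, the table contents and the output field names are hypothetical and chosen only for illustration; handling of misspellings and plural forms is left out to keep the example short.

```python
# Hypothetical lookup tables; in practice these would be maintained
# per region or business, as described above.
PREFIXES = {"mr", "mrs", "ms", "dr"}
SUFFIXES = {"jr", "sr", "dds", "md", "ii", "iii"}
BUSINESS_KEYWORDS = {"company", "co", "inc", "corp", "llc"}
COMPOUND_MARKERS = {"and", "&"}


def _key(token):
    """Normalize a token for table lookup."""
    return token.strip(".,").lower()


def parse_name(line, terminators=",#;"):
    """Split one name line into components (illustrative sketch only)."""
    keys = [_key(t) for t in line.split()]
    if any(k in BUSINESS_KEYWORDS for k in keys):
        return {"kind": "business", "name": line.strip()}
    if any(k in COMPOUND_MARKERS for k in keys):
        return {"kind": "compound", "name": line.strip()}

    parsed = dict(kind="person", prefix="", first="", middle="", last="", suffix="")
    rest = line
    # Reverse order ("Smith, John"): the text before the terminator is the
    # last name, unless only a suffix follows it ("John Smith, Jr.").
    for mark in terminators:
        before, found, after = line.partition(mark)
        if found and any(_key(t) not in SUFFIXES for t in after.split()):
            parsed["last"] = before.strip()
            rest = after
            break

    tokens = [t.strip(terminators) for t in rest.split()]
    tokens = [t for t in tokens if t]
    if tokens and _key(tokens[0]) in PREFIXES:
        parsed["prefix"] = tokens.pop(0)
    if tokens and _key(tokens[-1]) in SUFFIXES:
        parsed["suffix"] = tokens.pop()
    if tokens and not parsed["last"]:
        parsed["last"] = tokens.pop()
    if tokens:
        parsed["first"] = tokens.pop(0)
    parsed["middle"] = " ".join(tokens)
    return parsed
```

Keeping the prefixes, suffixes and keywords in tables rather than in the logic means a new region or business can be supported by editing data instead of code.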

 

The components of the city lines are city, state, zip and country.  City lines are generally recognizable by their position (last), the presence of a recognizable state name or abbreviation and the presence of a 5- or 9-digit number (zip code).  Allowances have to be made to deal with foreign countries, absence of zip codes, state name misspellings, etc.
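
The city line logic might look something like the following sketch.  The function parse_city_line and the small STATES table are assumptions made for this example; country detection and misspelled state names are deliberately not handled, so the country field is always left empty here.

```python
import re

# Hypothetical state table mapping names and abbreviations to a standard code.
STATES = {
    "illinois": "IL", "il": "IL",
    "california": "CA", "ca": "CA",
    "new york": "NY", "ny": "NY",
}
# A 5-digit zip code, optionally followed by a 4-digit extension, at line end.
ZIP_RE = re.compile(r"(?<!\d)(\d{5})(?:-?(\d{4}))?\s*$")


def parse_city_line(line):
    """Split a city line into city, state and zip (illustrative sketch only)."""
    parsed = {"city": "", "state": "", "zip": "", "country": ""}
    text = line.strip()

    match = ZIP_RE.search(text)
    if match:
        extension = "-" + match.group(2) if match.group(2) else ""
        parsed["zip"] = match.group(1) + extension
        text = text[:match.start()].strip(" ,")

    lowered = text.lower()
    # Try longer state names first so a full name wins over an abbreviation.
    for name in sorted(STATES, key=len, reverse=True):
        whole_line = len(lowered) == len(name)
        if lowered.endswith(name) and (
            whole_line or not lowered[-len(name) - 1].isalnum()
        ):
            parsed["state"] = STATES[name]
            text = text[:len(text) - len(name)].strip(" ,")
            break

    parsed["city"] = text
    return parsed
```

The zip code is stripped from the end of the line first, which leaves the state as the final word or words and whatever remains as the city.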

 

Address lines can generally be recognized by their position (between the name and city lines), and the presence of numeric values, key words and abbreviations (street, avenue, box, etc.).  The components of the address line are the most complicated and include items such as street number, street name, street directional, street type, etc.

 

Once the address parsing algorithm has properly identified the address components, the individual parts can then be reassembled in the form required by the target system.  Specific components can also be standardized, if desired, by using standard abbreviations and by correcting misspellings.  These options often accomplish a significant portion of the data "scrubbing" that would otherwise have to be done manually.
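
The standardize-and-reassemble step can be sketched as follows, assuming the components have already been identified and stored in a dictionary.  The field names, the standardization tables (including the sample misspellings) and the layout strings are all hypothetical and stand in for whatever the target system requires.

```python
# Hypothetical standardization tables: known variants and misspellings
# map to one standard abbreviation.
STANDARD_TYPES = {
    "street": "ST", "st": "ST", "stret": "ST",
    "avenue": "AVE", "ave": "AVE", "avenu": "AVE",
}
STANDARD_DIRECTIONALS = {
    "north": "N", "n": "N", "south": "S", "s": "S",
    "east": "E", "e": "E", "west": "W", "w": "W",
}
DEFAULT_LAYOUT = "{number} {pre_directional} {name} {type} {post_directional}"


def standardize(components):
    """Return a copy of the components with standard abbreviations applied."""
    cleaned = dict(components)
    street_type = cleaned.get("type", "")
    cleaned["type"] = STANDARD_TYPES.get(street_type.strip(".").lower(), street_type)
    for field in ("pre_directional", "post_directional"):
        value = cleaned.get(field, "")
        cleaned[field] = STANDARD_DIRECTIONALS.get(value.strip(".").lower(), value)
    return cleaned


def reassemble(components, layout=DEFAULT_LAYOUT):
    """Format the components in the order the target system expects."""
    text = layout.format(**components)
    return " ".join(text.split())  # collapse gaps left by empty components


parsed = {"number": "123", "pre_directional": "North", "name": "Main",
          "type": "Stret", "post_directional": ""}
line_for_target = reassemble(standardize(parsed))
```

Because the layout is just a template, the same parsed components can be written out in a different order for a different target system without touching the parsing logic.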

 

This explanation has focused on the name and address component of data; however, the same general concepts apply to any other data element that requires transformation between systems.

 

Extensive development of the Parsing System used by Technology Consultants has produced a tool with many and varied uses, not all of them related to the parsing of data.  For example, we have used our Parsing System to locate "part numbers" buried in a transaction history file.  Once located and linked to a "part name" also in the transaction history file, we built an extensive "part number/name" table that was essential in the conversion but did not exist in the source system.

 

Ongoing development continues to enhance and refine our Parsing System.  Each conversion has available to it all of the best parsing processes developed in the conversions that preceded it.  The value of this tool is incalculable.

 

by K.W. Norris

Technology Consultants, Inc.

2006