One of the problems we frequently encounter in business is that our data is not as clean as we might want it to be. We are plagued by duplicates, fields that are formatted incorrectly or in the wrong place, needed values that are missing, outdated information, or even data that is flat-out wrong.
This becomes an especially prevalent problem for companies that have to deal with other companies' data sets, either through consulting or through mergers, but it can happen anywhere due to user error, software glitches, imported marketing lists, or miscommunication.
In one case, a coworker's weight ended up in the database as “blue,” which was his eye color. In another case, involving the D&B (Dun & Bradstreet) business intelligence services, information would be drawn for a company at one point in time and then fall out of date as D&B updated their records.
In another case, the company was using region codes from several different sources and didn’t have a master key to sort them. To this day I don’t know where “OH” is in Japan.
There are also fake records that come in from various sources. At one company I saw a record for “Doc Hudson from Radiator Springs” and another for “Utopia Planitia, Mars.” These kinds of problems are particularly prevalent for sites that require registration before use.
Cleaning Data is going to be an ongoing series about taking care of these and other related problems.