Many Pies

Many Pies

Wednesday, July 30, 2008

Messy data

I'm in the middle of a project to take data from excel spreadsheets and word documents and get them into a database. As you might expect the data is in a bit of a mess. By mess, I mean that it doesn't obey database rules.

(As an aside I have heard Lotus Notes rejected because it's underlying database wasn't relational. The Lotus Notes format was nearer the way people create data than relational databases - missing values, repeated values.)

For example, one rule of databases is that there is only one copy of a thing. Unless you create a second copy (so that it's handy, but you know you're doing that, and if you need to change it you change it in both places).

In these spreadsheets we have people's names. Sometimes we have their name more than once. But are they the same person? If they are we should only have them in the database once, that's the rule. However we just can't tell.

One thing that makes this harder is that people's names are so slippery. They get abbreviated, they get misspelled. I tweeted today
paulmorriss: is wishing everyone had numbers rather than names. Computers could cope a lot better then.
Two people replied
I am not a number ... I am a free man!
Here's a new quote:
The price of freedom is eternal problems with names


In this particular case I'm trying to match a name in a spreadsheet column with a name included in the filename of a word document. E.g. database has
Joe Bloggs
filename has
Joe Bloggs report.doc

I get a match 70% of the time, which is quite good considering how many ways it could go wrong, e.g. database can have
Josephine (Jo) Bloggs
Joe Bloggs (useful comment)

Filenames can have
Jo report.doc
J. Bloggs report
Joe Blogs report

I could have done all the matching by hand, rather than trying to program it, but given the volume of data that would have been extremely tedious.

No comments: