How to Detect and Deal with Duplicates

In data management, duplicates are two (or more) records that hold different values but describe the same real-world entity. Probably the most common case is two records describing the same person, for example:

  • Bob Smith at 1 Main Str in Anytown
  • Robert Smith at One Main Street in Any Town

Duplicates are a well-known pain point in business scenarios, and organizations across the planet work around the clock to remove it.

Build or Buy?

Some organizations build duplicate detection procedures in-house, and others buy a tool for the job.

Home-grown solutions often rely on publicly available algorithms such as edit distance and Soundex. However, these are small efforts against a huge problem.
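To give a feel for what such a home-grown approach looks like, here is a minimal sketch of the two algorithms mentioned, written in plain Python. The function names are my own, and the Soundex variant is the common American one; real tools typically do far more than this.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance: minimum number of single-character
    insertions, deletions and substitutions turning a into b."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute or keep
        prev = curr
    return prev[-1]


# Consonant groups used by American Soundex.
_CODES = {}
for _letters, _digit in (("BFPV", "1"), ("CGJKQSXZ", "2"), ("DT", "3"),
                         ("L", "4"), ("MN", "5"), ("R", "6")):
    for _ch in _letters:
        _CODES[_ch] = _digit


def soundex(name: str) -> str:
    """Four-character Soundex code, e.g. 'Robert' gives 'R163'."""
    letters = [c for c in name.upper() if "A" <= c <= "Z"]
    if not letters:
        return ""
    result = letters[0]
    last_code = _CODES.get(letters[0], "")
    for ch in letters[1:]:
        if ch in "HW":
            continue  # H and W do not separate equal codes
        code = _CODES.get(ch, "")
        if code and code != last_code:
            result += code
        last_code = code  # a vowel resets the run
    return (result + "000")[:4]
```

Applied to the example above, "Str" and "Street" are three edits apart, while soundex("Bob") and soundex("Robert") give different codes (B100 and R163). Neither algorithm knows that Bob is a nickname for Robert, which is exactly the kind of gap that makes these small efforts against a huge problem.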

When trying to detect duplicates you will run into false positives: results indicating that two (or more) records describe the same real-world entity when in fact they do not, as exemplified in the post Famous False Positives. Moreover, there can be heaps of false negatives: records that do describe the same real-world entity but are not detected by the algorithm.
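If you have a manually reviewed sample of true duplicate pairs, both kinds of error can be counted. The sketch below is only an illustration: the function names and the idea of representing a match as an unordered pair of record IDs are assumptions of mine, not something taken from a specific tool.

```python
def pair(a, b) -> frozenset:
    """Unordered pair of record IDs, so (a, b) equals (b, a)."""
    return frozenset((a, b))


def evaluate(predicted: set, actual: set) -> dict:
    """Compare pairs flagged by the matcher with reviewed true pairs."""
    true_pos = predicted & actual
    false_pos = predicted - actual   # flagged, but not really duplicates
    false_neg = actual - predicted   # real duplicates the matcher missed
    return {
        "false_positives": len(false_pos),
        "false_negatives": len(false_neg),
        # Share of flagged pairs that are real duplicates.
        "precision": len(true_pos) / len(predicted) if predicted else 1.0,
        # Share of real duplicates that were flagged.
        "recall": len(true_pos) / len(actual) if actual else 1.0,
    }
```

Tuning a matcher is usually a trade-off between the two numbers: loosening the rules catches more real duplicates but also flags more pairs that are not.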

Enhanced Approaches

Machine Learning (ML) and Artificial Intelligence (AI) have been used for years in deduplication and the underlying data matching to reduce false positives and false negatives. With the recent rise of ML/AI, this approach has become more common, and today we see data matching tools relying heavily on it.

Another enhanced approach you can find in tools on the market is utilizing external data in the quest for detecting duplicates. This way of overcoming the obstacles is described in the post Using External Data in Data Matching.

A huge number of false negatives stems not only from limitations in comparing records and detecting their similarity, but also from failing to bring the right records up for comparison in the first place. If you have more than a few thousand records in play, you need an initial candidate selection procedure, as pondered in the post Candidate Selection in Deduplication.
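The reason is simple arithmetic: comparing every record with every other record takes n(n-1)/2 comparisons, which is already about 12.5 million for 5,000 records. A common remedy is to group records by a cheap key and only compare within each group. The sketch below assumes a hypothetical key (the first two letters of the surname) purely for illustration; choosing a good key is the hard part.

```python
from collections import defaultdict
from itertools import combinations


def total_pairs(n: int) -> int:
    """Number of comparisons when every record meets every other."""
    return n * (n - 1) // 2


def candidate_pairs(records: dict, key) -> set:
    """Group record IDs by key(record) and pair up IDs within a group."""
    blocks = defaultdict(list)
    for rec_id, rec in records.items():
        blocks[key(rec)].append(rec_id)
    pairs = set()
    for ids in blocks.values():
        pairs.update(combinations(sorted(ids), 2))
    return pairs


# Made-up sample data, not from a real data set.
records = {
    1: {"surname": "Smith"},
    2: {"surname": "Smith"},
    3: {"surname": "Smyth"},
    4: {"surname": "Jones"},
}
candidates = candidate_pairs(records, key=lambda r: r["surname"][:2].upper())
```

Here only the three "SM" records are compared with each other, instead of all six possible pairs. The flip side is that any true duplicate whose key differs never comes up for comparison at all, which is how a poor candidate selection produces false negatives.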

Going All the Way

When you have found your matching records, the next question is what to do with them. The options include forming a golden record and/or placing the results in master data hierarchies. How you can do that is elaborated in the post Deduplication as Part of MDM.
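As a rough illustration of forming a golden record, the sketch below merges a group of matched records field by field. The survivorship rule used here (keep the longest non-empty value) is an assumption chosen only to keep the example short; in practice the rules depend on source trust, recency and the data domain.

```python
def golden_record(cluster: list) -> dict:
    """Merge matched records, keeping the longest non-empty value
    per field (an illustrative rule, not a recommendation)."""
    fields = sorted({field for rec in cluster for field in rec})
    golden = {}
    for field in fields:
        values = [rec[field] for rec in cluster if rec.get(field)]
        golden[field] = max(values, key=len) if values else None
    return golden


cluster = [
    {"name": "Bob Smith", "street": "1 Main Str", "city": "Anytown"},
    {"name": "Robert Smith", "street": "One Main Street", "city": "Any Town"},
]
merged = golden_record(cluster)
```

With the two example records this keeps "Robert Smith" and "One Main Street", but it also keeps "Any Town" over "Anytown" simply because it is longer, which shows why a naive rule is rarely enough.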

All in all, it is not advisable to do all this at home (in-house in an organization), as there are tools on the market built upon years of experience with solving these issues and going all the way.

Find the most innovative candidates here.
