Duplicates vs Nodes in MDM Hierarchies

Identifying duplicate records is a core capability in both Data Quality Management (DQM) and Master Data Management (MDM).

When you inspect records identified as duplicate candidates, you will often have to decide if they describe the same real-world entity or if they describe two real-world entities belonging to the same hierarchy.

Instead of discarding the latter result, you can also store this link in the MDM hub as a relation in a hierarchy (or graph), where it supports a broader range of operational and analytic purposes.
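As a minimal sketch of this idea, the Python below turns a reviewer's decision on a candidate pair into a link that can be stored in the hub. The names ReviewOutcome, Link and links_from_review, and the relation labels, are hypothetical and not taken from any particular MDM product.

```python
from dataclasses import dataclass
from enum import Enum


class ReviewOutcome(Enum):
    SAME_ENTITY = "same_entity"        # a true duplicate
    SAME_HIERARCHY = "same_hierarchy"  # two entities in the same hierarchy
    UNRELATED = "unrelated"


@dataclass(frozen=True)
class Link:
    source_id: str
    target_id: str
    relation: str


def links_from_review(id_a: str, id_b: str, outcome: ReviewOutcome) -> list:
    """Turn a reviewer's decision on a candidate pair into links to store."""
    if outcome is ReviewOutcome.SAME_ENTITY:
        return [Link(id_a, id_b, "duplicate_of")]
    if outcome is ReviewOutcome.SAME_HIERARCHY:
        # Not a duplicate, but the relation is kept instead of being discarded.
        return [Link(id_a, id_b, "related_in_hierarchy")]
    return []


print(links_from_review("crm:17", "erp:42", ReviewOutcome.SAME_HIERARCHY))
```

The point of the design is that a rejected duplicate does not have to be a wasted review: only the unrelated outcome produces nothing to store.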

Individual Persons and Households

In business-to-consumer (B2C) scenarios, a key challenge is to have a 360-degree view of private customers, either as individual persons or as a household with a shared economy.

Here you must be able to distinguish between the individual person, the household and people who just happen to live at the same postal address. The location hierarchy plays a role in solving this case. This quest includes having precise addresses when identifying units in large buildings and knowing the kind of building. The probability of two John Smith records being the same person differs depending on whether the address is a single-family house or a nursing home.
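To make the John Smith point concrete, here is a small worked sketch using Bayes' rule in odds form. The prior probabilities per building type are invented for illustration only, and the function name posterior_same_person is hypothetical. In practice such priors would have to be estimated from your own data.

```python
# Illustrative priors only (invented numbers): the chance that two records
# at the same address describe the same person, before names are compared.
PRIOR_SAME_PERSON = {
    "single_family_house": 0.30,
    "apartment_building": 0.05,
    "nursing_home": 0.01,
}


def posterior_same_person(building_type: str, name_likelihood_ratio: float) -> float:
    """Combine the address-based prior with name evidence (Bayes, odds form)."""
    prior = PRIOR_SAME_PERSON[building_type]
    prior_odds = prior / (1.0 - prior)
    posterior_odds = prior_odds * name_likelihood_ratio
    return posterior_odds / (1.0 + posterior_odds)


# The same name evidence leads to different conclusions per building type.
for building in PRIOR_SAME_PERSON:
    print(building, round(posterior_same_person(building, 20.0), 3))
```

With identical name evidence, the single-family house ends up with a much higher probability than the nursing home, which is exactly why the kind of building matters.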

Companies / Organizations in Company Family Trees

In business-to-business (B2B) scenarios, a key challenge is to have a 360-degree view of these customers. Similar 360-degree scenarios exist with suppliers and other business partners.

Organizations can belong to a company family tree. A basic representation, used for example in the Dun & Bradstreet Worldbase, is having branches at a postal address. These branches belong to a legal entity with a headquarters at a given postal address, where there may be other individual branches too. Each legal entity in an enterprise may have a national ultimate mother. In multinational enterprises, there is a global ultimate mother. Public organizations have similar, often very complex, trees.
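A family tree like this can be held as simple parent links. The sketch below is an assumption-laden illustration, not the Dun & Bradstreet data model: the OrgNode fields, the company names and the rule that the national ultimate is the highest ancestor in the same country as the starting record are all hypothetical simplifications.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class OrgNode:
    org_id: str
    name: str
    country: str
    parent_id: Optional[str] = None  # None marks the top of the tree


def ancestors(org_id: str, nodes: dict) -> list:
    """Return the chain of parents, nearest first, guarding against cycles."""
    chain, seen = [], {org_id}
    node = nodes[org_id]
    while node.parent_id is not None:
        if node.parent_id in seen:
            raise ValueError("cycle in company family tree")
        seen.add(node.parent_id)
        node = nodes[node.parent_id]
        chain.append(node)
    return chain


def global_ultimate(org_id: str, nodes: dict) -> OrgNode:
    chain = ancestors(org_id, nodes)
    return chain[-1] if chain else nodes[org_id]


def national_ultimate(org_id: str, nodes: dict) -> OrgNode:
    # Assumed rule: highest node in the chain sharing the start node's country.
    start = nodes[org_id]
    same_country = [n for n in ancestors(org_id, nodes) if n.country == start.country]
    return same_country[-1] if same_country else start


nodes = {
    "G": OrgNode("G", "Example Group Inc", "US"),
    "N": OrgNode("N", "Example Danmark A/S", "DK", "G"),
    "L": OrgNode("L", "Example Nordic ApS", "DK", "N"),
    "B": OrgNode("B", "Example Nordic, Aarhus branch", "DK", "L"),
}
print(national_ultimate("B", nodes).name, "|", global_ultimate("B", nodes).name)
```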

Products by Variant and Sourcing

Products also form hierarchies. The challenge is to identify whether a given product record points to a certain level in the bottom part of a given product hierarchy. Products can have variants in size, colour and more. A product can be packed in different ways. The most prominent product identifier is the Global Trade Item Number (GTIN), which occurs in various representations, for example the Universal Product Code (UPC) popular in North America and the European (now International) Article Number (EAN) popular in Europe. These identifiers are applied by each producer (and in some cases distributor) at the product packing variant level.
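Because a UPC and an EAN can represent the same GTIN, it helps to validate the check digit and bring all representations to one length before comparing product records. The sketch below uses the standard GTIN mod-10 check digit calculation. The function names are hypothetical, and the two codes are widely published sample numbers used only to exercise the arithmetic.

```python
def gtin_check_digit(body: str) -> int:
    """Check digit for a GTIN body (all digits except the last one)."""
    total = 0
    # Weights alternate 3, 1, 3, ... starting from the rightmost body digit.
    for position, char in enumerate(reversed(body)):
        weight = 3 if position % 2 == 0 else 1
        total += int(char) * weight
    return (10 - total % 10) % 10


def is_valid_gtin(code: str) -> bool:
    """True for an 8, 12, 13 or 14 digit code with a correct check digit."""
    if not (code.isascii() and code.isdigit()) or len(code) not in (8, 12, 13, 14):
        return False
    return gtin_check_digit(code[:-1]) == int(code[-1])


def to_gtin14(code: str) -> str:
    """Left-pad a valid code with zeros so all representations compare equal."""
    if not is_valid_gtin(code):
        raise ValueError(f"not a valid GTIN: {code!r}")
    return code.zfill(14)


print(is_valid_gtin("4006381333931"))  # 13-digit EAN form
print(to_gtin14("036000291452"))       # 12-digit UPC form padded to 14 digits
```

Comparing the padded 14-digit form means a UPC stored with 12 digits in one application and as a 13-digit EAN with a leading zero in another will be seen as the same identifier.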

Another uniqueness issue for products is multi-sourcing: the same product from the same original manufacturer can be sourced through more than one supplier, each with its own pricing, discount model, terms of delivery and terms of payment.
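One way to picture multi-sourcing is to keep one product node with several sourcing options beneath it, rather than treating each supplier's item as a separate product or merging them into one. The sketch below assumes that manufacturer name plus manufacturer part number identifies the product. The field names and sample values are hypothetical.

```python
from collections import defaultdict

# Hypothetical supplier item records for illustration.
supplier_items = [
    {"supplier": "Supplier A", "manufacturer": "Acme", "mpn": "x-100", "price": 10.50},
    {"supplier": "Supplier B", "manufacturer": "ACME ", "mpn": "X-100", "price": 9.95},
    {"supplier": "Supplier A", "manufacturer": "Acme", "mpn": "X-200", "price": 4.00},
]


def group_by_product(items: list) -> dict:
    """Group supplier items under the product they source."""
    grouped = defaultdict(list)
    for item in items:
        # Assumed product key: normalized manufacturer + manufacturer part number.
        key = (item["manufacturer"].strip().lower(), item["mpn"].strip().upper())
        grouped[key].append(item)
    return dict(grouped)


products = group_by_product(supplier_items)
for key, offers in products.items():
    print(key, [(o["supplier"], o["price"]) for o in offers])
```

The supplier-specific attributes (price, terms) stay on the offer level, while the product itself is represented once.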

Solutions Available

When looking for a solution to support you in this conundrum the best fit for you may be a best-of-breed Data Quality Management (DQM) tool and/or a capable Master Data Management (MDM) platform.

The Disruptive MDM / PIM / DQM List has the most innovative candidates here.

How to Detect and Deal with Duplicates

In the data management world, duplicates are two (or more) records with different values that describe the same real-world entity. The most common occurrence is probably two records describing the same person, for example:

  • Bob Smith at 1 Main Str in Anytown
  • Robert Smith at One Main Street in Any Town
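The two records above differ in every field, yet a little normalization makes them comparable. The sketch below is a deliberately tiny illustration: the nickname and address token tables are hypothetical and far too small for real use, where much larger reference data would be needed.

```python
import re

# Tiny, hypothetical lookup tables for illustration only.
NICKNAMES = {"bob": "robert", "bill": "william"}
ADDRESS_TOKENS = {"str": "street", "st": "street", "one": "1"}


def normalize(text: str, mapping: dict) -> str:
    """Lowercase, split into alphanumeric tokens and map known variants."""
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    return " ".join(mapping.get(token, token) for token in tokens)


def match_key(name: str, street: str, city: str) -> tuple:
    # City is compared without spaces so "Anytown" equals "Any Town".
    city_key = "".join(re.findall(r"[a-z0-9]+", city.lower()))
    return (normalize(name, NICKNAMES), normalize(street, ADDRESS_TOKENS), city_key)


a = match_key("Bob Smith", "1 Main Str", "Anytown")
b = match_key("Robert Smith", "One Main Street", "Any Town")
print(a)
print(a == b)
```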

Having duplicates is a well-known pain-point in business scenarios and efforts to remove that pain-point are going on around the clock in organizations across the planet.

Build or Buy?

Some organizations build duplicate detection procedures in-house, while others buy a tool for the job.

Home-grown solutions often rely on publicly available algorithms like edit distance and Soundex. However, these are small efforts against a huge problem.

When trying to detect duplicates you will run into false positives: results indicating that two (or more) records describe the same real-world entity when in fact they do not, as exemplified in the post Famous False Positives. Moreover, there could be heaps of false negatives: records that do describe the same real-world entity but are not detected by the algorithm.
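For reference, here is a compact sketch of the two algorithms named above: Levenshtein edit distance and American Soundex. It is a plain textbook version assuming mostly ASCII names, written to show how little these building blocks do on their own.

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of single-character inserts, deletes and substitutions."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(previous[j] + 1,          # delete from a
                               current[j - 1] + 1,       # insert into a
                               previous[j - 1] + cost))  # substitute or keep
        previous = current
    return previous[-1]


SOUNDEX_CODES = {}
for letters, digit in (("bfpv", "1"), ("cgjkqsxz", "2"), ("dt", "3"),
                       ("l", "4"), ("mn", "5"), ("r", "6")):
    for letter in letters:
        SOUNDEX_CODES[letter] = digit


def soundex(name: str) -> str:
    """American Soundex: first letter plus three digits."""
    letters = [c for c in name.lower() if c.isalpha()]
    if not letters:
        return ""
    result = letters[0].upper()
    previous_code = SOUNDEX_CODES.get(letters[0], "")
    for letter in letters[1:]:
        if letter in "hw":
            continue  # h and w do not separate letters with the same code
        code = SOUNDEX_CODES.get(letter, "")
        if code and code != previous_code:
            result += code
        previous_code = code  # a vowel resets the previous code
    return (result + "000")[:4]


print(soundex("Robert"), soundex("Rupert"), soundex("Bob"))
print(levenshtein("Bob", "Robert"), levenshtein("Str", "Street"))
```

Soundex gives Robert and Rupert the same code but Bob a different one, so on its own it produces both a potential false positive and a false negative for the example names used earlier.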
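False positives and false negatives can be counted when you have a manually verified sample of duplicate pairs to compare against. The sketch below is a generic precision and recall calculation. The helper names and record IDs are hypothetical.

```python
def pair(a: str, b: str) -> frozenset:
    """An unordered pair of record IDs."""
    return frozenset((a, b))


def evaluate(predicted: set, actual: set) -> dict:
    """Compare pairs flagged by the algorithm with verified duplicate pairs."""
    true_positives = len(predicted & actual)
    false_positives = len(predicted - actual)  # flagged, but not duplicates
    false_negatives = len(actual - predicted)  # duplicates that were missed
    flagged = true_positives + false_positives
    real = true_positives + false_negatives
    return {
        "false_positives": false_positives,
        "false_negatives": false_negatives,
        "precision": true_positives / flagged if flagged else 0.0,
        "recall": true_positives / real if real else 0.0,
    }


predicted = {pair("r1", "r2"), pair("r3", "r4")}
actual = {pair("r1", "r2"), pair("r5", "r6")}
result = evaluate(predicted, actual)
print(result)
```

Low precision means reviewers waste time on false positives. Low recall means duplicates stay hidden in the data.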

Enhanced Approaches

Machine Learning (ML) and Artificial Intelligence (AI) have been used for years in deduplication and the underlying data matching to avoid false positives and false negatives. With the recent rise of ML/AI this approach has become more common. Today we see data matching tools relying heavily on ML/AI approaches.

Another enhanced approach you can find in tools on the market is utilizing external data in the quest for detecting duplicates. This way of overcoming the obstacles is described in the post Using External Data in Data Matching.

A huge number of false negatives is caused not only by limitations in comparing records and detecting their similarity, but also by failing to bring the right records up for comparison in the first place. If you have more than a few thousand records in play, you need an initial candidate selection procedure, as pondered in the post Candidate Selection in Deduplication.
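Candidate selection is often done by blocking: only records that share a cheap key are compared in detail. Comparing every record with every other record means n(n-1)/2 comparisons, which is roughly 50 million pairs for 10,000 records. The sketch below assumes a hypothetical blocking key of postal code plus surname initial. A key that is too strict will itself cause false negatives.

```python
from collections import defaultdict
from itertools import combinations


def blocking_key(record: dict) -> tuple:
    # Hypothetical key: postal code plus first letter of the surname.
    return (record["postal_code"], record["surname"][:1].upper())


def candidate_pairs(records: dict) -> set:
    """Return ID pairs that share a blocking key and deserve a full comparison."""
    blocks = defaultdict(list)
    for record_id, record in records.items():
        blocks[blocking_key(record)].append(record_id)
    pairs = set()
    for ids in blocks.values():
        pairs.update(combinations(sorted(ids), 2))
    return pairs


records = {
    1: {"surname": "Smith", "postal_code": "12345"},
    2: {"surname": "smith", "postal_code": "12345"},
    3: {"surname": "Jones", "postal_code": "12345"},
    4: {"surname": "Smith", "postal_code": "99999"},
}
all_pairs = len(records) * (len(records) - 1) // 2
print(candidate_pairs(records), "instead of", all_pairs, "pairs")
```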

Going All the Way

When you have found your matching records, the next question is what to do with them. The options include forming a golden record and/or placing the results in master data hierarchies. How you can do that was elaborated in the post Deduplication as Part of MDM.

All in all, it is not advisable to do all this at home (in-house in an organization), as there are tools on the market built upon years of experience with solving the issues and going all the way.

Find the most innovative candidates here.

What is Data Matching and Deduplication?

The two terms data matching and deduplication are often used synonymously.

In the data quality world, deduplication is used to describe a process where two or more data records that describe the same real-world entity are merged into one golden record. This can be executed in different ways, as told in the post Three Master Data Survivorship Approaches.

Data matching can be seen as an overarching discipline to deduplication. Data matching is used to identify the duplicate candidates in deduplication. Data matching can also be used to identify matching data records between internal and external data sources as examined in the post Third-Party Data Enrichment in MDM and DQM.

As an end-user organization you can implement data matching / deduplication technology from either pure play Data Quality Management (DQM) solution providers or through data management suites and Master Data Management (MDM) solutions as reported in the post DQM Tools In and Around MDM Tools.

When matching internal data records against external sources, a commonly used approach is to use the data matching capabilities of the third-party data provider. Providers such as Dun & Bradstreet (D&B), Experian and others offer this service in addition to offering the third-party data.

To close the circle, end-user organizations can use the external data matching result to improve the internal deduplication and more. One example is to apply a matched D-U-N-S Number from D&B for company records as a strong deduplication candidate selection criterion. In addition, such data matching results may often result not in a deduplication, but in building hierarchies of master data.
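A minimal sketch of that idea: once company records carry an externally matched identifier, pairs with the same identifier become duplicate candidates, while pairs with different identifiers but a shared top-level parent become hierarchy relations. The field names duns and global_ultimate_duns and all values are placeholders, not real D&B data or a real D&B response format.

```python
from itertools import combinations

# Placeholder records; identifiers are dummy values for illustration.
companies = [
    {"id": "crm:1", "duns": "000000001", "global_ultimate_duns": "000000009"},
    {"id": "erp:7", "duns": "000000001", "global_ultimate_duns": "000000009"},
    {"id": "crm:2", "duns": "000000002", "global_ultimate_duns": "000000009"},
    {"id": "crm:3", "duns": None, "global_ultimate_duns": None},
]


def classify_pairs(records: list) -> dict:
    """Label record pairs using the externally matched identifiers."""
    results = {}
    for a, b in combinations(records, 2):
        if a["duns"] is None or b["duns"] is None:
            continue  # unmatched records are left for other matching methods
        pair_ids = (a["id"], b["id"])
        if a["duns"] == b["duns"]:
            results[pair_ids] = "duplicate_candidate"
        elif (a["global_ultimate_duns"] is not None
              and a["global_ultimate_duns"] == b["global_ultimate_duns"]):
            results[pair_ids] = "same_family_tree"
    return results


labels = classify_pairs(companies)
print(labels)
```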

Data Matching and Deduplication

This site has a list of the most innovative providers of data matching and deduplication tools, stretching from best-of-breed specialists in Artificial Intelligence (AI) underpinned data matching and deduplication to Master Data Management (MDM) solutions that include data matching and deduplication capabilities. Check the list here.

Deduplication and Master Data Management

A core intersection between Data Quality Management (DQM) and Master Data Management (MDM) is deduplication. The process here will basically involve:

  • Match master data records across the enterprise application landscape, where these records describe the same real-world entity most frequently being a person, organization, product or asset.
  • Link the master data records in the best fit / achievable way, for example as a golden record.
  • Apply the master data records / golden record to a hierarchy.

Data Matching

The classic data matching quest is to identify data records that refer to the same person being an existing customer and/or prospective customer. The first solutions for doing that emerged more than 40 years ago. Since then, the more difficult task of identifying the same organization being a customer, prospective customer, vendor/supplier or other business partner has been implemented, and solutions for identifying products as being the same have also been deployed.

Besides using data matching to detect internal duplicates within an enterprise, data matching has also been used to match against external registries. Doing this serves as a means to enrich internal records, which also helps in identifying internal duplicates.

Master Data Survivorship

When two or more data records have been confirmed as duplicates there are various ways to deal with the result.

In the registry MDM style, you store only the IDs that link the records, so the linkage can be used for specific operational and analytic purposes.
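A registry of this kind can be as small as a cross-reference between source record IDs and a shared master ID. The sketch below is an illustration under that assumption. The class name Registry, its methods and the ID formats are hypothetical.

```python
from collections import defaultdict
from typing import Optional


class Registry:
    """Stores only links between source record IDs and a master ID."""

    def __init__(self) -> None:
        self._master_of = {}              # (system, local_id) -> master_id
        self._members = defaultdict(set)  # master_id -> {(system, local_id)}

    def link(self, master_id: str, system: str, local_id: str) -> None:
        self._master_of[(system, local_id)] = master_id
        self._members[master_id].add((system, local_id))

    def master_id(self, system: str, local_id: str) -> Optional[str]:
        return self._master_of.get((system, local_id))

    def linked_records(self, master_id: str) -> set:
        return set(self._members.get(master_id, set()))


registry = Registry()
registry.link("M-1", "crm", "17")
registry.link("M-1", "erp", "42")
print(registry.master_id("erp", "42"), registry.linked_records("M-1"))
```

No attribute values are copied into the hub here. The source applications remain the place where the data itself lives.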

Further, there are more advanced ways of using the linkage as described in the post Three Master Data Survivorship Approaches.

Master Data Survivorship Approaches

One relatively simple approach is to choose the best fit record as the survivor in the MDM hub and then keep the IDs of the records purged from the hub as a link back to the source application records.

Probably the most used approach is to form a golden record from the best fit data elements, store this compiled record in the MDM hub and keep the IDs of the linked records from the source applications.

A third way is to keep the sourced records in the MDM hub and on the fly compile a golden view for a given purpose.
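The three approaches can be sketched side by side. The survivorship rules used here are assumptions chosen for brevity: "best fit record" means the record with the most filled attributes, and "best fit data element" means the non-empty value from the most recently updated record. Real rules are usually set per attribute and per source. All field names and values are hypothetical.

```python
META_FIELDS = {"source_id", "updated"}


def best_fit_record(records: list) -> dict:
    """Approach 1: the most complete record survives as it is."""
    def filled(record: dict) -> int:
        return sum(1 for key, value in record.items()
                   if key not in META_FIELDS and value)
    return max(records, key=filled)


def golden_record(records: list) -> dict:
    """Approach 2: compile a record from the newest non-empty values."""
    golden = {}
    # ISO dates sort correctly as strings, newest first.
    for record in sorted(records, key=lambda r: r["updated"], reverse=True):
        for key, value in record.items():
            if key not in META_FIELDS and value and key not in golden:
                golden[key] = value
    golden["source_ids"] = [record["source_id"] for record in records]
    return golden


def golden_view(stored_records: list) -> dict:
    """Approach 3: keep the source records and compile on the fly when read."""
    return golden_record(stored_records)


records = [
    {"source_id": "crm:17", "updated": "2023-01-10", "name": "Bob Smith",
     "phone": "", "email": "bob@example.com"},
    {"source_id": "erp:42", "updated": "2024-03-02", "name": "Robert Smith",
     "phone": "555-0100", "email": "", "city": "Anytown"},
]
print(best_fit_record(records)["source_id"])
print(golden_record(records))
```

In all three cases the source IDs are kept, so the link back to the source application records is never lost.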

Hierarchy Management

When you inspect records identified as duplicate candidates, you will often have to decide if they describe the same real-world entity or if they describe two real-world entities belonging to the same hierarchy.

Instead of discarding the latter result, you can also store this link in the MDM hub as a relation in a hierarchy (or graph), where it supports a broader range of operational and analytic purposes.

The main hierarchies in play here are described in the post Are These Familiar Hierarchies in Your MDM / PIM / DQM Solution?

With persons in private roles, a classic challenge is to distinguish between the individual person, a household with a shared economy and people who happen to live at the same postal address. The location hierarchy plays a role in solving this case. This quest includes having precise addresses when identifying units in large buildings and knowing the kind of building. The probability of two John Smith records being the same person differs depending on whether the address is a single-family house or a nursing home.

Organizations can belong to a company family tree. A basic representation, used for example in the Dun & Bradstreet Worldbase, is having branches at a postal address. These branches belong to a legal entity with a headquarters at a given postal address, where there may be other individual branches too. Each legal entity in an enterprise may have a national ultimate mother. In multinational enterprises, there is a global ultimate mother. Public organizations have similar, often very complex, trees.

Products also form hierarchies. The challenge is to identify whether a given product record points to a certain level in the bottom part of a given product hierarchy. Products can have variants in size, colour and more. A product can be packed in different ways. The most prominent product identifier is the Global Trade Item Number (GTIN), which occurs in various representations, for example the Universal Product Code (UPC) popular in North America and the European (now International) Article Number (EAN) popular in Europe. These identifiers are applied by each producer at the product packing variant level.

Solutions Available

When looking for a solution to support you in this conundrum the best fit for you may be a best-of-breed Data Quality Management (DQM) tool and/or a capable Master Data Management (MDM) platform.

This list has the most innovative candidates here.