Duplicates vs Nodes in MDM Hierarchies

Identification of duplicate records is a core capability in both Data Quality Management (DQM) and Master Data Management (MDM).

When you inspect records identified as duplicate candidates, you will often have to decide if they describe the same real-world entity or if they describe two real-world entities belonging to the same hierarchy.

Instead of throwing away the latter result, you can store this link in the MDM hub as a relation in a hierarchy (or graph) and thus support a broader range of operational and analytic purposes.

Individual Persons and Households

In business-to-consumer (B2C) scenarios a key challenge is to have a 360-degree view of private customers, either as individual persons or as a household with a shared economy.

Here you must be able to distinguish between the individual person, the household and people who just happen to live at the same postal address. The location hierarchy plays a role in solving this case. This quest includes having precise addresses when identifying units in large buildings and knowing the kind of building. The probability of two John Smith records being the same person differs if it is a single-family house address or the address of a nursing home.

Companies / Organizations in Company Family Trees

In business-to-business (B2B) scenarios a key challenge is to have a 360-degree view of these customers. Similar 360 scenarios exist with suppliers and other business partners.

Organizations can belong to a company family tree. A basic representation, used for example in the Dun & Bradstreet Worldbase, has branches located at a postal address. These branches belong to a legal entity with a headquarters at a given postal address, where there may be other individual branches too. Each legal entity in an enterprise may have a national ultimate mother. In multinational enterprises, there is a global ultimate mother. Public organizations have similar, often very complex, trees.
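
The family tree described above can be sketched as a simple parent-linked structure. This is a minimal illustration, not the Dun & Bradstreet data model: the `Organization` class, its fields, the example names and the rule used for the national ultimate (the highest node on the way up that shares the starting node's country) are all assumptions made for the example.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Organization:
    name: str
    country: str                              # country of the postal address
    parent: Optional["Organization"] = None   # None marks the top of the tree


def global_ultimate(org: Organization) -> Organization:
    """Walk up the parent links until the top of the family tree."""
    while org.parent is not None:
        org = org.parent
    return org


def national_ultimate(org: Organization) -> Organization:
    """Highest node on the way up that shares org's country (simplified rule)."""
    result = org
    node = org.parent
    while node is not None:
        if node.country == org.country:
            result = node
        node = node.parent
    return result


# Made-up example tree: branch -> legal entity -> global parent
holding = Organization("Example Holding Inc", "US")
legal_entity = Organization("Example Danmark A/S", "DK", parent=holding)
branch = Organization("Example Danmark, Aarhus branch", "DK", parent=legal_entity)

print(global_ultimate(branch).name)
print(national_ultimate(branch).name)
```

Storing only the parent link keeps the structure easy to maintain; real company trees may need more relation types than this sketch covers.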

Products by Variant and Sourcing

Products also form hierarchies. The challenge is to identify whether a given product record points to a certain level in the bottom part of a given product hierarchy. Products can have variants in size, colour and more. A product can be packed in different ways. The most prominent product identifier is the Global Trade Item Number (GTIN), which occurs in various representations, for example the Universal Product Code (UPC) popular in North America and the European (now International) Article Number (EAN) popular in Europe. These identifiers are applied by each producer (and in some cases distributor) at the product packing variant level.

Another uniqueness issue for products is what is called multi-sourcing, meaning that the same product from the same original manufacturer can be sourced through more than one supplier, each with its own pricing, discount model, terms of delivery and terms of payment.
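
One way to picture multi-sourcing is to keep the product identity separate from the supplier offers. The sketch below assumes that a product can be identified by the manufacturer plus a manufacturer part number (`mpn`); the field names and the sample rows are invented for illustration.

```python
from collections import defaultdict

# Made-up supplier catalogue rows; all field names are assumptions
supplier_offers = [
    {"supplier": "Supplier A", "manufacturer": "Acme", "mpn": "MW-100",
     "price": 9.50, "delivery_days": 5},
    {"supplier": "Supplier B", "manufacturer": "Acme", "mpn": "MW-100",
     "price": 9.10, "delivery_days": 14},
    {"supplier": "Supplier A", "manufacturer": "Acme", "mpn": "MW-200",
     "price": 19.00, "delivery_days": 5},
]


def group_offers_by_product(offers):
    """Group supplier offers under the product they actually describe."""
    products = defaultdict(list)
    for offer in offers:
        key = (offer["manufacturer"], offer["mpn"])  # assumed product identity
        products[key].append(offer)
    return dict(products)


products = group_offers_by_product(supplier_offers)
for key, offers in products.items():
    print(key, [o["supplier"] for o in offers])
```

Here the two offers for the same part are not duplicates to be merged; they are two sourcing options hanging under one product.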

Solutions Available

When looking for a solution to support you in this conundrum the best fit for you may be a best-of-breed Data Quality Management (DQM) tool and/or a capable Master Data Management (MDM) platform.

This Disruptive MDM / PIM / DQM List has the most innovative candidates here.

How to Detect and Deal with Duplicates

In the data management world, duplicates are two (or more) records with different values but describing the same real-world entity. The most common occurrence is probably having two records describing the same person as for example:

  • Bob Smith at 1 Main Str in Anytown
  • Robert Smith at One Main Street in Any Town

Having duplicates is a well-known pain-point in business scenarios and efforts to remove that pain-point are going on around the clock in organizations across the planet.

Build or Buy?

Some efforts are done by building duplicate detection procedures in-house and some are done by buying a tool for that.

Home-grown solutions often rely on publicly available algorithms like edit distance and Soundex. However, they are small efforts against a huge problem.
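
To make the limitation concrete, here is a minimal sketch of edit distance (Levenshtein distance) applied to the example records above. The `similarity` helper and its normalisation by the longer string are assumptions chosen for illustration, not a recommended matching rule.

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of single-character inserts, deletes and substitutions."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete from a
                            curr[j - 1] + 1,      # insert into a
                            prev[j - 1] + cost))  # substitute or keep
        prev = curr
    return prev[-1]


def similarity(a: str, b: str) -> float:
    """Case-insensitive score in [0, 1]; 1.0 means identical strings."""
    a, b = a.lower(), b.lower()
    longest = max(len(a), len(b))
    return 1.0 if longest == 0 else 1 - levenshtein(a, b) / longest


print(similarity("Anytown", "Any Town"))  # small typographical difference
print(similarity("Bob", "Robert"))        # nickname: low score, same person
```

The town names score high, but "Bob" and "Robert" score low even though they describe the same person. A plain edit distance knows nothing about nicknames, abbreviations such as "Str" or numbers written as words, which is where false negatives creep in.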

When trying to detect duplicates you will run into false positives, which are results indicating that two (or more) records describe the same real-world entity when they do not, as exemplified in the post Famous False Positives. Moreover, there could be heaps of false negatives, which are records that do describe the same real-world entity but are not detected by the algorithm.

Enhanced Approaches

Machine Learning (ML) and Artificial Intelligence (AI) have been used for years in deduplication, and in the underlying data matching, to avoid false positives and false negatives. With the recent rise of ML/AI this approach has become more common. Today we see data matching tools relying heavily on ML/AI approaches.

Another enhanced approach you can find in tools on the market is utilizing external data in the quest for detecting duplicates. This way of overcoming the obstacles is described in the post Using External Data in Data Matching.

A large number of false negatives stems not only from limitations in comparing records and detecting their similarity, but also from failing to bring the right records up for comparison. If you have more than a few thousand records in play, comparing every record with every other record quickly becomes impractical, so you need an initial candidate selection procedure, as pondered in the post Candidate Selection in Deduplication.
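
A common form of candidate selection is blocking: records are grouped by a cheap key and only records within the same group are compared. The sketch below is one possible illustration; the key (first letter of the surname plus postal code) and the field names are assumptions, and a real key has to be chosen with care.

```python
from collections import defaultdict
from itertools import combinations


def blocking_key(record):
    """Hypothetical key: first letter of the surname plus the postal code."""
    return (record["surname"][:1].upper(), record["postal_code"])


def candidate_pairs(records):
    """Yield only pairs of record ids that share a blocking key."""
    blocks = defaultdict(list)
    for record in records:
        blocks[blocking_key(record)].append(record["id"])
    for ids in blocks.values():
        yield from combinations(ids, 2)


# Made-up records
records = [
    {"id": 1, "surname": "Smith", "postal_code": "12345"},
    {"id": 2, "surname": "Smith", "postal_code": "12345"},
    {"id": 3, "surname": "Jones", "postal_code": "12345"},
    {"id": 4, "surname": "Smith", "postal_code": "99999"},
]

pairs = list(candidate_pairs(records))
print(pairs)
```

Four records give six possible pairs, but only one pair reaches the detailed comparison. The trade-off is visible too: if record 4 is the same Smith with a mistyped postal code, this key will never bring the two records together, which is exactly the kind of false negative described above.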

Going All the Way

When you have found your matching records, the next question is what to do with them. This may involve forming a golden record and/or placing the results in master data hierarchies. How you can do that was elaborated in the post Deduplication as Part of MDM.
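
Forming a golden record needs a rule for which value survives when the matched records disagree. The sketch below assumes one simple rule, "the most recently updated non-empty value wins"; the rule, the field names and the sample values are all assumptions for illustration, and real solutions typically offer several survivorship rules per attribute.

```python
def golden_record(records, fields):
    """Per field, keep the non-empty value from the most recently updated record."""
    newest_first = sorted(records, key=lambda r: r["updated"], reverse=True)
    golden = {}
    for field in fields:
        golden[field] = next((r[field] for r in newest_first if r.get(field)), None)
    return golden


# Made-up matched records; ISO dates sort correctly as plain strings
matched = [
    {"name": "Bob Smith", "street": "1 Main Str", "phone": "",
     "updated": "2019-03-01"},
    {"name": "Robert Smith", "street": "One Main Street", "phone": "555-0100",
     "updated": "2018-06-15"},
]

golden = golden_record(matched, ["name", "street", "phone"])
print(golden)
```

The newer record supplies the name and street, while the phone number is taken from the older record because the newer one has none.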

All in all, it is not advisable to do all this at home (in-house in an organization), as there are tools on the market built upon years of experience with solving the issues and going all the way.

Find the most innovative candidates here.

The Rise of Interenterprise MDM

The recent Gartner Magic Quadrant for Master Data Management Solutions has this strategic planning assumption:

By 2023, organizations with shared ontology, semantics, governance and stewardship processes to enable interenterprise data sharing will outperform those that don’t.

Interenterprise data sharing must be leveraged through interenterprise MDM, where master data are shared between many companies, for example in supply chains. The evolution of interenterprise MDM and the current state of the discipline were touched upon in the post MDM Terms In and Out of The Gartner 2020 Hype Cycle.

In the 00s, the evolution of Master Data Management (MDM) started with single-domain / departmental solutions dominated by Customer Data Integration (CDI) and Product Information Management (PIM) implementations. In the best cases, these solutions were underpinned by third-party data sources such as business directories, for example the Dun & Bradstreet (D&B) Worldbase, and by second-party product information sources, for example the GS1 Global Data Synchronization Network (GDSN).

In the previous decade multidomain MDM with enterprise-wide coverage became the norm. Here the solution typically encompasses customer-, vendor/supplier-, product- and asset master data. Increasingly GDSN is supplemented by other forms of Product Data Syndication (PDS). Third party and second party sources are delivered in the form of Data as a Service that comes with each MDM solution.

In this decade we will see the rise of interenterprise MDM, where the solutions to some extent become business ecosystem wide, meaning that you will increasingly share master data and possibly the MDM solutions with your business partners – or else you will fade in the wake of the overwhelming data load you will have to handle yourself.

Contextual MDM vs Enterprise-Wide, Global, Multidomain MDM

The term “contextual Master Data Management” has been floating around for a couple of years. We can see contextual MDM as smaller pieces of MDM with a given flavour as for example focussing on sub/overlapping disciplines as:

The focus can also be at:

  • A given locality
  • A given master data domain as customer, supplier, employee, other/all party, product (beyond PIM), location or asset
  • A given business unit

You must eat an elephant one bite at a time. Therefore, contextual MDM makes a good concept for getting achievable wins.   

However, in an organization with a high level of data management maturity the range of contextual MDM use cases, and the solutions for them, will be encompassed by a common enterprise-wide, global, multidomain MDM framework – either as one solution or a well-orchestrated set of solutions.

One example with dependencies is when working with personalization as part of Product Experience Management (PXM). Here you need customer personas. The elephant in the room, so to speak, is that you have to get the actual personas from Customer MDM and/or the Customer Data Platform (CDP).

The list of solutions on this site covers both one-stop-shopping options for all contextual MDM use cases and specialised solutions for a given contextual MDM use case. Check the growing list here.

Five Product Information Management (PIM) Essential Aspects

A Product Information Management (PIM) solution must encompass some core aspects of handling product data in a digitalized world where products are exchanged online in self-service scenarios. Here are five essential aspects:

Product Identification

Usually a product is identified uniquely within each organization using a number or a code that follows the product in all the applications that handle product data. But as products are exchanged between trading partners, an external product identifier is essential in many business processes.

The most common external identifier of a product is a GTIN (Global Trade Item Number), which has these three common formats:

  • 12-digit UPC – Universal Product Code, which is popular in North America
  • 13-digit EAN – European/International Article Number, which is popular in Europe
  • 14-digit GTIN, which is meant to replace among others the two above

We know these numbers from the barcodes on goods in physical shops.
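
Because the shorter formats can be left-padded with zeros to 14 digits, one check covers all three. The sketch below pads a code and verifies the GS1 check digit (the data digits are weighted 3, 1, 3, ... starting from the digit next to the check digit). The function names are made up for this example.

```python
def to_gtin14(code: str) -> str:
    """Left-pad a 12-, 13- or 14-digit code with zeros to the 14-digit form."""
    if not code.isdigit() or len(code) not in (12, 13, 14):
        raise ValueError(f"not a 12-, 13- or 14-digit code: {code!r}")
    return code.zfill(14)


def has_valid_check_digit(code: str) -> bool:
    """Recompute the last digit from the preceding digits and compare."""
    digits = [int(c) for c in to_gtin14(code)]
    body, check = digits[:-1], digits[-1]
    # In the 14-digit form the 13 data digits carry weights 3, 1, 3, ..., 3
    total = sum(d * (3 if i % 2 == 0 else 1) for i, d in enumerate(body))
    return (10 - total % 10) % 10 == check


print(to_gtin14("036000291452"))               # 12-digit UPC in 14-digit form
print(has_valid_check_digit("4006381333931"))  # 13-digit EAN
print(has_valid_check_digit("4006381333932"))  # last digit altered
```

A passing check digit only shows that the number is well formed. It does not tell you whether the GTIN has been assigned, or to which product.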

It is worth noticing that a GTIN is applied to each packing level for a product model. So, if we for example have a given model of a magic wand, there could be three GTINs applied:

  • One for a single magic wand
  • One for a box of 25 magic wands
  • One for a pallet of 50 boxes of magic wands

Also, the GTIN is applied to a specific variant of a product model. So, if we have a given model of a pair of trousers, there will be a GTIN for each size and colour variant.

This level of product is also referred to as a SKU – Stock Keeping Unit.

Besides the GTIN (UPC/EAN) system there are plenty of industry and national number and code systems in play.

Product Classification

There are many reasons why you need to classify your range of products. Therefore, there are also many ways of doing so. You can either use an external classification system or your homegrown classification tailored to your organization's view of the world.

Here are five examples of external standards you can use in order to meet the classification standards required by your trading partners:

  • UNSPSC (United Nations Standard Products and Services Code) is managed by GS1 US™ for the UN Development Programme (UNDP). This is an open, global, multi-sector standard for classification of products and services. This standard is often used in public tenders and at some marketplaces.
  • GPC (Global Product Classification) is created by GS1 as a separate standard classification within its data synchronization network, the Global Data Synchronization Network (GDSN).
  • Harmonized System (HS) codes are commodity codes that are harmonized worldwide and serve as the key classifier in international trade. They determine customs duties, import and export rules and restrictions as well as documentation requirements. National statistical bureaus may require these codes from businesses doing foreign trade.
  • eCl@ss is a cross-industry product data standard for classification and description of products and services, with an emphasis on being an ISO/IEC-compliant industry standard nationally and internationally. The classification guides the eCl@ss standard for product attributes (in eClass called properties) that are needed for a product with a given classification.
  • ETIM develops and manages a worldwide uniform classification for technical products. This classification guides the ETIM standard for product attributes (in ETIM called features) that are needed for a product with a given classification.

Within each organization you can have one – and often several – homegrown classification schemes that exist besides the external ones relevant in each organization. One example is how you arrange your range of products on a webshop similar to how you would arrange the goods in aisles in a physical shop.

Specific Product Attributes

When selling products in self-service scenarios a main challenge is that each classification of products needs a specific set of attributes (sometimes called properties and features) in order to provide the set of information needed to support a buying decision.

So, while some attributes are common to all products, a product in a given classification needs a further set of attributes populated to be complete, and those attributes are irrelevant for a product belonging to another classification.

External standards such as eClass and ETIM include a scheme that names and states the attributes needed for a product belonging to a certain classification, and also the value lists that go with them, for example which terms for colours are valid and can be exchanged consistently between trading partners.
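
The idea of classification-driven completeness can be sketched as a small check. The classes, attribute names and value list below are invented for illustration and are not taken from eClass or ETIM.

```python
# All names and values below are made up; they do not come from eClass or ETIM
COMMON_ATTRIBUTES = {"name", "brand"}
REQUIRED_BY_CLASS = {
    "trousers": {"size", "colour"},
    "drill": {"voltage", "chuck_size"},
}
ALLOWED_VALUES = {"colour": {"black", "blue", "red"}}


def missing_attributes(product):
    """Attributes required for the product's classification but not populated."""
    required = COMMON_ATTRIBUTES | REQUIRED_BY_CLASS.get(product["classification"], set())
    return sorted(attr for attr in required if not product.get(attr))


def invalid_values(product):
    """Populated attributes whose value is not in the agreed value list."""
    return {attr: product[attr]
            for attr, allowed in ALLOWED_VALUES.items()
            if product.get(attr) and product[attr] not in allowed}


trousers = {"classification": "trousers", "name": "Chino", "brand": "Acme",
            "size": "32", "colour": "navy"}
drill = {"classification": "drill", "name": "Drill X", "brand": "Acme",
         "voltage": "18 V"}

print(missing_attributes(trousers), invalid_values(trousers))
print(missing_attributes(drill), invalid_values(drill))
```

The trousers record is complete but uses a colour term outside the value list, while the drill record lacks an attribute that only matters for its classification.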

Related Products

A core challenge in self-service selling is that you have to mimic what a salesman does: If you enter a shop to buy an intended product, the salesman would like you to walk away with a better (and more expensive) choice along with some other products you would need to fulfil the intended purpose of use.

A common trick in a webshop is to present what other users also bought or looked at. That is the crowdsourcing approach. But it does not stop there. You must also present precisely which accessories go with a given product model. You must be able to present a replacement if the intended product is not available anymore (or temporarily out of stock). You can present up-sell options based on the features in question. You can present x-sell options based on the intended purpose of use.
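
The relation kinds mentioned above (accessory, replacement, up-sell and x-sell) can be held as typed links between product models. The sketch below is one possible shape; the class names and product ids are hypothetical.

```python
from dataclasses import dataclass
from enum import Enum


class RelationType(Enum):
    ACCESSORY = "accessory"
    REPLACEMENT = "replacement"
    UP_SELL = "up-sell"
    CROSS_SELL = "x-sell"


@dataclass(frozen=True)
class ProductRelation:
    source: str          # product the customer is looking at
    target: str          # product to present alongside it
    type: RelationType


def related(product_id, relation_type, relations):
    """Targets linked to product_id by the given relation type."""
    return [r.target for r in relations
            if r.source == product_id and r.type is relation_type]


# Made-up product ids
relations = [
    ProductRelation("wand-basic", "wand-case", RelationType.ACCESSORY),
    ProductRelation("wand-basic", "wand-deluxe", RelationType.UP_SELL),
    ProductRelation("wand-basic", "wand-basic-v2", RelationType.REPLACEMENT),
    ProductRelation("wand-basic", "spell-book", RelationType.CROSS_SELL),
]

print(related("wand-basic", RelationType.ACCESSORY, relations))
print(related("wand-basic", RelationType.REPLACEMENT, relations))
```

Keeping the relation type explicit lets the webshop decide where each link is shown, for example replacements only when the intended product is out of stock.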

Digital Assets

When your prospective customer can’t see and feel a product online, you must present high-quality product images that show the product (and not a lot of other things too). These can be product images taken from a range of different angles. You can also provide video clips of the given product.

Besides that, there may be many other types of digital assets related to each product model. This can be installation guides, line drawings, certificates and more.

Tools That Can Help

This site has a list of innovative Master Data Management (MDM) and PIM solutions that can help you master the core aspects of product information management. Check out the list here.