Deduplication and Master Data Management

A core intersection between Data Quality Management (DQM) and Master Data Management (MDM) is deduplication. The process basically involves three steps:

  • Match master data records across the enterprise application landscape where they describe the same real-world entity, most frequently a person, organization, product or asset.
  • Link the master data records in the best fit / achievable way, for example as a golden record.
  • Apply the master data records / golden record to a hierarchy.

Data Matching

The classic data matching quest is to identify data records that refer to the same person, whether an existing customer or a prospective customer. The first solutions for doing that emerged more than 40 years ago. Since then, solutions have been implemented for the more difficult task of identifying the same organization in the role of customer, prospective customer, vendor/supplier or other business partner, and solutions for identifying products as being the same have been deployed as well.

Besides detecting internal duplicates within an enterprise, data matching has also been used to match against external registries. Doing this serves as a means to enrich internal records, which also helps in identifying internal duplicates.
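As a minimal sketch of what duplicate detection can look like, the example below compares name and address fields with Python's standard-library difflib. The field names, the sample records and the 0.85 threshold are illustrative assumptions; real matching engines use far more elaborate normalization, blocking and scoring.

```python
from difflib import SequenceMatcher
from itertools import combinations


def normalize(value: str) -> str:
    """Lowercase the value and replace punctuation with spaces, then collapse whitespace."""
    cleaned = "".join(c if c.isalnum() else " " for c in value.lower())
    return " ".join(cleaned.split())


def similarity(a: dict, b: dict, fields=("name", "address")) -> float:
    """Average string similarity (0..1) over the compared fields."""
    scores = [
        SequenceMatcher(None, normalize(a[f]), normalize(b[f])).ratio()
        for f in fields
    ]
    return sum(scores) / len(scores)


def duplicate_candidates(records: list, threshold: float = 0.85) -> list:
    """Return pairs of record IDs whose similarity meets the (assumed) threshold."""
    return [
        (a["id"], b["id"])
        for a, b in combinations(records, 2)
        if similarity(a, b) >= threshold
    ]


# Hypothetical records from three source applications.
records = [
    {"id": "crm-1", "name": "John Smith", "address": "12 Main Street"},
    {"id": "erp-7", "name": "Jon Smith", "address": "12 Main St."},
    {"id": "web-3", "name": "Mary Jones", "address": "4 Oak Avenue"},
]
candidates = duplicate_candidates(records)
```

Comparing every pair is only workable for small record sets. The pairs returned are candidates that still need confirmation, not confirmed duplicates.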

Master Data Survivorship

When two or more data records have been confirmed as duplicates there are various ways to deal with the result.

In the registry MDM style, you store only the IDs of the linked records, so the linkage can be used for specific operational and analytic purposes.
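The registry style can be sketched as a cross-reference structure that holds nothing but identifiers. The class and method names below are hypothetical, and the in-memory dictionaries stand in for whatever storage a real hub would use.

```python
from collections import defaultdict


class Registry:
    """Registry-style hub: stores only links between source record IDs, no attribute data."""

    def __init__(self):
        self._master_of = {}               # (system, local_id) -> master_id
        self._members = defaultdict(set)   # master_id -> {(system, local_id), ...}
        self._next_id = 1

    def link(self, *source_keys):
        """Link confirmed duplicates under one master ID, merging existing groups."""
        existing = {self._master_of[k] for k in source_keys if k in self._master_of}
        if existing:
            master_id = min(existing)
        else:
            master_id = self._next_id
            self._next_id += 1
        for other in existing - {master_id}:
            # Fold previously separate groups into the surviving master ID.
            for key in self._members.pop(other):
                self._master_of[key] = master_id
                self._members[master_id].add(key)
        for key in source_keys:
            self._master_of[key] = master_id
            self._members[master_id].add(key)
        return master_id

    def linked_records(self, source_key):
        """All source records linked to the same real-world entity."""
        master = self._master_of.get(source_key)
        if master is None:
            return [source_key]
        return sorted(self._members[master])


registry = Registry()
registry.link(("crm", "1001"), ("erp", "C-77"))
registry.link(("erp", "C-77"), ("web", "u-5"))
linked = registry.linked_records(("crm", "1001"))
```

Because the attribute data stays in the source applications, a wrong link can be undone without any loss of information.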

Further, there are more advanced ways of using the linkage as described in the post Three Master Data Survivorship Approaches.

Master Data Survivorship Approaches

One relatively simple approach is to choose the best-fit record as the survivor in the MDM hub and then keep the IDs of the records purged from the hub as a link back to the source application records.

Probably the most used approach is to form a golden record from the best-fit data elements, store this compiled record in the MDM hub and keep the IDs of the linked records from the source applications.

A third way is to keep the source records in the MDM hub and compile a golden view on the fly for a given purpose.

Hierarchy Management

When you inspect records identified as duplicate candidates, you will often have to decide if they describe the same real-world entity or if they describe two real-world entities belonging to the same hierarchy.

Instead of throwing away the latter result, this link can also be stored in the MDM hub as a relation in a hierarchy (or graph) and thus support a broader range of operational and analytic purposes.
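One way to picture this is a small graph of typed links, where a reviewed candidate pair is kept either as a duplicate link or as a hierarchy relation. The relation names and the HubGraph class are hypothetical and serve only as an illustration.

```python
from dataclasses import dataclass, field
from enum import Enum


class Relation(Enum):
    SAME_ENTITY = "same_entity"        # a confirmed duplicate
    SAME_HOUSEHOLD = "same_household"  # distinct persons in one household
    BRANCH_OF = "branch_of"            # distinct units in one company family tree


@dataclass
class HubGraph:
    edges: list = field(default_factory=list)  # (from_id, relation, to_id)

    def record_decision(self, a: str, b: str, relation: Relation) -> None:
        """Keep the reviewed link, whichever kind it turned out to be."""
        self.edges.append((a, relation, b))

    def related(self, record_id: str, relation: Relation) -> set:
        """IDs connected to record_id by the given relation, in either direction."""
        found = set()
        for a, rel, b in self.edges:
            if rel is relation:
                if a == record_id:
                    found.add(b)
                elif b == record_id:
                    found.add(a)
        return found


graph = HubGraph()
graph.record_decision("p1", "p2", Relation.SAME_ENTITY)     # same person
graph.record_decision("p1", "p3", Relation.SAME_HOUSEHOLD)  # spouse at the same address
household = graph.related("p1", Relation.SAME_HOUSEHOLD)
```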

The main hierarchies in play here are described in the post Are These Familiar Hierarchies in Your MDM / PIM / DQM Solution?

Family consumer citizen

With persons in private roles, a classic challenge is to distinguish between the individual person, a household with a shared economy and people who happen to live at the same postal address. The location hierarchy plays a role in solving this case. This quest includes having precise addresses when identifying units in large buildings and knowing the kind of building. The probability of two John Smith records being the same person differs if it is a single-family house address or the address of a nursing home.

Family company

Organizations can belong to a company family tree. A basic representation, used for example in the Dun & Bradstreet Worldbase, has branches at a postal address. These branches belong to a legal entity with a headquarters at a given postal address, where there may be other individual branches too. Each legal entity in an enterprise may have a national ultimate mother. In multinational enterprises, there is a global ultimate mother. Public organizations have similar, often very complex, trees.

Product hierarchy

Products also form hierarchies. The challenge is to identify whether a given product record points to a certain level in the bottom part of a given product hierarchy. Products can have variants in size, colour and more. A product can be packed in different ways. The most prominent product identifier is the Global Trade Item Number (GTIN), which occurs in various representations, for example the Universal Product Code (UPC) popular in North America and the European (now International) Article Number (EAN) popular in Europe. These identifiers are assigned by each producer at the product packing variant level.
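When matching product records on GTIN, it helps to validate the check digit and to compare UPC and EAN values in one common 14-digit form. The sketch below uses the standard GTIN modulo-10 check digit calculation; the function names are my own.

```python
def gtin_check_digit(body: str) -> int:
    """Check digit for a GTIN body (all digits except the last)."""
    # Weights alternate 3, 1, 3, ... starting from the rightmost body digit.
    total = sum(
        int(digit) * (3 if index % 2 == 0 else 1)
        for index, digit in enumerate(reversed(body))
    )
    return (10 - total % 10) % 10


def is_valid_gtin(code: str) -> bool:
    """Validate a GTIN-8, GTIN-12 (UPC-A), GTIN-13 (EAN-13) or GTIN-14."""
    if not (code.isascii() and code.isdigit()) or len(code) not in (8, 12, 13, 14):
        return False
    return gtin_check_digit(code[:-1]) == int(code[-1])


def to_gtin14(code: str) -> str:
    """Left-pad a valid GTIN with zeros so UPC and EAN values compare equal."""
    if not is_valid_gtin(code):
        raise ValueError(f"not a valid GTIN: {code!r}")
    return code.zfill(14)


upc = "036000291452"    # a UPC-A value
ean = "0036000291452"   # the same item written as EAN-13
same_item = to_gtin14(upc) == to_gtin14(ean)
```

Left-padding with zeros does not change the check digit, which is why the 12-, 13- and 14-digit forms of the same number all validate.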

Solutions Available

When looking for a solution to support you in this conundrum the best fit for you may be a best-of-breed Data Quality Management (DQM) tool and/or a capable Master Data Management (MDM) platform.

This list has the most innovative candidates here.

B2B2C in MDM, PIM and DQM

The Business-to-Business-to-Consumer (B2B2C) scenario is becoming of increasing importance in Master Data Management (MDM), Product Information Management (PIM) and Data Quality Management (DQM).

This scenario is usually seen in manufacturing including pharmaceuticals as examined in the post Six MDMographic Stereotypes.

One challenge here is how to extend the capabilities in MDM / PIM / DQM solutions that are built for Business-to-Business (B2B) and Business-to-Consumer (B2C) use cases. Doing B2B2C requires a Multidomain MDM approach with solid PIM and DQM elements, either as one solution, a suite of solutions or a wisely assembled set of best-of-breed solutions.


In the MDM sphere a key challenge with B2B2C is that you probably must encompass more surrounding applications and ensure a 360-degree view of party, location and product entities, as they have varying roles with varying purposes at varying times tracked by these applications. You will also need to cover a broader range of data types that go beyond what is traditionally seen as master data.

In DQM you need data matching capabilities that can identify and compare real-world persons, organizations and the grey zone of persons in professional roles. You need DQM for a deep hierarchy of location data, and you need to profile product data completeness for both professional and consumer use cases.

In PIM the content must be suitable for both the professional audience and the end consumers. Achieving this requires both a flexible in-house PIM solution and a comprehensive outbound Product Data Syndication (PDS) setup.

As the middle B in B2B2C supply chains you must have a strategic partnership with your suppliers/vendors, with a comprehensive inbound Product Data Syndication (PDS) setup and, increasingly, a framework for sharing customer master data that takes the privacy and confidentiality aspects into account.

This emerging MDM / PIM / DQM scope is also referred to as Multienterprise MDM.

Three Master Data Survivorship Approaches

One of the core capabilities around data quality in Master Data Management (MDM) solutions is providing data matching functionality with the aim of deduplicating records that describe the same real-world entity, thereby facilitating a 360-degree view of a master data entity.

Identifying the duplicates is hard enough in itself. However, how to resolve the result of the deduplication process is another challenge.

There are three main approaches for doing that:

Master Data Survivorship Approaches

In the above example we have three records: an orange, a green and a blue one. They are considered to be duplicates, meaning they describe the same real-world person.

1: Survival of the fittest record

Selecting the record that is the most fit according to a data quality rule is the simplest approach. The rule(s) that determine which record will survive are most often based on either:

  • Lineage, where the source systems are prioritized
  • Completeness, for example which record has the most fields and characters filled

The downside of this approach is that the surviving record only has the data quality of the selected record, which might not be optimal, and that valuable information from the deselected records might get lost.

Data quality tools that are good at identifying duplicates often have this simple method for survivorship.

In the above example the blue record wins and survives in the MDM hub, while the orange and the green records survive only in the source system(s).
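The two rule types can be sketched as follows. The source ranking, field names and sample values are assumptions made up for illustration, chosen so that the blue record wins as in the example.

```python
# Hypothetical lineage ranking of source systems; lower number = more trusted.
SOURCE_PRIORITY = {"crm": 0, "erp": 1, "web": 2}
META_KEYS = ("id", "source")


def completeness(record: dict) -> tuple:
    """(number of filled fields, total characters filled), ignoring bookkeeping keys."""
    values = [
        str(value).strip()
        for key, value in record.items()
        if key not in META_KEYS and value is not None
    ]
    values = [v for v in values if v]
    return (len(values), sum(len(v) for v in values))


def fittest_by_lineage(duplicates: list) -> dict:
    """The record from the highest-ranked source system survives."""
    return min(duplicates, key=lambda r: SOURCE_PRIORITY[r["source"]])


def fittest_by_completeness(duplicates: list) -> dict:
    """The record with the most fields (then most characters) filled survives."""
    return max(duplicates, key=completeness)


duplicates = [
    {"id": "orange", "source": "web", "name": "J. Smith", "street": "Main St", "phone": None},
    {"id": "green", "source": "erp", "name": "John Smith", "street": "", "phone": ""},
    {"id": "blue", "source": "crm", "name": "John Smith", "street": "12 Main Street", "phone": "555-0100"},
]
survivor = fittest_by_completeness(duplicates)
```

Whichever rule is used, the survivor is one whole source record, so anything held only by the deselected records is not carried into the hub.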

2: Forming a golden record

In this approach the information from each data element (field) is selected from the record that, by given rules, is the best fit. These rules may be based on lineage, completeness, validity or other data quality dimensions.

Data elements may also be parsed, meaning that the element is split into discrete parts, for example an address line into house number and street name. The outcome may also be a union of the (parsed) data elements coming from the source systems.

In that way a new golden record is formed.

Additionally, values may be corrected by using external directories, which act as a kind of source system.

This approach is more complex. While it solves some of the data quality pain in the first approach, there will still be situations where data elements are mixed wrongly or information is lost, and it is hard to roll back an untrue result.

In the above example the golden record in the MDM hub is formed by data elements from the blue, green and orange record – and the city name is fetched from an external directory.
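A minimal sketch of field-level survivorship is shown below. The per-field source priorities and the sample values are hypothetical, the external directory lookup is reduced to a value passed in by the caller, and the sketch assumes at most one record per source system.

```python
# Hypothetical per-field lineage rules: the first source with a value wins.
FIELD_PRIORITY = {
    "name": ["crm", "erp", "web"],
    "street": ["erp", "crm", "web"],
    "phone": ["web", "crm", "erp"],
}


def build_golden_record(duplicates: list, directory_city: str = None):
    """Compile a golden record field by field and remember where each value came from."""
    by_source = {record["source"]: record for record in duplicates}
    golden, lineage = {}, {}
    for field_name, sources in FIELD_PRIORITY.items():
        for source in sources:
            value = (by_source.get(source, {}).get(field_name) or "").strip()
            if value:
                golden[field_name] = value
                lineage[field_name] = by_source[source]["id"]
                break
    if directory_city:
        # Value supplied by an external directory, acting as an extra source.
        golden["city"] = directory_city
        lineage["city"] = "external-directory"
    golden["linked_ids"] = sorted(record["id"] for record in duplicates)
    return golden, lineage


duplicates = [
    {"id": "orange", "source": "web", "name": "J. Smith", "street": "", "phone": "555-0100"},
    {"id": "green", "source": "erp", "name": "", "street": "12 Main Street", "phone": ""},
    {"id": "blue", "source": "crm", "name": "John Smith", "street": "Main St", "phone": ""},
]
golden, lineage = build_golden_record(duplicates, directory_city="Springfield")
```

Keeping the lineage of each field alongside the golden record is one possible way to ease the rollback problem, since it records which source record each value was taken from.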

3: Context aware survivorship 

In this approach the identified duplicates are not physically merged and purged.

Instead, by applying rules based on lineage, completeness and other data quality dimensions, you can compile several different golden record views, each fit for a given context. The results may differ in both the surviving data elements and the surviving data records.

This is the most complex approach but also the approach that potentially has the best business fit. The downsides include, besides the complexity, possible performance issues, not least in batch processing.

In the above example the MDM hub includes the orange, green and blue record and presents one surviving golden record for marketing purposes and two surviving golden records for accounting purposes.
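The idea can be sketched by keeping all source records in the hub and compiling views on request. The context definitions, the "cluster" and "billing_account" fields and the sample values are hypothetical; a context here is simply a grouping rule plus a source priority.

```python
from collections import defaultdict

META_KEYS = ("id", "source")

# Hypothetical contexts: how to group the linked records and which source to trust most.
CONTEXTS = {
    "marketing": (lambda r: r["cluster"], ["crm", "web", "erp"]),
    "accounting": (lambda r: (r["cluster"], r["billing_account"]), ["erp", "crm", "web"]),
}


def golden_views(hub_records: list, context: str) -> list:
    """Compile golden views on the fly; the hub records themselves are left untouched."""
    group_key, priority = CONTEXTS[context]
    groups = defaultdict(list)
    for record in hub_records:
        groups[group_key(record)].append(record)
    views = []
    for members in groups.values():
        ranked = sorted(members, key=lambda r: priority.index(r["source"]))
        view = {}
        # Apply the lowest-priority record first so better sources overwrite it.
        for record in reversed(ranked):
            view.update({k: v for k, v in record.items() if v and k not in META_KEYS})
        view["linked_ids"] = sorted(r["id"] for r in members)
        views.append(view)
    return views


hub = [
    {"id": "orange", "source": "web", "cluster": "P1", "billing_account": "A-1", "name": "J. Smith"},
    {"id": "green", "source": "erp", "cluster": "P1", "billing_account": "A-1", "name": "John Smith"},
    {"id": "blue", "source": "crm", "cluster": "P1", "billing_account": "A-2", "name": "John Smith"},
]
marketing = golden_views(hub, "marketing")    # one view for the person
accounting = golden_views(hub, "accounting")  # one view per billing account
```

Since every view is recomputed from the stored source records, changing a rule only changes the views. The price is the extra work at read time that the text mentions.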

Digital Transformation Success Rely on MDM / PIM Success

It is hard to find an organization that does not want to be on the digital transformation wagon today. But how can you ensure that your digital transformation journey will be a success? One element in making this data-driven process succeed is a solid foundation of Master Data Management (MDM), including Product Information Management (PIM).

The core concepts here are:

  • Providing a 360-degree view of master data entities: Engaging with your customers across a range of digital platforms is a core part of any digital transformation. Having a 360-degree view of your customer has never been more important, and that starts with well-organized and maintained customer master data. The same is true for supplier master data and other party master data. A 360-degree view of locations is equally important. The same goes for products and assets, as pondered in the post Golden Records in Multidomain MDM.
  • Enabling happy self-service scenarios: Customer data are gathered from many sources, and digital self-registration is becoming the most commonly used method. The self-service theme has also emerged in handling supplier master data, as self-service supplier portals have become common as the place where supplier/vendor master data is captured and maintained. Interacting with your trading partners on digital platforms and having the most complete product information in front of your customers in self-service online selling scenarios requires a solid foundation for product master data and Product Experience Management (PxM).
  • Underpinning the best customer experience: Customer experience (CX) and MDM must go hand in hand. Both themes involve multiple business units and digital environments within your enterprise and in the wider business ecosystem, where your enterprise operates. Master data is the glue that brings the data you hold about your customers together as well as the glue that combines this with the data you share about your product offering.
  • Encompassing Internet of Things (IoT): Smart devices that produce big data can be used to gain much more insight about parties (in customer and other roles), products, locations and the things themselves. You can only do that effectively by relating IoT and MDM.

3 Reasons MDM No Longer Delivers a Customer 360

Today’s guest blog post is from David Corrigan, CEO at AllSight

When Master Data Management (MDM) and Customer Data Integration (CDI) were designed over 15 years ago, they were touted as the answer to “Customer 360”.  But the art of mastering data and the art of creating a complete view of a customer are two very different things.  MDM is focused on managing a much smaller, core data set and aims to very deeply and truly master it.  Customer 360 solutions focus on “all data about the customer” to get the complete picture.  When it comes to a 360-degree view of the customer, master data is only part of the story.  Additional data has to be part of the 360 in order to have a full understanding of the customer – whether that be an individual or an organization. Additional data sources and data types required for today’s Customer 360 include transactions, interactions, events, unstructured content, analytics and intelligence – all of which are not managed in MDM.

Today, leading organizations are looking beyond MDM to a new era of Customer 360 technology to deliver the elusive complete view of the customer.  Here are 3 reasons why:

  1. Customer 360 needs all data; MDM only stores partial data.  MDM focuses on core master data attributes, matching data elements and improving data quality.  Customer 360 has rapidly evolved requiring big data sets such as transactions and interactions, as well as unstructured big data like emails, call center transcriptions, and web chat interactions not to mention social media mentions, images and video.
  2. Customer 360 must serve analytical and operational needs; MDM only supports operational processing.  The original intent of MDM was to provide ‘good’ data to CRM and transactional systems.  While a branch of MDM evolved for ‘analytical MDM’ use cases, it was really a staging area for quality and governance to occur before data was loaded into a warehouse for analysis and reporting.  A Customer 360 is meant to be analyzed and used by marketing analysts, data scientists as well as customer care and sales staff – it powers many different personas with different perspectives of the customer.
  3. Customer 360 is about improving the customer experience; Master data (core data) is used during a customer experience. Master Data is required during a customer interaction to understand key facts about a customer including name, contact info, account info, etc.  But the Customer 360 needs to blend all interactions, transactions and events into a comprehensive customer journey in order to analyze and personalize customer experiences.

But why does a Customer 360 now require all this information and capabilities beyond traditional MDM?  It is because the expectations of customers and the demands they make on businesses have changed.  Customers want personalized service and they want it now.  And they want a consistent experience across all channels – online, via phone, and in store.  They don’t want to have to repeat themselves or their preferences every time they interact with a business.  This requires companies to know more about their customers and to anticipate their next move in order to retain their business and loyalty.  Because, not only are customers more demanding than ever, but it is also easier for them to switch brands with little to no cost.

In order to meet these demands, many organizations assume they need to build these capabilities on their own using new technologies such as Apache Hadoop and Graph data stores.  These technologies can join together silos of master data, transactional data, raw data lake data, and experience/journey analytics.  However, a new class of software is emerging that bridges data, analytics and action and is based on these modern technologies. Customer Intelligence Platforms manage all customer information and synthesize it into an intelligent Customer 360.  Synthesizing all of those data sources is no easy task and that is where many organizations stall out.  What’s required is a machine-learning contextual matching engine that automates the process of linking customer data and evaluates data confidence.

Organizations such as Dell are seeing this shift first hand and have recognized that legacy MDM apps alone aren’t cutting it.  Deotis Harris, Senior Director, MDM at Dell EMC said “We saw an opportunity to leverage AllSight’s modern technology (Customer Intelligence), coupled with our legacy systems such as Master Data Management (MDM), to provide the insight required to enable our sellers, marketers and customer service reps to create better experiences for our customers.”

If you are like Dell and so many other organizations, a Customer 360 is high on your priority list.  A Customer Intelligence Platform might just be your next step.
