MDM Integrated Matching

author
Talend Documentation Team
EnrichVersion
6.3
6.2
6.1
6.0
EnrichProdName
Talend MDM Platform
Talend Data Fabric
task
Data Quality and Preparation > Matching data
Data Quality and Preparation > Enriching data
EnrichPlatform
Talend MDM Server

This article first describes the key concepts and implementation details of MDM integrated matching, then describes the REST API of the real time match and explain match services.

In Release 5.4, the MDM validation process is enhanced with integrated matching. Integrated matching groups similar records together and creates a "golden" record, which is a consolidated version of all records in the group.

For instance, MDM integrated matching can merge the following two records:

Customer ( id = 1, fname = "John", lname = "Smith" )
Customer ( id = 2, fname = "Johnny", lname = "Smith" )

into:

Customer ( id = new_uuid, fname = "Johnny", lname = "Smith" )

For integrated matching, MDM ensures the following:

  • Each group has a unique id. This makes it possible to find all similar records with a query on the group id.
  • Each golden record creation leads to the creation of a Data Stewardship Console (DSC for short) task. The DSC task shows how the golden record was built. Note that no DSC task is created if the golden record was built from a group with only one staging data record.
Environment

Talend MDM v5.4 or above

Integrated matching Introduction

Integrated matching is responsible for the following tasks:

  • Identify similar records and find values to survive. (This is known as the match and survivorship process. For more information, see the following section Match and survivorship.)
  • Build golden records.
  • Create DSC task(s) or ensure each golden record creation has a corresponding DSC task.
  • Process DSC task changes (for example, change in survived values and exclusion of records).
  • Process deletions in the staging area.

Integrated matching only deals with records within the staging area. For more information about the staging area, see the article MDM Staging Area Validation .

However, two additional services cover use cases where the submitted records are not meant to be stored in the database:

  • Real time match: A user submits an MDM record and MDM returns a list of similar results. The submitted record is not stored in the staging area but the similar results are.
  • Explain match: A user submits a list of MDM records and MDM returns how records are matched and survived. This service may work on records already present in the staging area as well as on full records provided as XML documents.
Match and survivorship

A Swoosh algorithm is used for the match and survivorship process. The algorithm matches and survives records at the same time, which is more efficient than comparing each record one by one since it limits the number of match operations between records.

For more information about the Swoosh algorithm, see this PDF . Note that MDM uses the "R-Swoosh" variant.
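As an illustration of the match-and-merge loop, here is a minimal Python sketch of the R-Swoosh idea: each record is compared against the already resolved records, and a match is merged and re-queued rather than compared pair by pair. The `match` and `merge` functions and the record layout are hypothetical stand-ins chosen for this example, not the rules MDM applies.

```python
def r_swoosh(records, match, merge):
    """Resolve records with an R-Swoosh style loop (illustrative sketch)."""
    pending = list(records)   # records still to be processed
    resolved = []             # records that match nothing else in this list
    while pending:
        current = pending.pop()
        buddy = next((r for r in resolved if match(current, r)), None)
        if buddy is None:
            resolved.append(current)
        else:
            # Merge the pair and re-queue the result so that it is
            # compared against the remaining resolved records again.
            resolved.remove(buddy)
            pending.append(merge(current, buddy))
    return resolved


# Hypothetical match rule: same last name and one first name prefixes the other.
def match(a, b):
    same_lname = a["lname"] == b["lname"]
    prefix = a["fname"].startswith(b["fname"]) or b["fname"].startswith(a["fname"])
    return same_lname and prefix


# Hypothetical merge rule: keep all source ids and the longest first name.
def merge(a, b):
    return {
        "ids": a["ids"] | b["ids"],
        "fname": max(a["fname"], b["fname"], key=len),
        "lname": a["lname"],
    }


customers = [
    {"ids": {1}, "fname": "John", "lname": "Smith"},
    {"ids": {2}, "fname": "Johnny", "lname": "Smith"},
    {"ids": {3}, "fname": "Alice", "lname": "Jones"},
]
groups = r_swoosh(customers, match, merge)
```

With these sample records, the two Smith records end up in one merged record and the Jones record stays on its own.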

The Swoosh algorithm works on table-like structures, not on trees (this is not specific to Swoosh, since many other matching algorithms do not handle trees either). This brings a few limitations (1):

  • Cannot use repeatable elements.
  • Cannot use an element when at least one of the parent elements is a repeatable element.

MDM Studio does not allow users to select match fields that break these rules.

Why such limitations?

Repeatable elements will introduce "duplicates" because joins will be used.

For example, if a Product instance has two colors ("Blue" and "Red"), a projection of this Product instance gives:

id    color
1     Blue
1     Red

More generally, selecting Root/field0/.../fieldN for match leads to ((Root x field0) x ... ) x fieldN (where x denotes a cartesian product). Since the match algorithm works per row, repeated rows lead to issues that cannot be addressed in Release 5.4 (hence the limitations listed in (1)).
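The row duplication described above can be reproduced with a short Python sketch. The `project` helper and the dictionary layout of the Product instance are assumptions made for this illustration only. They do not reflect how MDM stores or queries records.

```python
from itertools import product


def project(record, fields):
    """Flatten a record into rows, taking the cartesian product of its fields."""
    columns = []
    for field in fields:
        value = record[field]
        # A repeatable element contributes several values, a simple one only one.
        columns.append(value if isinstance(value, list) else [value])
    return [dict(zip(fields, row)) for row in product(*columns)]


product_instance = {"id": 1, "color": ["Blue", "Red"]}
rows = project(product_instance, ["id", "color"])
for row in rows:
    print(row["id"], row["color"])
```

A single Product instance yields two rows with the same id, which is exactly the kind of duplicate a per-row match algorithm cannot tell apart from two distinct records.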

Because of the limitations in (1), the match and survivorship process cannot survive every field, so MDM needs another way to survive the values that this process cannot handle.

Match limits

MDM prevents "out of control" matches, that is, suspicious match rules that merge too many records together. By default, match and merge stops if a group size exceeds 50 records. Although this limit may seem arbitrary, experience shows that a single group rarely holds that many records.

Specifically, huge groups have the following issues:

  • They make stewardship really hard to understand or perform. For example, a group of 50 records creates a DSC task with 50 columns to review.
  • Records are not expected to be so similar in the staging area.

If you want to allow MDM to create bigger groups, set the "staging.matchmerge.maxgroupsize" property in mdm.conf.

If you increase its value and run into issues (performance issues, for example), change it back to 50.

Confidence

The Swoosh algorithm gives a binary answer for a match (yes or no). MDM adds another layer on top of the match algorithm to introduce a confidence score.

Consider a list of (match algorithm, weight) tuples defined for a type: (m0, w0), (m1, w1), ... , (mN, wN). The confidence score for a match of two records r1 and r2 (with attributes a1i and a2i) is computed using:
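The formula itself is not reproduced in this copy of the article. A form consistent with the surrounding description (a weighted match score normalized to a value between 0 and 1) is the weight-normalized sum below. It is offered as an assumption, not as the original figure:

```latex
\mathrm{confidence}(r_1, r_2) =
  \frac{\sum_{i=0}^{N} w_i \, m_i(a_{1i}, a_{2i})}{\sum_{i=0}^{N} w_i}
```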

Where "a1i" is the ith attribute of record r1, and "a2i" the ith attribute of r2.

This formula normalizes all match scores to a value between 0 and 1 with a weighted match score.

When MDM computes the confidence score between two survived records, it always takes the lowest confidence score.
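The following Python sketch shows one way to compute such a score, assuming the confidence is the weight-normalized sum of per-attribute match scores and that each match algorithm returns a value between 0 and 1. The `exact` and `prefix_similarity` matchers, the weights, and the `lowest_confidence` helper are hypothetical and only serve the example.

```python
def exact(a, b):
    """Hypothetical matcher: 1.0 when both values are equal, 0.0 otherwise."""
    return 1.0 if a == b else 0.0


def prefix_similarity(a, b):
    """Hypothetical matcher: length of the shared prefix over the longest value."""
    longest = max(len(a), len(b))
    if longest == 0:
        return 1.0
    common = 0
    for x, y in zip(a, b):
        if x != y:
            break
        common += 1
    return common / longest


def confidence(r1, r2, rules):
    """Weighted match score, normalized by the sum of the weights (assumed form)."""
    total_weight = sum(weight for _, _, weight in rules)
    weighted = sum(weight * matcher(r1[field], r2[field])
                   for field, matcher, weight in rules)
    return weighted / total_weight


def lowest_confidence(scores):
    """Keep the lowest score when several confidence scores are available."""
    return min(scores)


rules = [("fname", prefix_similarity, 1.0), ("lname", exact, 3.0)]
r1 = {"fname": "John", "lname": "Smith"}
r2 = {"fname": "Johnny", "lname": "Smith"}
score = confidence(r1, r2, rules)
```

Because every matcher returns a value between 0 and 1 and the result is divided by the sum of the weights, the score also stays between 0 and 1.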

Golden record

Simple field survivorship

MDM will survive fields that are not matched using default survivorship rules.

For example:

Customer ( id = 1, fname = "John", lname = "Smith" )
Customer ( id = 2, fname = "Johnny", lname = "Smith" )

can lead to golden record:

Customer ( id = new_uuid, fname = "Johnny", lname = "Smith" )

In this example, only "fname" and "lname" are used to match records together.

However, the Customer instances used to build the golden record may have different values for the "age" attribute, and "age" is not one of the fields used for the match. This is where the default survivorship rule applies.

For example, suppose Customer #1 and Customer #2 have different age values:

Customer ( id = 1, fname = "John", lname = "Smith", age = 35 )
Customer ( id = 2, fname = "Johnny", lname = "Smith", age = 36 )

The default survivorship rule for "age" is LARGEST, that is, the largest value will be taken into account for survivorship. The golden record then becomes:

Customer ( id = new_uuid, fname = "Johnny", lname = "Smith", age = 36 )

because 36 is larger than 35.

There are a few built-in default survivorship rules (in case users do not specify all of them):

Type group    Default survivorship
Dates         MOST_COMMON
Numbers       LARGEST
String        LONGEST
Boolean       PREFER_FALSE

The type groups are shown below:

Types      XSD types
Dates      DATE, DATETIME, TIME
Numbers    INT, UNSIGNED_INT, INTEGER, NEGATIVE_INTEGER, POSITIVE_INTEGER, NON_NEGATIVE_INTEGER, NON_POSITIVE_INTEGER, DECIMAL, DOUBLE, UNSIGNED_DOUBLE, BYTE, UNSIGNED_BYTE, LONG, UNSIGNED_LONG, SHORT, UNSIGNED_SHORT, FLOAT
String     STRING
Boolean    BOOLEAN
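The default rules in the tables above can be sketched in Python as follows. The function names and the `DEFAULT_RULES` mapping are hypothetical. The tie-breaking behavior (first value encountered wins) and the reading of PREFER_FALSE as "false if any value is false" are assumptions, since the article does not specify them.

```python
from collections import Counter


def most_common(values):
    """MOST_COMMON: the value that occurs most often (first one wins a tie)."""
    return Counter(values).most_common(1)[0][0]


def largest(values):
    """LARGEST: the greatest value."""
    return max(values)


def longest(values):
    """LONGEST: the value with the most characters (first one wins a tie)."""
    return max(values, key=len)


def prefer_false(values):
    """PREFER_FALSE: false as soon as one value is false (assumed reading)."""
    return all(values)


# Default survivorship rule per type group, as listed in the table.
DEFAULT_RULES = {
    "Dates": most_common,
    "Numbers": largest,
    "String": longest,
    "Boolean": prefer_false,
}


def survive(type_group, values):
    """Apply the default survivorship rule of a type group to candidate values."""
    return DEFAULT_RULES[type_group](values)


survived_age = survive("Numbers", [35, 36])
```

Applied to the example above, the "age" values 35 and 36 survive as 36.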

Repeatable field survivorship

To survive repeatable fields, MDM performs incremental partial updates with override=false. Suppose you start with the following golden record:

<Customer>

  <id>new_uuid</id>

  <fname>Johnny</fname>

  <lname>Smith</lname>

</Customer>

MDM then survives Customer #1:

<Customer>

  <id>1</id>

  <fname>John</fname>

  <lname>Smith</lname>

  <age>35</age>

  <addresses>

    <address>

      <street>Street #1</street>

      <city>City #1</city>

    </address>

  </addresses>

</Customer>

This gives an updated golden record:

<Customer>

  <id>new_uuid</id>

  <fname>Johnny</fname>

  <lname>Smith</lname>

  <age>35</age>

  <addresses>

    <address>

      <street>Street #1</street>

      <city>City #1</city>

    </address>

  </addresses>

</Customer>

MDM then proceeds to Customer #2:

<Customer>

  <id>2</id>

  <fname>Johnny</fname>

  <lname>Smith</lname>

  <age>36</age>

  <addresses>

    <address>

      <street>Street #2</street>

      <city>City #2</city>

    </address>

  </addresses>

</Customer>

This gives an updated golden record:

<Customer>

  <id>new_uuid</id>

  <fname>Johnny</fname>

  <lname>Smith</lname>

  <age>36</age>

  <addresses>

    <address>

      <street>Street #1</street>

      <city>City #1</city>

    </address>

    <address>

      <street>Street #2</street>

      <city>City #2</city>

    </address>

  </addresses>

</Customer>
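
The accumulation of repeatable elements shown in this walkthrough can be sketched in Python with the standard xml.etree.ElementTree module. The `append_repeatable` helper is hypothetical and covers only the repeatable "addresses" container. Simple fields such as "age" are left to the default survivorship rules and are not handled here.

```python
import copy
import xml.etree.ElementTree as ET


def append_repeatable(golden, source, container_tag):
    """Append every child of source/<container_tag> to golden/<container_tag>.

    Existing children of the golden record are kept, never overwritten.
    """
    src_container = source.find(container_tag)
    if src_container is None:
        return golden
    dst_container = golden.find(container_tag)
    if dst_container is None:
        dst_container = ET.SubElement(golden, container_tag)
    for child in src_container:
        dst_container.append(copy.deepcopy(child))
    return golden


golden = ET.fromstring(
    "<Customer><id>new_uuid</id><fname>Johnny</fname><lname>Smith</lname></Customer>"
)
sources = [
    "<Customer><id>1</id><addresses><address>"
    "<street>Street #1</street><city>City #1</city>"
    "</address></addresses></Customer>",
    "<Customer><id>2</id><addresses><address>"
    "<street>Street #2</street><city>City #2</city>"
    "</address></addresses></Customer>",
]
for xml_text in sources:
    append_repeatable(golden, ET.fromstring(xml_text), "addresses")

streets = [s.text for s in golden.findall("addresses/address/street")]
```

After both source records are processed, the golden record holds both addresses in the order the records were survived.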
DSC task creation

MDM automatically creates a DSC task that keeps track of how the golden record was built.

Consider two example records as follows:

Customer (id = 1, fname = "John", lname = "Smith", age = 35)
Customer (id = 2, fname = "Johnny", lname = "Smith", age = 36)

And the golden record is:

Customer (id = new_uuid, fname = "Johnny", lname = "Smith", age = 36)

This creates a DSC task where the columns are "id", "fname", and "lname". In the "target" record, all values from the golden record are used. In the source records, all values from the group records are taken.

The DSC task is not limited to the fields used in the match. It includes all fields that do not break the limitations on repeatable elements (see (1) for more information).

Group size and DSC tasks

The star number of the DSC task also depends on the number of records inside the group, with a maximum of 5.

DSC task to be resolved

When MDM creates a golden record with low confidence (for more information about the confidence score, see the Match and survivorship section described earlier in this article), the DSC task is created in the "new" state. In this case, a data steward must resolve the DSC task.

Once the DSC task is resolved, the next validation detects the resolution and applies the values from the resolved DSC task.

Real time service

This service allows quick match between a record and the existing golden records in the staging area. Technically, this is a match using the provided record with all validated golden records. This is also a special kind of match since it stops once the submitted record is attached to a group.

When a blocking key is defined, this service jumps to the corresponding value found in the submitted record. For example, if the blocking key is using the field "Customer/State" and the value of the submitted record is "NY", then MDM will only match the submitted record with golden records where "Customer/State = NY".
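The effect of a blocking key can be sketched in Python as a simple pre-filter. The `candidates_for` helper and the dictionary representation of records are assumptions for this illustration. They do not describe how MDM indexes or queries golden records.

```python
def candidates_for(submitted, golden_records, blocking_field):
    """Keep only the golden records that share the submitted record's blocking key."""
    key = submitted.get(blocking_field)
    return [g for g in golden_records if g.get(blocking_field) == key]


golden_records = [
    {"id": "g1", "Customer/State": "NY"},
    {"id": "g2", "Customer/State": "CA"},
    {"id": "g3", "Customer/State": "NY"},
]
submitted = {"id": "1", "Customer/State": "NY"}
candidates = candidates_for(submitted, golden_records, "Customer/State")
```

Only the records in the "NY" block are then passed to the match itself, which keeps the number of comparisons small.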

Explain match service

This service allows users to perform match simulation operations, that is, to simulate a match between records that do not always exist in the staging area. Technically, this is a match using the provided records without storing them in the staging area. This service also returns a result that explains why records were grouped together or not.

You may use the built-in blocking key when defining a match rule. However, the match simulation operations will not take into account the built-in blocking key.

URL for REST API

Introduction

The MDM server exposes a REST interface you can use to implement the real time match service and explain match service.

Depending on the "Accept" value in the HTTP request (see http://www.wikipedia.org/wiki/Hypertext_Transfer_Protocol for more information about the "Accept" value), the service might respond in different formats. Supported formats are "text/xml", "application/json", and "application/xml".

All operations are described using this convention:

Operation name (1)

HTTP_Request HTTP URL sample (2)

text/xml (3)

Example of XML response

application/json (4)

Example of JSON response

(1) Quick summary of what the operation does.

(2) HTTP command (GET / POST / DELETE / PUT) and a URL sample.

(3) and (4): Samples of responses that depend on the Accept header in the HTTP request. For example, for the HTTP request with the Accept header of "Accept: text/xml", XML documents will be returned.

"Real time match" - REST API description

Submit a record for match

POST http://localhost:9000/datamanager/services/tasks/matching/similar/TestDataContainer?model=TestDataModel&type=TestRecordTypeName

The POST body must include a well-formed XML record.

For example, if you use a URL http://localhost:9000/datamanager/services/tasks/matching/similar/Customer?model=Customer&type=Customer , you should pass a POST body with an XML record that complies with the definition of Customer type in the Customer data model. The submitted record will be compared with the records in the staging area of the Customer container.

Example of body:

<Customer>

  <id>1</id>

  <fname>Peter</fname>

  <lname>Smith</lname>

</Customer>

text/xml

<similars>

    <items>

        <confidence>1.0</confidence> <!-- confidence score between the submitted record and the staging area records -->

        <golden>false</golden> <!-- true if the similar record is a golden record, false otherwise -->

        <id>4</id> <!-- id of the similar record in the staging area -->

    </items>

    <items>

        <confidence>0.9607142746448517</confidence>

        <golden>false</golden>

        <id>5</id>

    </items>

    <items>

        <confidence>0.9749999880790711</confidence>

        <golden>true</golden>

        <id>602a75da-5f7b-4df3-85c2-2a3dca90249c</id>

    </items>

    <items>

        <confidence>0.9749999880790711</confidence>

        <golden>false</golden>

        <id>9</id>

    </items>

</similars> 

application/json

{
  "similars": {
    "items": [{
      "confidence": 1,
      "golden": false,
      "id": 4
    }, {
      "confidence": 0.9607142746448517,
      "golden": false,
      "id": 5
    }, {
      "confidence": 0.9749999880790711,
      "golden": true,
      "id": "602a75da-5f7b-4df3-85c2-2a3dca90249c"
    }, {
      "confidence": 0.9749999880790711,
      "golden": false,
      "id": 9
    }]
  }
}
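
As a client-side illustration, the following Python sketch builds the real time match request with the standard urllib module and extracts the golden record ids from a JSON response shaped like the sample above. The helper names are hypothetical, the "Content-Type: text/xml" request header is an assumption, and authentication is left out because the article does not describe it. The request is only built here, not sent.

```python
import json
import urllib.request
from urllib.parse import urlencode

# Base URL taken from the samples in this article.
BASE = "http://localhost:9000/datamanager/services/tasks/matching"


def build_similar_request(container, model, type_name, record_xml):
    """Build (without sending) the POST request for the real time match service."""
    query = urlencode({"model": model, "type": type_name})
    return urllib.request.Request(
        f"{BASE}/similar/{container}?{query}",
        data=record_xml.encode("utf-8"),
        # Content-Type is an assumption; Accept selects the response format.
        headers={"Content-Type": "text/xml", "Accept": "application/json"},
        method="POST",
    )


def golden_ids(response_text):
    """Return the ids of the similar records that are golden records."""
    items = json.loads(response_text)["similars"]["items"]
    return [item["id"] for item in items if item["golden"]]


request = build_similar_request(
    "Customer", "Customer", "Customer",
    "<Customer><id>1</id><fname>Peter</fname><lname>Smith</lname></Customer>",
)

sample_response = """{"similars": {"items": [
  {"confidence": 1, "golden": false, "id": 4},
  {"confidence": 0.9749999880790711, "golden": true,
   "id": "602a75da-5f7b-4df3-85c2-2a3dca90249c"}
]}}"""
ids = golden_ids(sample_response)
```

To actually call the service, the request object could be passed to urllib.request.urlopen once the credentials required by your MDM server are added.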
"Explain match" - REST API description

Submit records for match simulation

POST http://localhost:9000/datamanager/services/tasks/matching/explain/?model=TestDataModel&type=TestRecordTypeName

The POST body must include several well-formed XML fragments.

For example, if you use the URL http://localhost:9000/datamanager/services/tasks/matching/explain/?model=Customer&type=Customer , you should pass a POST body with XML fragments that comply with the definition of Customer type in the Customer data model. The submitted records will be compared with each other (not with the records in the staging area of the Customer container).

Example of body:

<Customer>

  <id>1</id>

  <fname>Peter</fname>

  <lname>Smith</lname>

</Customer>

<Customer>

  <id>2</id>

  <fname>Peter2</fname>

  <lname>Smith2</lname>

</Customer>

<Customer>

  <id>3</id>

  <fname>Peter3</fname>

  <lname>Smith3</lname>

</Customer>

Explain how existing records were matched

POST http://localhost:9000/datamanager/services/tasks/matching/explain/ContainerName/records/?type=TestRecordTypeName

The body of the POST request must contain a line-delimited list of ids of records of type 'TestRecordTypeName'. The list must contain at least one id, although the operation only makes sense with two or more. The returned document is the same as above.

For example, if you wish to get an explain for records with id '1', '2' and '3', you should pass the following POST body to the service:

1

2

3

with each id separated by a line feed character.

Explain how an existing group was matched

GET http://localhost:9000/datamanager/services/tasks/matching/explain/ContainerName/groups/?type=TestRecordTypeName&group=1234

This service describes why members of a group were grouped together. In the example URL, the service returns why records of type 'TestRecordTypeName' and in group '1234' were matched together. The returned document is the same as above.
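
The two explain operations on existing data can be prepared from Python as sketched below, using only the standard urllib module. The helper names are hypothetical and authentication is omitted because the article does not describe it. The sketch only builds the request and the URL, it does not contact a server.

```python
import urllib.request
from urllib.parse import urlencode

# Base URL taken from the samples in this article.
BASE = "http://localhost:9000/datamanager/services/tasks/matching"


def build_explain_records_request(container, type_name, ids):
    """Build the POST request that explains how existing records were matched."""
    # The body is a line-delimited list of ids, one id per line.
    body = "\n".join(str(record_id) for record_id in ids)
    return urllib.request.Request(
        f"{BASE}/explain/{container}/records/?{urlencode({'type': type_name})}",
        data=body.encode("utf-8"),
        headers={"Accept": "application/json"},
        method="POST",
    )


def build_explain_group_url(container, type_name, group):
    """Build the GET URL that explains how an existing group was matched."""
    query = urlencode({"type": type_name, "group": group})
    return f"{BASE}/explain/{container}/groups/?{query}"


records_request = build_explain_records_request(
    "ContainerName", "TestRecordTypeName", [1, 2, 3]
)
group_url = build_explain_group_url("ContainerName", "TestRecordTypeName", 1234)
```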