Creating the match rule to group similar records

Data Stewardship

Version

7.3

Language

English

Product

Talend Big Data

Talend Big Data Platform

Talend Data Fabric

Talend Data Integration

Talend Data Management Platform

Talend Data Services Platform

Talend ESB

Talend MDM Platform

Talend Real-Time Big Data Platform

Module

Talend Data Stewardship

Talend Studio

Last publication date

2024-02-21

Configure the tMatchGroup component to group potential duplicates using matching algorithms. The component uses group identifiers (the GID column) to mark the records that belong together.

Procedure

Double-click tMatchGroup to open the configuration wizard where you can define the match rule.
In the Key Definition table, define which match algorithms to use and on which columns. Similarly, in the Blocking Selection table, select which column to use as a blocking value, so that only records sharing that value are compared and fewer pairs need to be examined.
For further information, see tMatchGroup.
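To illustrate why a blocking value reduces the number of pairs to examine, here is a minimal Python sketch. It is not how tMatchGroup is implemented; the `candidate_pairs` helper, the sample records, and the `city` blocking column are all hypothetical and serve only to show the arithmetic.

```python
from collections import defaultdict
from itertools import combinations


def candidate_pairs(records, blocking_key=None):
    """Return the record pairs a matcher would have to compare.

    Hypothetical helper: without a blocking key every record is paired
    with every other one; with a blocking key only records that share
    the same blocking value are paired.
    """
    if blocking_key is None:
        return list(combinations(records, 2))

    # Partition the records by their blocking value.
    blocks = defaultdict(list)
    for record in records:
        blocks[record[blocking_key]].append(record)

    # Pair records only inside each block.
    pairs = []
    for block in blocks.values():
        pairs.extend(combinations(block, 2))
    return pairs


# Made-up sample data, for illustration only.
records = [
    {"id": 1, "name": "Ann Lee", "city": "Paris"},
    {"id": 2, "name": "Anne Lee", "city": "Paris"},
    {"id": 3, "name": "Bob Ray", "city": "Lyon"},
    {"id": 4, "name": "Rob Ray", "city": "Lyon"},
    {"id": 5, "name": "Cy Moss", "city": "Nice"},
    {"id": 6, "name": "Si Moss", "city": "Nice"},
]

all_pairs = candidate_pairs(records)
blocked_pairs = candidate_pairs(records, blocking_key="city")
print(len(all_pairs), len(blocked_pairs))  # prints "15 3"
```

With six records, comparing everything costs 6 × 5 / 2 = 15 comparisons, while three blocks of two records cost only 3. The trade-off is that records with different blocking values are never compared, so the blocking column should be one that true duplicates are expected to share.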
Click the Chart button to display the matching results in the wizard, and then click OK.
In the component properties, click Advanced settings and make sure the Sort output data by GID check box is selected.

Note: If this option is not enabled, potential duplicates could be grouped in different tasks when loaded to Talend Data Stewardship.
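The effect described in the note can be pictured with a small Python sketch. It rests on an assumption the documentation does not state: that a consumer builds one task from each run of consecutive rows sharing a GID. The `consecutive_groups` helper and the sample rows are hypothetical.

```python
from itertools import groupby
from operator import itemgetter


def consecutive_groups(rows):
    """Group adjacent rows that share a GID and return their ids.

    Assumption for illustration: a streaming consumer starts a new
    group whenever the GID changes from one row to the next.
    """
    return [
        [row["id"] for row in group]
        for _, group in groupby(rows, key=itemgetter("GID"))
    ]


# Made-up rows: records 1 and 3 belong to the same group g1.
rows = [
    {"id": 1, "GID": "g1"},
    {"id": 2, "GID": "g2"},
    {"id": 3, "GID": "g1"},
]

unsorted_groups = consecutive_groups(rows)
sorted_groups = consecutive_groups(sorted(rows, key=itemgetter("GID")))

print(unsorted_groups)  # [[1], [2], [3]]  -> g1 is split in two
print(sorted_groups)    # [[1, 3], [2]]    -> g1 stays together
```

Under that assumption, unsorted output splits the g1 duplicates into separate groups, whereas sorting by GID keeps them adjacent.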
Double-click tMap to open its editor.
Map the input data flow to the output flow, and map the GID and MASTER columns to TDS_GID and TDS_MASTER respectively.
For further information about tMap, see tMap Standard properties.
When data comes from a single source, enter the source name for the TDS_SOURCE column in the right-hand table (CRM in this example). Make sure that the source name does not contain dots and does not start with a dollar sign.
If you do not specify a source name, Source 1, Source 2 and so on are added by default.
If you need to store the matching results in an external system, map GID to TDS_EXTERNAL_ID.
This helps you reference a given task from the external system.
When data comes from different sources and the input schema has a column that holds the source names, map that column to TDS_SOURCE.

If you do not specify the source names, Source 1, Source 2 and so on are added by default.

If you specify the same name for multiple sources of the same task, the suffixes -1, -2 and so on are added by default. For example, if you create a task with three sources named SAP, the source names in Talend Data Stewardship are displayed as SAP, SAP - 1, SAP - 2.

You can also dynamically compute the trust scores of specific records if you provide them at the task source level and map them to the TDS_RATING output column in tDataStewardshipTaskOutput. These trust scores override the scores defined at campaign creation, if any.

Make sure that the source names in the input file do not contain dots and do not start with a dollar sign.
Click OK.
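The column mapping and the source-name constraints from the steps above can be summarised in a short Python sketch. This is not tMap code: `validate_source_name` and `to_task_row` are hypothetical helpers, and passing the remaining input columns through unchanged is an assumption. Only the column names (GID, MASTER, TDS_GID, TDS_MASTER, TDS_SOURCE) and the two naming rules come from the procedure.

```python
def validate_source_name(name):
    """Check the two naming rules stated in the procedure.

    Hypothetical helper: a source name must not contain dots and
    must not start with a dollar sign.
    """
    if "." in name:
        raise ValueError(f"source name must not contain dots: {name!r}")
    if name.startswith("$"):
        raise ValueError(f"source name must not start with '$': {name!r}")
    return name


def to_task_row(row, source="CRM"):
    """Mimic the mapping: GID -> TDS_GID, MASTER -> TDS_MASTER.

    Other input columns are copied as they are (an assumption), and
    TDS_SOURCE is set to the validated source name.
    """
    out = {k: v for k, v in row.items() if k not in ("GID", "MASTER")}
    out["TDS_GID"] = row["GID"]
    out["TDS_MASTER"] = row["MASTER"]
    out["TDS_SOURCE"] = validate_source_name(source)
    return out


# Made-up input row, for illustration only.
mapped = to_task_row({"name": "Ann Lee", "GID": "g1", "MASTER": True})
print(mapped)
```

Calling `to_task_row(row, source="crm.eu")` or `source="$crm"` raises a `ValueError`, which mirrors the restriction on dots and a leading dollar sign.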