Creating the match rule to group similar records - 7.3

Data Stewardship

Version: 7.3
Language: English
Product: Talend Big Data, Talend Big Data Platform, Talend Data Fabric, Talend Data Integration, Talend Data Management Platform, Talend Data Services Platform, Talend ESB, Talend MDM Platform, Talend Real-Time Big Data Platform
Module: Talend Data Stewardship, Talend Studio
Content:
  Data Governance > Third-party systems > Data Stewardship components
  Data Quality and Preparation > Third-party systems > Data Stewardship components
  Design and Development > Third-party systems > Data Stewardship components
Last publication date: 2024-02-21

Configure the tMatchGroup component to group potential duplicates based on matching algorithms. The component uses group identifiers (the GID column) to mark the records that belong to the same group.
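
To make the role of the group identifier concrete, the sketch below is a plain Python illustration, not Talend code. It assumes hypothetical output rows that carry the GID and MASTER columns named in this procedure, with invented values, and shows how rows sharing a GID form one group of potential duplicates.

```python
from collections import defaultdict

# Hypothetical rows standing in for tMatchGroup output; every value is invented.
# GID is the group identifier and MASTER flags the master record of a group.
rows = [
    {"name": "Jon Smith", "GID": "g1", "MASTER": True},
    {"name": "Anna Lee", "GID": "g2", "MASTER": True},
    {"name": "John Smith", "GID": "g1", "MASTER": False},
]


def group_by_gid(records):
    """Collect the records that share a GID into one list per group."""
    groups = defaultdict(list)
    for record in records:
        groups[record["GID"]].append(record)
    return dict(groups)


# Sorting by GID keeps the members of a group next to each other.
sorted_rows = sorted(rows, key=lambda record: record["GID"])
groups = group_by_gid(sorted_rows)

for gid, members in groups.items():
    print(gid, [member["name"] for member in members])
```

Keeping the members of a group adjacent is the reason the procedure asks for the output to be sorted by GID before the data is loaded into Talend Data Stewardship.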

Procedure

  1. Double-click tMatchGroup to open the configuration wizard where you can define the match rule.
  2. In the Key Definition table, define which match algorithms to use and on which columns. Similarly, in the Blocking Selection table, select the column to use as a blocking value in order to reduce the number of record pairs that need to be compared.
    For further information, see tMatchGroup.
  3. Click the Chart button to display the matching results in the wizard, and then click OK.
  4. In the component properties, click Advanced settings and make sure the Sort output data by GID check box is selected.
    Note: If this option is not enabled, potential duplicates could be grouped in different tasks when loaded to Talend Data Stewardship.
  5. Double-click tMap to open its editor.
  6. Map the input data flow to the output flow, and map the GID and MASTER columns to TDS_GID and TDS_MASTER respectively.
    For further information about tMap, see tMap Standard properties.
  7. When data comes from a single source, enter the source name for the TDS_SOURCE column in the right-hand table (CRM in this example). Make sure that the source name does not contain dots and does not start with a dollar sign.
    If you do not specify a source name, Source 1, Source 2 and so on are added by default.
  8. If you need to store the matching results in an external system, map GID to TDS_EXTERNAL_ID.
    This helps you reference a given task from the external system.
  9. When data comes from different sources and the input schema has a column that holds the source names, map that column to TDS_SOURCE.

    If you do not specify the source names, Source 1, Source 2 and so on are added by default.

    If you specify the same name for multiple sources of the same task, the suffixes -1, -2 and so on are added by default. For example, if you create a task with three sources named SAP, the source names are displayed in Talend Data Stewardship as SAP, SAP - 1 and SAP - 2.

    You can also dynamically compute the trust scores of specific records if you provide them at the task source level and map them to the TDS_RATING output column in tDataStewardshipTaskOutput. These trust scores override any scores defined at campaign creation.

    Make sure that the source names in the input file do not contain dots and that they do not start with a dollar sign.

  10. Click OK.
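
The two constraints on source names in steps 7 and 9 can be checked before the data reaches the Job. The following is a minimal sketch in plain Python, not part of Talend; the function names are hypothetical, and it encodes only the two rules stated above (no dots, no leading dollar sign).

```python
def is_valid_source_name(name: str) -> bool:
    """Return True if the name contains no dot and does not start with '$'.

    Empty names are not rejected here, because the procedure states that
    default names are assigned when no source name is specified.
    """
    return "." not in name and not name.startswith("$")


def invalid_source_names(names):
    """Return the distinct invalid names in the order they first appear."""
    invalid = []
    for name in names:
        if not is_valid_source_name(name) and name not in invalid:
            invalid.append(name)
    return invalid


# Invented sample values for illustration only.
candidates = ["CRM", "crm.eu", "$SAP", "crm.eu"]
print(invalid_source_names(candidates))
```

Running such a check on the column that holds the source names, for example while preparing the input file, surfaces the offending values before they are mapped to TDS_SOURCE.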