Grouping the duplicate records

Deduplication

author
Talend Documentation Team
EnrichVersion
6.5
EnrichProdName
Talend Big Data
Talend Big Data Platform
Talend Data Fabric
Talend Data Integration
Talend Data Management Platform
Talend Data Services Platform
Talend ESB
Talend MDM Platform
Talend Open Studio for Big Data
Talend Open Studio for Data Integration
Talend Open Studio for ESB
Talend Open Studio for MDM
Talend Real-Time Big Data Platform
task
Data Governance > Third-party systems > Data Quality components > Deduplication components
Data Quality and Preparation > Third-party systems > Data Quality components > Deduplication components
Design and Development > Third-party systems > Data Quality components > Deduplication components
EnrichPlatform
Talend Studio

Procedure

  1. Right-click tMatchGroup to open its contextual menu and select Configuration Wizard.
    In the wizard, you can see what your groups look like and adjust the component settings until similar records are matched correctly.
  2. Click the plus button under the Key Definition table to add one row.
  3. In the Input Key Attribute column of this row, select acctName. This column then becomes the reference used to identify duplicates in the input data.
  4. In the Matching Function column, select the Jaro-Winkler matching algorithm.
  5. In the Match threshold field, enter the minimum similarity score at which two record fields are considered a match. In this example, type in 0.6.
  6. Click Chart to execute this matching rule and show the result in this wizard.
    If the input records are not all put into one single group, replace 0.6 with a smaller value and click Chart again. Repeat until all four records are in the same group.
    The Job in this scenario puts four similar records into one single duplicates group so that tRuleSurvivorship can create one survivor from them. This simple sample gives you a clear picture of how tRuleSurvivorship works with other components to create the best data. In a real-world case, however, you may need to process much more data with more complex duplicate situations, and the data will therefore be put into many more groups.
  7. Click OK to close the Configuration wizard. The Basic settings view of the tMatchGroup component is automatically filled with the parameters you have set.
    For further information about the Configuration wizard, see Configuration wizard.
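
The effect of the Match threshold set in the procedure above can be illustrated outside the Studio. The following is a minimal Python sketch of a standard Jaro-Winkler similarity combined with a simple greedy grouping on the acctName column. It is not the tMatchGroup implementation: the function names, the grouping strategy (each record is compared to the first record of each existing group), and the four sample account names are assumptions made for illustration only.

```python
def jaro(s1, s2):
    """Jaro similarity between two strings, in the range 0.0 to 1.0."""
    if s1 == s2:
        return 1.0
    len1, len2 = len(s1), len(s2)
    if len1 == 0 or len2 == 0:
        return 0.0
    # Characters count as matching only within this distance of each other.
    window = max(max(len1, len2) // 2 - 1, 0)
    matched1 = [False] * len1
    matched2 = [False] * len2
    matches = 0
    for i, ch in enumerate(s1):
        lo = max(0, i - window)
        hi = min(len2, i + window + 1)
        for j in range(lo, hi):
            if not matched2[j] and s2[j] == ch:
                matched1[i] = matched2[j] = True
                matches += 1
                break
    if matches == 0:
        return 0.0
    # Count matched characters that appear in a different order.
    k = 0
    out_of_order = 0
    for i in range(len1):
        if matched1[i]:
            while not matched2[k]:
                k += 1
            if s1[i] != s2[k]:
                out_of_order += 1
            k += 1
    t = out_of_order / 2
    return (matches / len1 + matches / len2 + (matches - t) / matches) / 3


def jaro_winkler(s1, s2, prefix_scale=0.1):
    """Jaro similarity boosted for a common prefix of up to 4 characters."""
    j = jaro(s1, s2)
    prefix = 0
    for a, b in zip(s1[:4], s2[:4]):
        if a != b:
            break
        prefix += 1
    return j + prefix * prefix_scale * (1 - j)


def group_by_key(records, key, threshold):
    """Greedy grouping (illustrative assumption, not Talend's algorithm):
    a record joins the first group whose first record scores >= threshold."""
    groups = []
    for rec in records:
        for group in groups:
            if jaro_winkler(group[0][key], rec[key]) >= threshold:
                group.append(rec)
                break
        else:
            groups.append([rec])
    return groups


# Hypothetical sample data; the scenario's real input values are not shown here.
records = [
    {"acctName": "GlobalTech Inc"},
    {"acctName": "GlobalTech Inc."},
    {"acctName": "Global Tech Inc"},
    {"acctName": "GlobalTech Incorporated"},
]

if __name__ == "__main__":
    for groups in (group_by_key(records, "acctName", 0.6),
                   group_by_key(records, "acctName", 1.0)):
        print(len(groups), "group(s)")
```

With these sample names, a threshold of 0.6 places all four records in one group, while a threshold of 1.0 keeps every record separate. This mirrors the advice in step 6: lowering the threshold merges more records into the same group.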