Configuring the grouping of the output data - 7.1

Identification

author
Talend Documentation Team
EnrichVersion
7.1
EnrichProdName
Talend Big Data
Talend Big Data Platform
Talend Data Fabric
Talend Data Integration
Talend Data Management Platform
Talend Data Services Platform
Talend ESB
Talend MDM Platform
Talend Open Studio for Big Data
Talend Open Studio for Data Integration
Talend Open Studio for ESB
Talend Open Studio for MDM
Talend Real-Time Big Data Platform
EnrichPlatform
Talend Studio

Procedure

  1. Click the tMatchGroup component, and then in its basic settings click the Edit schema button to view the input and output columns and, if needed, modify the output schema.
    The output schema of this component contains standard output columns that are read-only. For more information, see tMatchGroup Standard properties.
  2. Click OK to close the dialog box.
  3. Double-click the tMatchGroup component to display its Configuration Wizard and define the component properties.
    To add the fixed output column MATCHING_DISTANCES, which gives the distance details for each column, click the Advanced settings tab and select the Output distance details check box. For more information, see tMatchGroup Standard properties.
  4. In the Key definition table, click the plus button to add the columns on which you want to perform the matching operation: FirstName and LastName in this scenario.
  5. Click in the first and second cells of the Matching Function column and select from the list the algorithm to use for the matching operation: Jaro-Winkler in this example.
  6. Click in the first and second cells of the Weight column and set a numerical weight for each column used as a key attribute.
  7. In the Match threshold field, enter the match probability threshold. Two data records match when their match probability is above this value.
  8. Click the plus button below the Blocking Selection table to add a line in the table, then click in the line and select from the list the column you want to use as a blocking value, T_GEN_KEY in this example.
    Using a blocking value reduces the number of record pairs that need to be examined. The input data is partitioned into exhaustive blocks based on the functional key, and comparison is restricted to record pairs within each block.
  9. Click the Chart button in the top right corner of the wizard to execute the Job with the defined configuration and display the matching results directly in the wizard.
    The matching chart gives an overall picture of the duplicates in the analyzed data. The matching table lists the items in each group and colors each group to match its color in the matching chart.
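
To make the roles of the key definition, weights, match threshold and blocking value more concrete, the following Python sketch imitates the configuration from this procedure outside Talend Studio. It is an illustration only, not the tMatchGroup implementation. It assumes that the overall match score is a weighted average of the per-column similarities, it uses the standard library's difflib.SequenceMatcher as a stand-in for Jaro-Winkler, and the sample records, weight values, threshold value and function names are all hypothetical.

```python
from collections import defaultdict
from difflib import SequenceMatcher
from itertools import combinations

# Hypothetical sample records; T_GEN_KEY plays the role of the blocking column.
RECORDS = [
    {"id": 1, "FirstName": "John", "LastName": "Smith", "T_GEN_KEY": "S530"},
    {"id": 2, "FirstName": "Jon", "LastName": "Smith", "T_GEN_KEY": "S530"},
    {"id": 3, "FirstName": "Mary", "LastName": "Jones", "T_GEN_KEY": "J520"},
    {"id": 4, "FirstName": "John", "LastName": "Smith", "T_GEN_KEY": "J520"},
]

# Key definition: matching columns and their weights (illustrative values).
WEIGHTS = {"FirstName": 1.0, "LastName": 1.0}
# Match threshold (illustrative value).
MATCH_THRESHOLD = 0.8


def similarity(a, b):
    """Case-insensitive string similarity in [0, 1].

    Stand-in for Jaro-Winkler: this is difflib's ratio, not Jaro-Winkler.
    """
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def match_score(r1, r2, weights):
    """Weighted average of per-column similarities (assumed combination rule)."""
    total_weight = sum(weights.values())
    weighted = sum(w * similarity(r1[col], r2[col]) for col, w in weights.items())
    return weighted / total_weight


def block_records(records, key):
    """Partition records into blocks that share the same blocking value."""
    blocks = defaultdict(list)
    for record in records:
        blocks[record[key]].append(record)
    return blocks


def matching_pairs(records, key, weights, threshold):
    """Compare record pairs within each block only and keep those above the threshold."""
    pairs = []
    for block in block_records(records, key).values():
        for r1, r2 in combinations(block, 2):
            score = match_score(r1, r2, weights)
            if score > threshold:
                pairs.append((r1["id"], r2["id"], round(score, 3)))
    return pairs


if __name__ == "__main__":
    for pair in matching_pairs(RECORDS, "T_GEN_KEY", WEIGHTS, MATCH_THRESHOLD):
        print(pair)
```

With this sample data, only records 1 and 2 are reported as a matching pair. Records 1 and 4 carry identical names but different blocking values, so they are never compared: blocking reduces the four records from six possible comparisons to two, at the cost of missing duplicates that fall into different blocks.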