Scenario 1: Grouping output data in separate flows according to the minimal distance computed in each record - 6.1

This scenario describes a basic Job that compares columns in the input file using the Jaro-Winkler matching method on the lname and fname columns and the Jaro matching method on the address1 column. It then groups the output records in three output flows (see the sketch after this list):

  • Uniques: lists the records whose group score (the minimal distance computed in the record) is equal to 1.

  • Matches: lists the records whose group score (the minimal distance computed in the record) is higher than the threshold you define in the Confidence threshold field.

  • Suspects: lists the records whose group score (the minimal distance computed in the record) is below the threshold you define in the Confidence threshold field.
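
This routing can be pictured with the following minimal sketch. It is illustrative Java only, not code generated by Talend Studio; the flow names and the handling of the exact boundary values are assumptions made for this sketch.

    // Illustrative sketch only: hypothetical names, not code generated by Talend Studio.
    public class SeparateOutputRouting {

        enum OutputFlow { UNIQUES, MATCHES, SUSPECTS }

        // Routes one record by its group score (the minimal distance computed in the
        // record) against the confidence threshold you define. Boundary handling is
        // an assumption made for this sketch.
        static OutputFlow route(double groupScore, double confidenceThreshold) {
            if (groupScore == 1.0) {
                return OutputFlow.UNIQUES;   // Unique rows: group score equal to 1
            }
            if (groupScore > confidenceThreshold) {
                return OutputFlow.MATCHES;   // Confident groups: above the threshold
            }
            return OutputFlow.SUSPECTS;      // Uncertain groups: below the threshold
        }

        public static void main(String[] args) {
            System.out.println(route(1.0, 0.9));   // UNIQUES
            System.out.println(route(0.97, 0.9));  // MATCHES
            System.out.println(route(0.82, 0.9));  // SUSPECTS
        }
    }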

For another scenario that groups the output records in one single output flow, see Scenario 2: Comparing columns and grouping in the output flow duplicate records that have the same functional key.

Setting up the Job

  1. Drop the following components from the Palette onto the design workspace: tFileInputExcel, tMatchGroup, and three tLogRow components.

  2. Connect tFileInputExcel to tMatchGroup using the Main link.

  3. Connect tMatchGroup to the three tLogRow components using the Unique rows, Confident groups and Uncertain groups links.

    Warning

    To be able to set three different output flows for the processed records, you must first select the Separate output check box in the Advanced settings view of the tMatchGroup component. For further information, see the section about configuring the tMatchGroup component.

Configuring the input component

The main input file contains eight columns: account_num, lname, fname, mi, address1, city, state_province and postal_code. The data in this input file has problems such as duplication, names spelled differently or incorrectly, and different information for the same customer.

You can create the input file used in this scenario by executing the c0 and c1 Jobs included in the data quality demo project, TDQEEDEMOJAVA, which you can import from the login window of your Talend Studio. For further information, see the Talend Studio User Guide.

  1. In the Basic settings view of tFileInputExcel, fill in the File Name field by browsing to the input file and set the other properties if they are not stored in the Repository.

  2. Create the schema through the Edit Schema button, if the schema is not already stored in the Repository. Remember to set the data type in the Type column.

Configuring the tMatchGroup component

  1. Double-click tMatchGroup to display the Basic settings view and define the component properties.

  2. Click Sync columns to retrieve the schema from the preceding component.

  3. Click the Edit schema button to view the input and output schema and make any necessary modifications in the output schema.

    In the output schema of this component there are a few standard output columns that are read-only. For more information, see tMatchGroup properties.

  4. Click OK to close the dialog box.

  5. Click Preview to open the configuration wizard and define the component configuration and the match rule(s).

    You can use the configuration wizard to import match rules created and tested in the studio and stored in the repository, and use them in your match Jobs. For further information, see Importing match rules from the studio repository.

  6. Define the first match rule as follows:

    • In the Key definition table, click the [+] button to add the columns on which you want to do the matching operation, lname and fname, to the list.

      Note

      When you select a date column on which to apply an algorithm or a matching algorithm, you can decide what to compare in the date format.

      For example, if you want to only compare the year in the date, in the component schema set the type of the date column to Date and then enter "yyyy" in the Date Pattern field. The component then converts the date format to a string according to the pattern defined in the schema before starting a string comparison.

    • Click in the cell of the Matching type column and select from the list Jaro-Winkler as the method to be used for the matching operation.

      If you select custom as the matching type, you must set in the Custom Matcher Class column the path to the custom class (external matching algorithm) you need to use. You define this path yourself in the library file (.jar file).

    • Click in the cell of the Confidence Weight column to set the numerical weights for the two columns used as key attributes.

    • Click in the cell of the Handle Null column and select the null operator you want to use to handle null attributes in the columns. In this example, select Null Match None in order to have matching results where null values have minimal effect.

    • Set the match probability in the Match Interval field.

  7. Follow the same procedure as in the above step to define the second match rule.

    Set the address1 column as an input attribute and select Jaro as the matching type. Select Null Match None as the null operator. Finally, set the match probability, which can be different from the one set for the first rule.

  8. Set the Hide groups of less than parameter in order to decide which groups to show in the result chart and matching table. This parameter enables you to hide groups that are smaller than the specified size.

  9. Click the Advanced settings tab and set the advanced parameters for the tMatchGroup component as follows:

    • Select the Separate output check box.

      The component will have three separate output flows: Unique rows, Confident groups and Uncertain groups.

      If this check box is not selected, the tMatchGroup component will have only one output flow where it groups all output data. For an example scenario, see Scenario 2: Comparing columns and grouping in the output flow duplicate records that have the same functional key.

    • Select the Sort the output data by GID check box to sort the output data by their group identifier.

    • Select the Output distance details and Display detailed labels check boxes.

      The component will output the MATCHING_DISTANCES column. This column provides the distances between the input and the master records and also gives the names of the columns against which the records are matched.

  10. Click the Chart button in the wizard to execute the Job in the defined configuration and have the matching results directly in the wizard.

    The matching chart gives a global picture about the duplicates in the analyzed data. The matching table indicates the details of items in each group and colors the groups in accordance with their color in the matching chart.

    The Job conducts an OR match operation on the records. It evaluates the records against the first rule, and the records that match are not evaluated against the second rule. The MATCHING_DISTANCES column allows you to understand which rule has been used on which records. In the yellow data group, for example, the Amole Sarah record is matched according to the second rule, which uses address1 as a key attribute, whereas the other records in the group are matched according to the first rule, which uses lname and fname as key attributes. A simplified sketch of this two-rule evaluation is given after this procedure.

    You can set the Hide groups of less than parameter in order to decide which groups to show in the matching chart and table.
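
The following minimal sketch illustrates how two match rules like the ones configured above could be evaluated in an OR fashion. It is illustrative Java only, not code generated by Talend Studio, and it makes several assumptions: Apache Commons Text (commons-text 1.8 or later) is on the classpath, Jaro-Winkler stands in for the plain Jaro algorithm of the second rule, attribute scores are combined as a weighted average (the component's exact formula may differ), null values score 0 in the spirit of Null Match None, and the match intervals 0.95 and 0.85 simply mirror the values discussed in this scenario.

    // Illustrative sketch only: NOT code generated by Talend Studio.
    // Assumes Apache Commons Text (org.apache.commons:commons-text, 1.8 or later) on the classpath.
    import org.apache.commons.text.similarity.JaroWinklerSimilarity;

    import java.text.SimpleDateFormat;
    import java.util.Date;

    public class MatchRulesSketch {

        private static final JaroWinklerSimilarity JARO_WINKLER = new JaroWinklerSimilarity();

        // Null Match None in spirit: a null attribute never contributes to a match.
        private static double similarity(String a, String b) {
            return (a == null || b == null) ? 0.0 : JARO_WINKLER.apply(a, b);
        }

        // Rule 1: Jaro-Winkler on lname and fname, combined with the confidence weights.
        // A weighted average is assumed here; the component's exact formula may differ.
        static double rule1Score(String lnameA, String lnameB, String fnameA, String fnameB,
                                 double lnameWeight, double fnameWeight) {
            double combined = lnameWeight * similarity(lnameA, lnameB)
                            + fnameWeight * similarity(fnameA, fnameB);
            return combined / (lnameWeight + fnameWeight);
        }

        // Rule 2: address1 only. The scenario uses the plain Jaro algorithm; Jaro-Winkler
        // stands in for it here because Commons Text has no separate Jaro class.
        static double rule2Score(String address1A, String address1B) {
            return similarity(address1A, address1B);
        }

        // OR evaluation: a pair of records that satisfies rule 1 is not evaluated
        // against rule 2.
        static boolean matches(String lnameA, String lnameB, String fnameA, String fnameB,
                               String addressA, String addressB) {
            double rule1Interval = 0.95;   // match interval of the first rule
            double rule2Interval = 0.85;   // match interval of the second rule
            if (rule1Score(lnameA, lnameB, fnameA, fnameB, 1.0, 1.0) >= rule1Interval) {
                return true;
            }
            return rule2Score(addressA, addressB) >= rule2Interval;
        }

        // Date attributes are compared as strings: the Date value is formatted with the
        // pattern declared in the schema, for example "yyyy" to compare only the year.
        static String dateKey(Date value, String pattern) {
            return new SimpleDateFormat(pattern).format(value);
        }

        public static void main(String[] args) {
            // Hypothetical sample values.
            boolean matched = matches("Amole", "Amole", "Sara", "Sarah",
                                      "12 High St.", "12 High Street");
            System.out.println("matched = " + matched);
            System.out.println("year key = " + dateKey(new Date(), "yyyy"));
        }
    }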

Finalizing the Job and executing it

  1. Double-click each of the tLogRow components to display the Basic settings view and define the component properties.

  2. Save your Job and press F6 to execute it.

    You can see that records are grouped together in three different groups. Each record is listed in one of the three groups according to the value of the group score, which is the minimal distance computed in the group.

    The identifier for each group, which is of String data type, is listed in the GID column next to the corresponding record. This identifier will be of the data type Long for Jobs that are migrated from older releases. To have the group identifier as String, you must replace the tMatchGroup component in the imported Job with tMatchGroup from the studio Palette.

    The number of records in each of the three output blocks is listed in the GRP_SIZE column and computed only on the master record. The MASTER column indicates with true or false whether the corresponding record is a master record. The SCORE column lists the calculated distance between the input record and the master record according to the Jaro-Winkler and Jaro matching algorithms. A minimal sketch of how these group-level values relate to one another is given after these steps.

    The Job evaluates the records against the first rule and the records that match are not evaluated against the second rule.

    All records whose group score is between the match interval, 0.95 or 0.85 depending on the applied rule, and the confidence threshold defined in the advanced settings of tMatchGroup are listed in the Suspects output flow.

    All records whose group score is above one of the match probabilities are listed in the Matches output flow.

    All records that have a group size equal to 1 are listed in the Uniques output flow.
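
The following minimal sketch shows how the group-level values described above relate to one another. It is illustrative Java only, not Talend's output model; the GroupedRecord class is hypothetical, and the assumption that the master record scores 1 against itself is made for this sketch.

    // Illustrative sketch only: hypothetical classes, not Talend's output schema objects.
    import java.util.Arrays;
    import java.util.List;

    public class GroupScoreSketch {

        static final class GroupedRecord {
            final String gid;      // GID: group identifier of String type, shared by the group
            final boolean master;  // MASTER flag
            final double score;    // SCORE: distance between this record and the master record

            GroupedRecord(String gid, boolean master, double score) {
                this.gid = gid;
                this.master = master;
                this.score = score;
            }
        }

        // GRP_SIZE: the number of records in the group.
        static int groupSize(List<GroupedRecord> group) {
            return group.size();
        }

        // Group score: the minimal distance (SCORE) computed in the group.
        static double groupScore(List<GroupedRecord> group) {
            return group.stream().mapToDouble(r -> r.score).min().orElse(1.0);
        }

        public static void main(String[] args) {
            // Hypothetical group; the master record is assumed to score 1 in this sketch.
            List<GroupedRecord> group = Arrays.asList(
                    new GroupedRecord("grp-01", true, 1.0),
                    new GroupedRecord("grp-01", false, 0.97),
                    new GroupedRecord("grp-01", false, 0.91));
            System.out.println("GRP_SIZE = " + groupSize(group)
                    + ", group score = " + groupScore(group));
        }
    }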

For another scenario that groups the output records in one single output flow based on a generated functional key, see Scenario 2: Comparing columns and grouping in the output flow duplicate records that have the same functional key.