T-Swoosh algorithm

This scenario describes a basic Job that compares columns in the input file using the Jaro-Winkler matching method on the lname and fname columns. It then groups the output records in output flows:

Uniques: lists the records which group size (minimal distance computed in the record) is equal to 1.
Matches: lists the records which group quality is greater than or equal to the threshold you define in the Confident match threshold field.
Suspects: lists the records which group quality is less than the threshold you define in the Confident match threshold field.

The group quality is the minimal distance computed among all record pairs of the group.

This scenario applies only to Talend Data Management Platform, Talend Big Data Platform, Talend Real-Time Big Data Platform, Talend MDM Platform, Talend Data Services Platform, Talend MDM Platform and Talend Data Fabric.

Attention: As the Apache Spark Batch component does not support the T-Swoosh algorithm, this scenario uses the Standard component.

Setting up the Job

Procedure

Drop the following components from the Palette onto the design workspace: tFixedFlowInput, tMatchGroup and three tLogRow components.
Connect tFixedFlowInput to tMatchGroup using the Main link.
Click the tMatchGroup component and select the Separate output check box in the Advanced settings view.
Connect tMatchGroup to the three tLogRow components using the Uniques, Matches and Suspects links.

Configuring the input component

About this task

The input data contain seven columns: lname, fname, middle_name, address, city, state_province and postal_code. The data have problems such as duplication, names spelled differently or wrongly, different information for the same customer.

Procedure

Double-click the tFixedFlowInput component to display its Basic settings view.
Select Built-in and click the […] button next to Edit Schema.
Define the seven columns and click OK.
Select Use Inline Content(delimited file).
Fill in the Row Separator and Field Separator fields.
Enter the input data in the Content field.

Configuring the tMatchGroup component

Procedure

Click the tMatchGroup component to display its Basic settings view.
From the Matching Algorithm list, select T-Swoosh.
Click Sync columns to retrieve the schema from the preceding component.
Click the Edit schema button to view the input and output schema and do any modifications in the output schema, if necessary.
In the output schema of this component there are output standard columns that are read-only. For more information, see the tMatchGroup Standard properties.
Click OK to close the dialog box.
Click the Advanced settings tab and select the following check boxes:
1. Separate output
  
  The component will have three separate output flows: Uniques, Matches and Suspects.
  
  If this check box is not selected, the tMatchGroup component will have only one output flow where it groups all output data. For an example scenario, see Comparing columns and grouping in the output flow duplicate records that have the same functional key in Identification section.
2. Sort the output data by GID
3. Output distance details
4. Display detailed labels
5. Deactivate matching computation when opening the wizard
Click the […] button next to Configure match rules to define the component configuration and the match rule(s).
You can use the configuration wizard to import match rules created and tested in Talend Studio and stored in the repository, and use them in your match Jobs. For further information, see Importing match rules from the repository.

It is important to import or define the same type of the rule selected in the basic settings of the component. Otherwise the Job runs with default values for the parameters which are not compatible between the two algorithms.
Define the match rule as the following:
1. In the Key definition table, click the [+] button to add to the list the column(s) on which you want to do the matching operation, fname and lname.
  
  Information noteNote: When you select a date column on which to apply an algorithm or a matching algorithm, you can decide what to compare in the date format.
  For example, if you want to only compare the year in the date, in the component schema set the type of the date column to Date and then enter "yyyy" in the Date Pattern field. The component then converts the date format to a string according to the pattern defined in the schema before starting a string comparison.
2. Click in the Matching Function column and select from the list Jaro-Winkler as the method to be used for the matching operation.
  
  If you select custom as a matching type, you must set in the Custom Matcher column the path pointing to the custom class (external matching algorithm) you need to use. This path is defined by yourself in the library file (.jar file).
3. From the Tokenized measure list, select No.
4. Click in the cell of the Threshold column and enter 0.7 for fname and 0.4 for lname.
5. Click in the cell of the Confidence Weight column to set the numerical weights for the two columns used as key attributes: 1 for fname and 4 for lname.
6. Click in the cell of the Handle Null column and select the null operator you want to use to handle null attributes in the columns. In this example, select Null Match NONE in order to have matching results where null values have minimal effect.
7. Select Most common in the Survivorship Function.
Follow the same procedure in the above step to define the second match rule and set the parameters as follows:
1. Click the [+] button (Duplicate Rule).
2. Input Key Attribute: address
3. Matching Function: Jaro
4. Tokenized Measure: No
5. Threshold: 0.8
6. Confidence Weight: 1
7. Handle Null: Null Match NONE
8. Survivorship Function: Most common
Set the Match Threshold parameter of each Match Rule to 0.8.
Set the Hide groups of less than parameter to 2. This parameter enables you to hide groups of small size.
Click the Chart button to execute the Job in the defined configuration and have the matching results directly in the wizard.

The matching chart gives a global picture about the duplicates in the analyzed data. The matching table indicates the details of items in each group and colors the groups in accordance with their color in the matching chart.

The Job conducts an OR match operation on the records. It evaluates the records against the rule. The MATCHING_DISTANCES column allows you to understand which rule has been used on what records.
For example, in the second data group (brick red), the last Amic record is matched according to the second rule that uses address1 as a key attribute, whereas the other records in the group are matched according to the first rule which uses the lname and fname as key attributes.

As you can see in this example, the value in the GRP_QUALITY column can be less than the Match Threshold parameter. That is because a group is created from record pairs with a matching score greater than or equal to the Match Threshold but the records are not all compared to each other; whereas GRP_QUALITY takes into account all record pairs in the group.

Finalizing the Job and executing it

Procedure

Double-click each tLogRow component to display the Basic settings view.
Select Table (print values in cells of a table).
Save your Job and press F6 to execute it.

Results

You can see that records are grouped together in three different groups. Each record is listed in one of the three groups according to the value of the group score which is the minimal distance computed in the group.

The identifier for each group, which is of String data type, is listed in the GID column next to the corresponding record. This identifier is of the data type Long for Jobs that are migrated from older releases. To have the group identifier as String, replace the tMatchGroup component in the imported Job with tMatchGroup from the Talend Studio Palette.

The number of records in each of the three output blocks is listed in the GRP_SIZE column and computed only on the master record. The MASTER column indicates with true or false whether the corresponding record is a master record or not. The SCORE column lists the calculated distance between the input record and the master record according to the Jaro-Winkler and Jaro matching algorithms.

The Job evaluates the records against the first rule and the records that match are not evaluated against the second rule.

All records with a group score between the match interval, 0.95 or 0.85 depending on the applied rule, and the confidence threshold defined in the advanced settings of tMatchGroup are listed in the Suspects output flow.

All records with a group score greater than or equal to one of the match probabilities are listed in the Matches output flow.

All records with a group size equal to 1 is listed in the Uniques output flow.

For another scenario that groups the output records in one single output flow, see Comparing columns and grouping in the output flow duplicate records that have the same functional key in the Identification section.

Did this page help you?

If you find any issues with this page or its content – a typo, a missing step, or a technical error – please let us know!

Leave your feedback here