Scenario 2: Searching for matched reference entries for two input columns - 6.1

Talend Components Reference Guide

In this scenario, you slightly modify the Job from the previous scenario so that it searches two synonym indexes for input data coming from two columns.

In addition to the index used earlier, a second index is used. It holds the last name data, for example: Correia, Corria, Toum, Toom, toom, Walker, Waker.

To replicate this scenario, open the Job created in the previous section and proceed as follows:

Configuring the components

  1. Double-click tFixedFlowInput to open its Basic settings view.

  2. Next to Edit schema, click the [...] button to open the [Schema] dialog box, and add a second column, LASTNAME, next to the FIRSTNAME column you defined in the previous scenario.

    When done, click OK to validate this change and close the dialog box.

  3. In the Content field of the Mode area, add more first name and last name data to make the input data read as follows:

    Kristof;Toum
    Chris;Toom
    Tony;Walker
    Anton;Correia
    Jim;Correia
    Jim;Walker

  4. Double-click tSynonymSearch to open its Basic settings view.

  5. Click Sync columns to synchronize the columns of this component with those of the preceding one, and click Yes to propagate the changes to the next component when prompted.

  6. Click the [...] button next to Edit schema to open the [Schema] dialog box, and add two columns to the output schema: matched_fname and matched_lname.

    These columns will hold the matched reference entries in the output flow.

    When done, click OK to validate the setting and accept propagating the changes when prompted.

  7. In the Limit of each group field, type in 10 to replace the value you defined in the previous scenario.

  8. Under the Columns to search table, click the [+] button to add a second row and define the parameters as follows:

    • In the Input column column, select LASTNAME from the drop-down list.

    • In the Reference output column column, select matched_lname from the drop-down list.

    • In the Index path column, type in, between quotation marks, the path to the synonym index holding the last name entries.

    • In the Search mode column, select Match exact for both input columns. This will match the exact input word against an exact index word.

    • In the Score threshold column, enter 0.9 to filter the results and list only the terms with a higher similarity score.

    • Leave the Min similarity and Word distance columns unchanged: they apply only to the fuzzy modes and to the Match partial mode, respectively.

    • In the Limit column of this row, leave the default value 5.
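The configuration above can be pictured with a small Python sketch. This is not Talend code and does not reflect how tSynonymSearch is implemented: the two synonym indexes are modeled as plain dictionaries, the index entries are invented for illustration (loosely based on the names mentioned in this scenario), and `parse_rows`, `exact_matches` and `search` are hypothetical helpers. Comparing words case-insensitively is also an assumption.

```python
# Illustrative sketch only: dictionaries stand in for the two synonym indexes.
# All index entries below are invented for this example.

CONTENT = """Kristof;Toum
Chris;Toom
Tony;Walker
Anton;Correia
Jim;Correia
Jim;Walker"""

# Hypothetical index layout: reference entry -> list of synonyms.
FIRST_NAME_INDEX = {
    "Christian": ["Chris"],
    "Anthony": ["Tony", "Anton"],
    "James": ["Jim"],
}
LAST_NAME_INDEX = {
    "Correia": ["Corria"],
    "Toom": ["Toum"],
    "Walker": ["Waker"],
}


def parse_rows(content):
    """Split the semicolon-delimited content into FIRSTNAME/LASTNAME records."""
    rows = []
    for line in content.splitlines():
        first, last = line.strip().split(";")
        rows.append({"FIRSTNAME": first, "LASTNAME": last})
    return rows


def exact_matches(word, index, limit=5):
    """Return up to `limit` reference entries that equal `word` or list it as a synonym."""
    needle = word.lower()  # assumption: case-insensitive comparison
    hits = []
    for reference, synonyms in index.items():
        candidates = [reference] + synonyms
        if any(needle == candidate.lower() for candidate in candidates):
            hits.append(reference)
    return hits[:limit]


def search(row):
    """Add the two matched_* columns to a copy of the input record."""
    out = dict(row)
    out["matched_fname"] = exact_matches(row["FIRSTNAME"], FIRST_NAME_INDEX)
    out["matched_lname"] = exact_matches(row["LASTNAME"], LAST_NAME_INDEX)
    return out


results = [search(row) for row in parse_rows(CONTENT)]
```

Each input column gets its own index and its own output column, which mirrors the two rows of the Columns to search table. The score threshold is left out of this sketch because the scoring used by the component is not described in this scenario.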

Executing the Job

  • Press F6 to run this Job.

    The execution result reads as follows in the console of the Run view.

    From this result, if you take the input data Chris Toom for example, you can see that:

  • this record is recognized as group 2, with a group size of 3. This means that 3 pairs of exact-match reference entries are found in the two synonym indexes in use. The exact matches for the first name are Christian, Christiaan and Christoffel, and the exact match for the last name is toom in each of the three pairs.

  • the SCORES column contains two sub-columns.

    These sub-columns present the matching scores with regard to the matched_fname and matched_lname reference columns respectively. Each figure in the SCORE column equals the sum of the two figures on the same row in the sub-columns of the SCORES column.
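The relationship between the SCORE column and the two SCORES sub-columns can be sketched as follows. This is a minimal Python illustration: the `MatchedPair` class is hypothetical and the score values are placeholders, not the figures produced by the Job.

```python
from dataclasses import dataclass


@dataclass
class MatchedPair:
    """One output row of a group: a pair of reference entries with their scores."""
    matched_fname: str
    matched_lname: str
    fname_score: float  # first sub-column of SCORES
    lname_score: float  # second sub-column of SCORES

    @property
    def score(self):
        # SCORE is the sum of the two sub-scores on the same row.
        return self.fname_score + self.lname_score


# Group for the input record "Chris;Toom"; the scores are placeholders.
group = [
    MatchedPair("Christian", "toom", 1.5, 2.0),
    MatchedPair("Christiaan", "toom", 1.25, 2.0),
    MatchedPair("Christoffel", "toom", 1.0, 2.0),
]

group_size = len(group)
total_scores = [pair.score for pair in group]
```

Here `group_size` corresponds to the group size reported for the record, and `total_scores` holds one SCORE value per matched pair.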