Scenario 1: Searching a given index for matched reference entries - 6.1

Talend Components Reference Guide

In this scenario, a three-component Job reads the provided first name data, searches a given synonym index for reference entries that match the input data and then outputs the results.

Create a first-name synonym index for this Job following the procedures outlined in Scenario 2: Creating a synonym index for people names using tMap.

The three components used in this Job are:

  • tFixedFlowInput: this component generates the input data you will match against the reference entries in the synonym index.

  • tSynonymSearch: this component searches for the matched reference entries in the synonym index.

  • tLogRow (found): this component lists the result of this matching search.

Setting up the Job

  1. Drop tFixedFlowInput, tSynonymSearch and tLogRow from the Palette onto the design workspace.

    You can change the displayed name of each of these components, as has been done for the tLogRow component, which is named found in this scenario. For further information, see Talend Studio User Guide.

  2. Right-click the tFixedFlowInput component to open the contextual menu and select Row > Main.

  3. Drop the link on the tSynonymSearch component to create a connection between these two components.

  4. Do the same to connect tSynonymSearch to tLogRow (found).

Configuring the components

  1. Double-click tFixedFlowInput to open its Basic settings view.

  2. Next to the Schema field, click the Edit schema button to open the [Schema] dialog box, add one column and name it FIRSTNAME. When done, click OK to validate these changes and close the dialog box.

  3. In the Mode area, select the Use Inline Content (delimited file) option, and supply the following names in the Content field:

    Kristof
    Chris
    Tony
    Anton

  4. Double-click tSynonymSearch to open its Basic settings view.

  5. Click Sync columns to add the schema columns of its preceding component to the default schema columns of tSynonymSearch.

    When prompted, click Yes to propagate the changes to the next component.

  6. Click the [...] button next to Edit schema to open the [Schema] dialog box, and add one column to the output schema: matched_fname.

    This column will hold the matched reference entries in the output flow.

    When done, click OK to validate the setting and accept propagating the changes when prompted.

  7. In the Limit of each group field, type in 5 to replace the default value.

  8. Under the Columns to search table, click the [+] button to add one row and define the parameters as follows:

    • In the Input column column, select FIRSTNAME from the list of the input columns.

    • In the Reference output column column, select matched_fname from the list of the output columns.

    • In the Index path column, type in the path to the synonym index to be used, between double quotation marks.

    • In the Search mode column, select Match all fuzzy. This mode matches each word of the input string against similar words of the index string.

    • In the Score threshold column, enter 0.9 to filter the results and list only the terms whose similarity score reaches this threshold.

    • In the Max edits column, select 1 as the allowed edit distance.

      With max edit distance 1, you can have only one insertion, deletion or substitution. Any terms within that edit distance from the input data are matched.

    • Leave the Word distance column as is, because it applies only to the Match partial mode.

    • In the Limit column, leave the default value 5.

  9. In the Basic settings view of the tLogRow component, select the Table option for a more readable display of the Job execution result.
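
The Max edits setting in step 8 can be illustrated with a minimal Java sketch of the classic Levenshtein edit distance, which counts single-character insertions, deletions and substitutions. This is an illustration under stated assumptions, not the code that tSynonymSearch runs: the class name, the method names and the sample reference terms (Cris, Toni, Christophe) are hypothetical, and the component may compute similarity differently.

```java
public class EditDistanceSketch {

    // Classic Levenshtein distance: the minimum number of single-character
    // insertions, deletions or substitutions needed to turn a into b.
    static int editDistance(String a, String b) {
        int[] prev = new int[b.length() + 1];
        int[] curr = new int[b.length() + 1];
        for (int j = 0; j <= b.length(); j++) {
            prev[j] = j; // turning the empty string into b[0..j) takes j insertions
        }
        for (int i = 1; i <= a.length(); i++) {
            curr[0] = i; // turning a[0..i) into the empty string takes i deletions
            for (int j = 1; j <= b.length(); j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                curr[j] = Math.min(
                        Math.min(curr[j - 1] + 1, prev[j] + 1), // insertion, deletion
                        prev[j - 1] + cost);                    // substitution or match
            }
            int[] tmp = prev; // reuse the two rows instead of allocating a full matrix
            prev = curr;
            curr = tmp;
        }
        return prev[b.length()];
    }

    // True when the indexed term is within maxEdits edits of the input term.
    static boolean withinMaxEdits(String input, String indexed, int maxEdits) {
        return editDistance(input, indexed) <= maxEdits;
    }

    public static void main(String[] args) {
        // Hypothetical term pairs, not taken from the synonym index of this scenario.
        System.out.println(withinMaxEdits("Chris", "Cris", 1));         // one deletion apart
        System.out.println(withinMaxEdits("Tony", "Toni", 1));          // one substitution apart
        System.out.println(withinMaxEdits("Kristof", "Christophe", 1)); // several edits apart
    }
}
```

With a maximum of 1 edit, the first two pairs are accepted and the third is rejected, which mirrors the behavior described for the Max edits column.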

Executing the Job

  • Press F6 to run this Job.

    The execution result reads as follows in the console of the Run view.

    From this result, you can see that each first name in the input flow matches similar words in the synonym index. For example, the entry Chris from the input flow fuzzy-matches 3 words in the given synonym index. This record is identified as group 2, with a group size of 3, meaning that three matched reference entries are found for this group.

    The SCORE and the SCORES columns present the same values in this scenario because only one input column is used.

    If you want to extract only the input entries that exactly match an index string, select Match exact in the Search mode column in the tSynonymSearch Basic settings view.
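
The group number and group size described above can be pictured with the following Java sketch. It is a simplified model, not the component's implementation: the class, the Row type, the search method and the reference list are hypothetical, and it assumes that groups are numbered by input order starting at 1 and that an input with no match produces no row. The matcher is passed in as a parameter, so an exact comparison stands in here for the Match exact mode.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.function.BiPredicate;

public class GroupingSketch {

    // One output row: input value, matched reference entry, group number, group size.
    static final class Row {
        final String input;
        final String matched;
        final int group;
        final int groupSize;

        Row(String input, String matched, int group, int groupSize) {
            this.input = input;
            this.matched = matched;
            this.group = group;
            this.groupSize = groupSize;
        }

        @Override
        public String toString() {
            return input + "|" + matched + "|" + group + "|" + groupSize;
        }
    }

    // For each input, collect at most `limit` matching reference entries and
    // emit one row per match, all sharing the same group number and size.
    static List<Row> search(List<String> inputs, List<String> references,
                            BiPredicate<String, String> matcher, int limit) {
        List<Row> rows = new ArrayList<>();
        for (int i = 0; i < inputs.size(); i++) {
            String input = inputs.get(i);
            List<String> matches = new ArrayList<>();
            for (String ref : references) {
                if (matches.size() < limit && matcher.test(input, ref)) {
                    matches.add(ref);
                }
            }
            for (String match : matches) {
                rows.add(new Row(input, match, i + 1, matches.size()));
            }
        }
        return rows;
    }

    public static void main(String[] args) {
        // Input names from this scenario; the reference entries are hypothetical.
        List<String> inputs = Arrays.asList("Kristof", "Chris", "Tony", "Anton");
        List<String> references = Arrays.asList("Chris", "Christopher", "Tony", "Anthony");
        // Exact comparison as a stand-in for the Match exact search mode.
        for (Row row : search(inputs, references, String::equals, 5)) {
            System.out.println(row);
        }
    }
}
```

Swapping the matcher for a similarity-based predicate would model a fuzzy mode in the same structure, with each input possibly producing several rows in its group.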