Scenario 1: Creating a synonym index for city names - 6.1

Talend Components Reference Guide

In this scenario, a three-component Job creates an index of standardized city names and links each name to the synonyms used for it in the client data of an enterprise.

To create this index, you need a source file that provides the city names and their corresponding synonyms. In this scenario, it is a .csv file that reads as follows:

North Reading;Redding|North Reading|N. Reading|N Reading|N Redding|NR
Young America;YA|Young America
New York;NY|New York

Two columns are found in this file:

  • the left one is the CityName column which holds the standard city names as reference data.

  • the right one is the Synonyms column which holds various synonyms collected across the client data of this enterprise.
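The file layout described above can be illustrated with a minimal Python sketch. It assumes, as the sample suggests, that a semicolon separates the two columns and that a pipe character separates the individual synonyms. The function name read_city_synonyms is hypothetical and is not part of any Talend API.

```python
import csv
import io

# Sample data copied from the scenario's source file.
SAMPLE = """\
North Reading;Redding|North Reading|N. Reading|N Reading|N Redding|NR
Young America;YA|Young America
New York;NY|New York
"""


def read_city_synonyms(stream):
    """Yield (city_name, synonyms) pairs from a semicolon-delimited stream.

    Assumption: the synonyms in the second column are pipe-separated.
    """
    for row in csv.reader(stream, delimiter=";"):
        if not row:
            continue  # skip blank lines
        city_name, synonyms_field = row
        yield city_name, synonyms_field.split("|")


rows = list(read_city_synonyms(io.StringIO(SAMPLE)))
for city_name, synonyms in rows:
    print(city_name, "->", synonyms)
```

Each pair holds one CityName value and the list of Synonyms values that belong to it.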

The three components used in this Job are:

  • tFileInputDelimited: this component loads data from the source file and passes it to tSynonymOutput.

  • tSynonymOutput: this component creates the index of interest in this scenario and feeds it with the synonyms given in the source file.

  • tLogRow: this component lists the data that have been inserted into the newly created index.
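To make the data flow between the three components easier to picture, the following Python sketch chains three generator-style functions in the same order. It is only a conceptual stand-in: the function names are hypothetical, a plain dictionary takes the place of the real index, and none of this reflects how the Talend components are implemented.

```python
def load_rows(lines):
    """Stand-in for tFileInputDelimited: split each line into two fields."""
    for line in lines:
        city_name, synonyms = line.rstrip("\n").split(";")
        yield {"CityName": city_name, "Synonyms": synonyms}


def index_rows(rows, index):
    """Stand-in for tSynonymOutput: store each row, then pass it downstream."""
    for row in rows:
        # Assumption: synonyms are pipe-separated, as in the sample file.
        index[row["CityName"]] = row["Synonyms"].split("|")
        yield row


def log_rows(rows):
    """Stand-in for tLogRow: print each row that reached the end of the flow."""
    count = 0
    for row in rows:
        print(f'{row["CityName"]} | {row["Synonyms"]}')
        count += 1
    return count


lines = ["Young America;YA|Young America", "New York;NY|New York"]
index = {}  # a dictionary stands in for the index
logged = log_rows(index_rows(load_rows(lines), index))
```

Because each stage consumes the previous one lazily, a row is read, stored, and logged before the next row is read, which mirrors a row-by-row Main connection.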

Setting up the Job

To replicate this scenario, proceed as follows:

  1. Drop tFileInputDelimited, tSynonymOutput and tLogRow from the Palette onto the design workspace.

    You can change the displayed name of each of these components, as has been done for the tFileInputDelimited component, which appears as CityNames in this scenario. For further information, see Talend Studio User Guide.

  2. Right-click the tFileInputDelimited (CityNames) component to open the contextual menu.

  3. From this menu, select Row > Main.

  4. Click the tSynonymOutput component to create a connection between these two components.

  5. In the same way, connect tSynonymOutput to tLogRow.

Configuring the components

  1. Double-click tFileInputDelimited (CityNames) to open its Basic settings view.

  2. In the File name/Stream field, specify the path to the input file.

  3. Click the [...] button next to Edit schema to open the [Schema] dialog box, click the [+] button twice to add two columns, and name them CityName and Synonyms respectively, to match the input file structure.

    When done, click OK to close the dialog box and propagate the schema setting to the next component.

    You can also configure tFileInputDelimited from metadata already stored in the Repository, which lets you reuse the configuration of that metadata automatically. For further information about how to create and use this metadata, see Talend Studio User Guide.

  4. Double-click tSynonymOutput to open its Basic settings view.

  5. In the Index path field, type in or browse to the location where you need to create the index.

  6. In the Operation field, select the operation to perform on the index and its synonyms. In this example, select (Delete and) initialize an index.

  7. In the Entry field, select the column to be used to receive and store the standard reference data. In the source file used in this scenario, the CityName column holds the standard city names, so select CityName.

  8. In the Synonyms field, select the column to be used to receive and store the synonyms. In this scenario, select Synonyms.

  9. In the Basic settings view of the tLogRow component, select the Table option for a more readable display of the Job execution result.

Executing the Job

  • Press F6 to run this Job.

    An index is created in the specified directory, and the city names and their synonyms are inserted into the index. These entries, along with their status, are displayed on the Console.
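As a rough illustration of what such an index makes possible, the sketch below resolves a synonym back to its standard city name. It uses an in-memory dictionary built from the scenario's sample data. The build_lookup helper and the case-insensitive matching are assumptions made for this example only; how the real index is queried is outside the scope of this scenario.

```python
# Entries and synonyms taken from the scenario's source file.
entries = {
    "North Reading": ["Redding", "North Reading", "N. Reading",
                      "N Reading", "N Redding", "NR"],
    "Young America": ["YA", "Young America"],
    "New York": ["NY", "New York"],
}


def build_lookup(entries):
    """Map every synonym (case-folded) to its standard city name."""
    lookup = {}
    for city_name, synonyms in entries.items():
        for synonym in synonyms:
            lookup[synonym.casefold()] = city_name
    return lookup


lookup = build_lookup(entries)
print(lookup.get("ny"))                    # New York
print(lookup.get("N Redding".casefold()))  # North Reading
print(lookup.get("boston"))                # None: not in the sample data
```

A value that matches none of the stored synonyms returns None here, which a caller could treat as an unrecognized city name.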