tSynonymOutput - 6.1

Talend Components Reference Guide


Warning

This component is available in the Palette of Talend Studio only if you have subscribed to one of the Talend Platform products.

tSynonymOutput Properties

Component family

Data Quality

 

Function

tSynonymOutput creates a Lucene index and feeds it with the entries and related synonyms it receives.

For further information about how to access and manage the words and the reference entries (documents) of an existing synonym index using the synonym index editor, see the Talend Studio User Guide.

For further information about available synonym indexes, see the appendix about data synonym dictionaries in the Talend Studio User Guide.

Note

The synonym similarity computation has been enhanced since Studio version 5.1. If your indexes were created with version 5.0 or earlier and you need to handle them with this enhanced computation method, update them by running the IndexMigrator.jar file, downloadable from: http://talendforge.org/svn/top/trunk/org.talend.dataquality.standardization.migration/dist/IndexMigrator.jar. The command to run this jar file is:

java -jar IndexMigrator.jar <inputPath> <outputPath(optional)> 

Purpose

tSynonymOutput creates synonym indexes that some components like tStandardizeRow or tSynonymSearch can refer to when processing data.

Basic settings

Schema and Edit schema

A schema is a row description that defines the number of fields to be processed and passed on to the next component. The schema is either Built-in or stored remotely in the Repository.

Since version 5.6, both the Built-In mode and the Repository mode are available in any of the Talend solutions.

 

 

Built-in: The schema will be created and stored locally for this component only. Related topic: see Talend Studio User Guide.

 

 

Repository: The schema already exists and is stored in the Repository, hence can be reused in various projects and job designs. Related topic: see Talend Studio User Guide.

 

Index path

Type in or browse to the location where you want to create and store the synonym index. If the specified directory does not exist, the component will create it.

 

Operations

Select the index operation to be performed in the directory given in the Index path field.

(Delete and) initialize an index: creates a new index and then fills it with the entries and the corresponding synonyms; if an index already exists, deletes it before creating a new one.

Insert new documents: inserts new entries and synonyms into the given existing index. Duplicates are not inserted.

Update existing documents and insert if not existing: updates existing entries and synonyms, and adds new ones to the given index.

Delete existing documents: deletes the entries with their synonyms if the same entries are identified in the incoming data flow from the preceding component.
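The four operations can be pictured with a small in-memory model. This is only a sketch: SynonymIndexModel and its methods are hypothetical names, a plain map stands in for the Lucene index, and it assumes that a duplicate is identified by its entry, which this page does not spell out. It uses Map.of and List.of, so it needs Java 9 or later.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical in-memory stand-in for a synonym index; not the Talend or Lucene API.
class SynonymIndexModel {
    private final Map<String, List<String>> documents = new LinkedHashMap<>();

    // (Delete and) initialize an index: drop everything, then load the incoming rows.
    void initialize(Map<String, List<String>> rows) {
        documents.clear();
        documents.putAll(rows);
    }

    // Insert new documents: an entry that already exists is left untouched
    // (assumption: duplicates are detected on the entry alone).
    void insert(String entry, List<String> synonyms) {
        documents.putIfAbsent(entry, synonyms);
    }

    // Update existing documents and insert if not existing.
    void upsert(String entry, List<String> synonyms) {
        documents.put(entry, synonyms);
    }

    // Delete existing documents: remove the entry together with its synonyms.
    void delete(String entry) {
        documents.remove(entry);
    }

    List<String> synonymsOf(String entry) {
        return documents.get(entry);
    }

    int size() {
        return documents.size();
    }

    public static void main(String[] args) {
        SynonymIndexModel index = new SynonymIndexModel();
        index.initialize(Map.of("New York", List.of("NY", "New York")));
        index.insert("New York", List.of("Big Apple"));       // existing entry, ignored
        index.upsert("Dedham", List.of("Dedham", "deadham")); // new entry, added
        index.delete("New York");
        System.out.println(index.size() + " document(s) left");
    }
}
```

In this model, only the update operation can change the synonyms of an entry that is already present.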

 

Entry

Select the column whose values are inserted as the entries of the given index. Each entry serves as the reference for the associated synonyms inserted alongside it in this index.

 

Synonyms

Select the column you need to insert to create the synonyms corresponding to different index entries.

 

Synonym separator

Type in the separator to be used to separate the synonyms of each index entry. By default, this separator is |.
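If you post-process a Synonyms value in custom Java code, keep in mind that the default separator is a special character in Java regular expressions. The following sketch, in which SynonymSplitter is a hypothetical helper and not part of the component, shows one way to split on it literally.

```java
import java.util.Arrays;
import java.util.List;
import java.util.regex.Pattern;

// Hypothetical helper, not part of tSynonymOutput.
class SynonymSplitter {
    // Splits one Synonyms value on a literal separator.
    // Pattern.quote is needed because "|" is the alternation operator in Java regular expressions.
    static List<String> split(String synonyms, String separator) {
        return Arrays.asList(synonyms.split(Pattern.quote(separator)));
    }

    public static void main(String[] args) {
        System.out.println(split("NY|New York", "|")); // prints [NY, New York]
    }
}
```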

Advanced settings

tStatCatcher Statistics

Select this check box to collect log data at the Job and the component levels.

Connections

Outgoing links (from this component to another):

Row: Main; Reject

Trigger: Run if; On Component Ok; On Component Error.

Incoming links (from one component to this one):

Row: Main; Reject

For further information regarding connections, see Talend Studio User Guide.

Global Variables

ERROR_MESSAGE: the error message generated by the component when an error occurs. This is an After variable and it returns a string. This variable functions only if the Die on error check box is cleared, if the component has this check box.

A Flow variable functions during the execution of a component while an After variable functions after the execution of the component.

To fill up a field or expression with a variable, press Ctrl + Space to access the variable list and choose the variable to use from it.

For further information about variables, see Talend Studio User Guide.
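As an illustration, an After variable is typically read from the Job's globalMap in a Java expression. The sketch below rests on assumptions: the component is taken to be labeled tSynonymOutput_1, the key is taken to follow the usual component name plus variable name pattern, and a local map stands in for the globalMap that a generated Job provides.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical helper; in a Job you would normally write the lookup inline in an expression.
class ErrorMessageLookup {
    // Reads the After variable stored under <component unique name>_ERROR_MESSAGE (assumed key pattern).
    static String errorMessage(Map<String, Object> globalMap, String componentName) {
        return (String) globalMap.get(componentName + "_ERROR_MESSAGE");
    }

    public static void main(String[] args) {
        // Local stand-in for the globalMap of a generated Job.
        Map<String, Object> globalMap = new HashMap<>();
        globalMap.put("tSynonymOutput_1_ERROR_MESSAGE", "sample message");
        System.out.println(errorMessage(globalMap, "tSynonymOutput_1"));
    }
}
```

The lookup returns null when no error message has been recorded, so check for null before using the value.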

Usage

This component needs incoming data from the preceding component for creating or updating indexes.

Scenario 1: Creating a synonym index for city names

In this scenario, a three-component Job creates an index of the standardized city names that provides references to the city synonyms used in the client data of an enterprise.

To create this index, you need a source file to provide the city names and their corresponding synonyms. In this scenario, this is a .csv file and reads as follows:

CityName;Synonyms
North Reading;Redding|North Reading|N. Reading|N Reading|N Redding|NR
Young America;YA|Young America
Dedham;Dedham|dedham|deadham
New York;NY|New York

Two columns are found in this file:

  • the left one is the CityName column which holds the standard city names as reference data.

  • the right one is the Synonyms column which holds various synonyms collected across the client data of this enterprise.
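To make the file structure concrete, the following sketch parses lines of this shape into a map from each standard city name to its list of synonyms. CitySynonymFile is a hypothetical class written for illustration; in the Job itself, tFileInputDelimited reads the file.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical illustration of the source file layout; not generated Job code.
class CitySynonymFile {
    // Parses lines shaped like "CityName;Synonyms", skipping the header row.
    static Map<String, List<String>> parse(List<String> lines) {
        Map<String, List<String>> entries = new LinkedHashMap<>();
        for (int i = 1; i < lines.size(); i++) {
            String[] fields = lines.get(i).split(";", 2);
            if (fields.length < 2) {
                continue; // no Synonyms column on this line
            }
            // "\\|" escapes the pipe so that it is treated as a literal separator.
            entries.put(fields[0], Arrays.asList(fields[1].split("\\|")));
        }
        return entries;
    }

    public static void main(String[] args) {
        List<String> lines = List.of(
                "CityName;Synonyms",
                "Dedham;Dedham|dedham|deadham",
                "New York;NY|New York");
        System.out.println(parse(lines).get("Dedham")); // prints [Dedham, dedham, deadham]
    }
}
```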

The three components used in this Job are:

  • tFileInputDelimited: this component loads data from the source file and inputs them to tSynonymOutput.

  • tSynonymOutput: this component creates the index of interest in this scenario and feeds it with the synonyms given in the source file.

  • tLogRow: this component lists the data that have been inserted into the newly created index.

Setting up the Job

To replicate this scenario, proceed as follows:

  1. Drop tFileInputDelimited, tSynonymOutput and tLogRow from the Palette onto the design workspace.

    You can change the displayed name of each of these components, as has been done for the tFileInputDelimited component, which appears as CityNames in this scenario. For further information, see Talend Studio User Guide.

  2. Right-click the tFileInputDelimited (CityNames) component to open the contextual menu.

  3. From this menu, select Row > Main.

  4. Click the tSynonymOutput component to create a connection between these two components.

  5. Do the same thing to connect tSynonymOutput to tLogRow.

Configuring the components

  1. Double-click tFileInputDelimited (CityNames) to open its Basic settings view.

  2. In the File name/Stream field, specify the path to the input file.

  3. Click the [...] button next to Edit schema to open the [Schema] dialog box, click the [+] button twice to add two columns, and name them respectively CityName and Synonyms corresponding to the input file structure.

    When done, click OK to close the dialog box and propagate the schema setting to the next component.

    You can also add this tFileInputDelimited component from the established metadata stored in the Repository. This allows you to automatically reuse the configuration of the corresponding metadata. For further information about how to create and use this metadata, see Talend Studio User Guide.

  4. Double-click tSynonymOutput to open its Basic settings view.

  5. In the Index path field, type in or browse to the location where you need to create the index.

  6. In the Operation field, select the operation you need to perform on this created index as well as the related synonyms. In this example, select (Delete and) initialize an index.

  7. In the Entry field, select the column to be used to receive and store the standard reference data. In the source file used in this scenario, the CityName column is holding the standard city names, so select CityName.

  8. In the Synonyms field, select the column to be used to receive and store the synonyms. In this scenario, select Synonyms.

  9. In the Basic settings view of the tLogRow component, select the Table option for a more readable display of the Job execution result.

Executing the Job

  • Press F6 to run this Job.

    An index is created in the specified directory, and the city names and their synonyms are inserted into the index. These entries, along with their status, are displayed on the Console.

Scenario 2: Creating a synonym index for people names using tMap

In this scenario, a four-component Job creates an index storing people's first names and their related nicknames.

The source data to be used in this scenario is stored in a .csv file, an extract of which is shown below:

Country;FirstName;Nickname1;Nickname2;Nickname3;Nickname4
France;Anne;Ninon;Annie;Ninette;Ann
France;Bernadette;Nad;Netty;Dadette
France;Albert;Al
France;Alexandre;Alex
France;Alfred-Hubert;Alu
France;Andrew;Andy
France;Anthony;Anton;Tony;Tonio
France;Artus;Artie
France;Benoit;Ben
France;Catherine;Cate;Katherine;Kathryn
France;Charles;Charlie;Charlot;Chuck
France;Christophe;Christian;Chris;Kris;Kristof
France;Christian;Chris

This data describes people's home country (not to be inserted into the index), first names (reference entries) and frequently used nicknames (synonyms).

The four components used in this Job are:

  • tFileInputDelimited: this component reads the source data and inputs them to tSynonymOutput.

  • tMap: this component transforms the source data into two separate columns holding the first names and the nicknames, while ignoring the people's home country information.

  • tSynonymOutput: this component creates the index of interest in this scenario and feeds it with the synonyms given in the source file.

  • tLogRow: this component lists the data that have been inserted into the newly created index.

Setting up the Job

To replicate this scenario, proceed as follows:

  1. Drop tFileInputDelimited, tMap, tSynonymOutput and tLogRow from the Palette onto the design workspace.

    You can change the displayed name of each of these components. For further information, see Talend Studio User Guide.

  2. Right-click the tFileInputDelimited component to open the contextual menu, and select Row > Main to connect it with the tMap component.

  3. Do the same thing to connect tMap to tSynonymOutput using a Row > Main link.

    A dialog box pops up to prompt you to name this link you are creating.

  4. Type in synonyms, for example, then click OK to validate this name and thus close this dialog box.

  5. Connect tSynonymOutput to tLogRow, again using a Row > Main link.

Configuring the components

Configure the data input

  1. Double-click tFileInputDelimited to open its Component view.

  2. In the File name/Stream field, specify the path to the input file.

  3. Click the [...] button next to Edit schema to open the [Schema] dialog box, click the [+] button six times to add six columns, and name them Country, FirstName, Nickname1, Nickname2, Nickname3 and Nickname4, corresponding to the input file structure.

    When done, click OK to close the dialog box and propagate the schema setting to the next component.

    You can also add this tFileInputDelimited component from the established metadata stored in the Repository. This allows you to automatically reuse the configuration of the corresponding metadata. For further information about how to create and use this metadata, see Talend Studio User Guide.

Configure data structure transformation

  1. Double-click tMap to open the map editor.

  2. At the bottom right corner (synonyms) of the Schema editor view, click the [+] button to add two rows and name them FirstName and Nicknames. These two columns appear in the synonyms table on the right side of the map editor.

  3. On the input side (left) of the upper part, select the FirstName column and drop it to the FirstName column on the output side (right).

  4. In the Expression field of the Nicknames column on the output side (right), type in DqStringHandling.safeConcat('|',).

  5. On the input side (left) of the upper part, select the columns from Nickname1 to Nickname4 in sequence and drop them onto the Nicknames column, then edit the expression in the Expression field so that it reads DqStringHandling.safeConcat('|', row1.Nickname1, row1.Nickname2, row1.Nickname3, row1.Nickname4).

  6. Click OK to validate these changes and accept the propagation prompted by the dialog box that pops up.
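The effect of the Nicknames expression can be approximated outside the Studio with the following sketch. It does not reproduce DqStringHandling.safeConcat itself: joinNonEmpty is a hypothetical helper, and skipping null and empty values is an assumption based on the sample rows, several of which have fewer than four nicknames.

```java
import java.util.Arrays;
import java.util.Objects;
import java.util.stream.Collectors;

// Hypothetical stand-in for the tMap expression; not the DqStringHandling routine.
class NicknameJoiner {
    // Joins the values with the separator, skipping null and empty ones
    // so that no empty synonym is produced (assumed behavior).
    static String joinNonEmpty(String separator, String... values) {
        return Arrays.stream(values)
                .filter(Objects::nonNull)
                .filter(v -> !v.isEmpty())
                .collect(Collectors.joining(separator));
    }

    public static void main(String[] args) {
        // Row "France;Bernadette;Nad;Netty;Dadette" has no fourth nickname.
        System.out.println(joinNonEmpty("|", "Nad", "Netty", "Dadette", null)); // prints Nad|Netty|Dadette
    }
}
```

The joined string uses the same | separator that tSynonymOutput expects by default in its Synonym separator field.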

Configure index creation and console output

  1. Double-click tSynonymOutput to open its Basic settings view.

  2. In the Index path field, type in or browse to the location where you need to create the index.

  3. In the Operation field, select the operation you need to perform on this created index as well as the related synonyms. In this example, select (Delete and) initialize an index.

  4. In the Entry field, select the column to be used to receive and store the reference entries. In this scenario, the FirstName column is holding the reference entries, so select FirstName.

  5. In the Synonyms field, select the column to be used to receive and store the synonyms. In this scenario, select Nicknames.

  6. In the Basic settings view of the tLogRow component, select the Table option for a more readable display of the Job execution result.

Executing the Job

  • Press F6 to run this Job.

    The index is created and you can view its contents and the entry status on the Console.