tStem - 6.1

Talend Components Reference Guide

EnrichVersion
6.1
EnrichProdName
Talend Big Data
Talend Big Data Platform
Talend Data Fabric
Talend Data Integration
Talend Data Management Platform
Talend Data Services Platform
Talend ESB
Talend MDM Platform
Talend Open Studio for Big Data
Talend Open Studio for Data Integration
Talend Open Studio for Data Quality
Talend Open Studio for ESB
Talend Open Studio for MDM
Talend Real-Time Big Data Platform
task
Data Governance
Data Quality and Preparation
Design and Development
EnrichPlatform
Talend Studio

Warning

This component will be available in the Palette of Talend Studio on the condition that you have subscribed to one of the Talend Platform products.

tStem properties

Component family

Data quality

 

Function

tStem standardizes data in columns through the process of linguistic normalization, in which the variant forms of a word are reduced to a common form.

Purpose

tStem enables to standardize data in columns before matching this data.

Basic settings

Schema and Edit schema

A schema is a row description, it defines the number of fields to be processed and passed on to the next component. The schema is either Built-in or stored remotely in the Repository.

Since version 5.6, both the Built-In mode and the Repository mode are available in any of the Talend solutions.

 

 

Built-in: You create the schema and store it locally for this component only. Related topic: see Talend Studio User Guide.

 

 

Repository: You have already created the schema and stored it in the Repository. You can reuse it in various projects and job designs. Related topic: see Talend Studio User Guide.

 

Select Algorithm

Set a stemming algorithm for each analyzed column of the type string.

Column: list of the analyzed column.

Algorithm: Select from the list the algorithm language you want to check the column data against.

Advanced settings

tStatCatcher Statistics

Select this check box to collect log data at the component level.

Global Variables

ERROR_MESSAGE: the error message generated by the component when an error occurs. This is an After variable and it returns a string. This variable functions only if the Die on error check box is cleared, if the component has this check box.

A Flow variable functions during the execution of a component while an After variable functions after the execution of the component.

To fill up a field or expression with a variable, press Ctrl + Space to access the variable list and choose the variable to use from it.

For further information about variables, see Talend Studio User Guide.

Usage

This component is an intermediary step. It requires an input flow as well as an output.

Limitation

n/a

 

Scenario 1: Generating stems for a list of English words

This basic scenario describes a four-component Job that reads a list of English words from a one-column delimited file, extracts the stems of the words, and displays both the list of words and the corresponding stems on the Run console.

Setting up the Job

  1. Drop the following components from the Palette onto the design workspace: tFileInputDelimited, tMap, tStem, and tLogRow.

  2. Link the tFileInputDelimited component to the tMap component using a Row > Main connection.

  3. Link the tMap component to the tStem component using a Row > Main connection, and give the output row connection a name, out in this example.

  4. Link the tStem component to the tLogRow component using a Row > Main connection.

Configuring the components

  1. Double-click the tFileInputDelimited component to open its Basic settings view.

  2. Browse to the input file, and set basic properties based on the structure of the input file. In this example, the input file provides a list of English words in different variant forms, and does not have a header. The following is an exact of the file content.

    computerize
    computerized
    computerizing
    program
    programming
    cooking
    cooked
    cooks
    evaporable
  3. Click the [...] button next to Edit schema to open the [Schema] dialog box, and set the input schema, which should contain one column named Word in this example.

    When done, click OK to close the dialog box.

  4. Double-click the tMap component to open the map editor. We will use this component to map the single-column input flow to a two-column data flow to feed the tStem component.

  5. Click the [+] button to add two columns to the output schema and name them Fullform and Stem respectively. Then, drag the Word column from the input table onto the Fullform column, then onto the Stem column, in the output table.

    When done, click OK to close the map editor and propagate the changes to the next component.

  6. Double-click the tStem component to open its Basic settings view.

  7. In the Select Algorithm table, click in the Algorithm field for the Stem column, which will carry the word stems extracted from the input data, and select English as the algorithm language.

  8. Double-click the tLogRow component to open its Basic settings view, and select the Table option for better readable display of the Job execution result.

Executing the Job

  1. Press Ctrl+S to save your Job.

  2. Press F6 or click the Run button on the Run tab to execute the Job.

    The list of words read from the input data and their corresponding stems are displayed on the Run console.

Scenario 2: Extracting the stems of English words from a specific DB column

This scenario describes a six-component Job that carries out linguistic normalization on data in the translation column and extract the base part (word stem) of all English words.

The aim of this Job is to create a kind of dictionary of stems of the English words listed in the translation column. This dictionary may be used at a later stage in order to check new words to be put in the selected table. The extracted English stems are written in an output file along with the number of their occurrences in the translation column.

In this scenario, we have already stored the main input schema in the Repository. For more information about storing schema metadata in the Repository, see Talend Studio User Guide.

The main input table contains eight columns: id_key, id_lang, translation, id_status, id_user_trans, id_user_validate, id_editor and date. We want to extract the stem of the English words in the translation column.

Setting up the Job

  1. In the Repository tree view, expand Metadata - DB Connections where you have stored the main input schema and drop the relevant file onto the design workspace.

    The [Components] dialog box displays with the corresponding component selected by default.

  2. Click OK to drop the tMysqlInput component onto the workspace.

    The input table used in this scenario is called translation. It holds several columns including the translation column that holds the English words we want to stem.

  3. Drop the following components from the Palette onto the design workspace: tNormalize, tFilterRow, tStem, tAggregateRow and tFileOutputExcel.

  4. Connect the component together using the Main links with the exception of the tFilterRow - tStem connection that should use a Filter link.

Configuring the data input

  1. Double-click the main input database component to display its Basic settings views.

    The property fields for tMysqlInput are automatically filled in. If you do not define your input schema locally in the Repository, fill in the details manually after selecting Built-in in the Schema and Property Type fields.

  2. If required, modify the query in the Query box.

    In this example, we want to work only on the English words and this is why the id_lang is set to 1.

Configuring the preprocessing process

  1. Double-click tNormalize to display its Basic settings view and define the component properties.

  2. From the Column to normalize list, select translation.

    This will split the data strings in the translation column into words.

  3. In the Item separator field, enter the separator which will delimits data in the translation column, a space character in this example.

  4. Double-click tFilterRow to display its Basic settings view and define the component properties.

  5. Select the logical operator you want to use in order to combine simple filtering and advanced mode.

  6. In the Conditions area, click the plus button to add one or more conditions to the output flow. And then in the corresponding table column:

    • select the input column you want to operate on,

    • select the needed function on the list,

    • select the operator to bind the input column with the value,

    • type in the value for content filtering.

      In this example, we want to filter all words in the translation column that have less than three letters.

Configuring word stem extraction

  1. Double-click tStem to display its Basic settings view and define the component properties.

  2. In the Select Algorithm area, click in the Algorithm cell that corresponds to the translation column. And then select from the list the algorithm language you want to check the column data against, English in this scenario.

Configuring the data output

  1. Double-click tAggregateRow to display its Basic settings view and define the component properties.

  2. Click the [...] button next to Edit schema to open a dialog box. Here you can define the output flow.

  3. In the output flow to the right of the dialog box, click the plus button to add as many columns as you need in the output flow.

    In this example, we want to have two output columns, the translation column and a new output column called count.

    When done, click OK to close the dialog box and proceed to the next step.

  4. In the tAggregateRow basic settings view and in the Group by area, click the plus button to add an many lines as needed. Here you can define the group-by values.

    • Click in the Output column line and select the output column that will hold the aggregated data, the translation column in this example.

    • Click in the Input column position line and select the input column from which you want to collect the values to be aggregated, the translation column in this example.

  5. In the Operations area, click the plus button to add lines for the columns that will hold the aggregated data. Here you can define the calculation values.

    • Click in the Output column line and select the destination column from the list, the translation column in this example.

    • Click in the Function column line and select any of the listed operations.

      In this example, we want to count the number of distinct stems to be listed only once in the output column.

    • Click in the Input column position line and select the input column from which you want to collect the values to be aggregated, the id_key column in this example.

  6. Double-click tFileOutputExcel to display its Basic settings view and define the component properties.

  7. Set the destination file path and define the settings of the file according to your needs.

Executing the Job

  • Save the Job and press F6 to execute it.

    The output Excel file is written in the defined path. This file holds the extracted English word stems in the translation column and the count of each stem in the count column.

    The figure below illustrates an extraction of the output file.