Scenario 2: Extracting the stems of English words from a specific DB column - 6.1

Talend Components Reference Guide

EnrichVersion
6.1
EnrichProdName
Talend Big Data
Talend Big Data Platform
Talend Data Fabric
Talend Data Integration
Talend Data Management Platform
Talend Data Services Platform
Talend ESB
Talend MDM Platform
Talend Open Studio for Big Data
Talend Open Studio for Data Integration
Talend Open Studio for Data Quality
Talend Open Studio for ESB
Talend Open Studio for MDM
Talend Real-Time Big Data Platform
task
Data Governance
Data Quality and Preparation
Design and Development
EnrichPlatform
Talend Studio

This scenario describes a six-component Job that carries out linguistic normalization on data in the translation column and extract the base part (word stem) of all English words.

The aim of this Job is to create a kind of dictionary of stems of the English words listed in the translation column. This dictionary may be used at a later stage in order to check new words to be put in the selected table. The extracted English stems are written in an output file along with the number of their occurrences in the translation column.

In this scenario, we have already stored the main input schema in the Repository. For more information about storing schema metadata in the Repository, see Talend Studio User Guide.

The main input table contains eight columns: id_key, id_lang, translation, id_status, id_user_trans, id_user_validate, id_editor and date. We want to extract the stem of the English words in the translation column.

Setting up the Job

  1. In the Repository tree view, expand Metadata - DB Connections where you have stored the main input schema and drop the relevant file onto the design workspace.

    The [Components] dialog box displays with the corresponding component selected by default.

  2. Click OK to drop the tMysqlInput component onto the workspace.

    The input table used in this scenario is called translation. It holds several columns including the translation column that holds the English words we want to stem.

  3. Drop the following components from the Palette onto the design workspace: tNormalize, tFilterRow, tStem, tAggregateRow and tFileOutputExcel.

  4. Connect the component together using the Main links with the exception of the tFilterRow - tStem connection that should use a Filter link.

Configuring the data input

  1. Double-click the main input database component to display its Basic settings views.

    The property fields for tMysqlInput are automatically filled in. If you do not define your input schema locally in the Repository, fill in the details manually after selecting Built-in in the Schema and Property Type fields.

  2. If required, modify the query in the Query box.

    In this example, we want to work only on the English words and this is why the id_lang is set to 1.

Configuring the preprocessing process

  1. Double-click tNormalize to display its Basic settings view and define the component properties.

  2. From the Column to normalize list, select translation.

    This will split the data strings in the translation column into words.

  3. In the Item separator field, enter the separator which will delimits data in the translation column, a space character in this example.

  4. Double-click tFilterRow to display its Basic settings view and define the component properties.

  5. Select the logical operator you want to use in order to combine simple filtering and advanced mode.

  6. In the Conditions area, click the plus button to add one or more conditions to the output flow. And then in the corresponding table column:

    • select the input column you want to operate on,

    • select the needed function on the list,

    • select the operator to bind the input column with the value,

    • type in the value for content filtering.

      In this example, we want to filter all words in the translation column that have less than three letters.

Configuring word stem extraction

  1. Double-click tStem to display its Basic settings view and define the component properties.

  2. In the Select Algorithm area, click in the Algorithm cell that corresponds to the translation column. And then select from the list the algorithm language you want to check the column data against, English in this scenario.

Configuring the data output

  1. Double-click tAggregateRow to display its Basic settings view and define the component properties.

  2. Click the [...] button next to Edit schema to open a dialog box. Here you can define the output flow.

  3. In the output flow to the right of the dialog box, click the plus button to add as many columns as you need in the output flow.

    In this example, we want to have two output columns, the translation column and a new output column called count.

    When done, click OK to close the dialog box and proceed to the next step.

  4. In the tAggregateRow basic settings view and in the Group by area, click the plus button to add an many lines as needed. Here you can define the group-by values.

    • Click in the Output column line and select the output column that will hold the aggregated data, the translation column in this example.

    • Click in the Input column position line and select the input column from which you want to collect the values to be aggregated, the translation column in this example.

  5. In the Operations area, click the plus button to add lines for the columns that will hold the aggregated data. Here you can define the calculation values.

    • Click in the Output column line and select the destination column from the list, the translation column in this example.

    • Click in the Function column line and select any of the listed operations.

      In this example, we want to count the number of distinct stems to be listed only once in the output column.

    • Click in the Input column position line and select the input column from which you want to collect the values to be aggregated, the id_key column in this example.

  6. Double-click tFileOutputExcel to display its Basic settings view and define the component properties.

  7. Set the destination file path and define the settings of the file according to your needs.

Executing the Job

  • Save the Job and press F6 to execute it.

    The output Excel file is written in the defined path. This file holds the extracted English word stems in the translation column and the count of each stem in the count column.

    The figure below illustrates an extraction of the output file.