Scenario: Normalizing data - 6.1

Talend Components Reference Guide

This simple scenario illustrates a Job that normalizes a list of tags for Web forum topics, and displays the result in a table on the Run console.

This list is not well organized and it contains trailing empty strings, leading and trailing whitespace, and repeated tags, as shown below.

ldap,
  db2, jdbc driver,
grid computing,  talend architecture  ,
content, environment,,
tmap,,
eclipse,
database,java,postgresql,
tmap,
database,java,sybase,
deployment,,
repository,
database,informix,java

Setting up the Job

  1. Drop the following components from the Palette to the design workspace: tFileInputDelimited, tNormalize, tLogRow.

  2. Connect the components using Row > Main connections.

Configuring the components

  1. Double-click the tFileInputDelimited component to open its Basic settings view.

  2. In the File name field, specify the path to the input file to be normalized.

  3. Click the [...] button next to Edit schema to open the [Schema] dialog box, and set up the input schema by adding one column named Tags. When done, click OK to validate your schema setup and close the dialog box, leaving the rest of the settings as they are.

  4. Double-click the tNormalize component to open its Basic settings view.

  5. Check the schema and, if necessary, click Sync columns to synchronize it with the schema of the input component.

  6. Define the column the normalization operation is based on.

    In this use case, the input schema has only one column, Tags, so just accept the default setting.

  7. In the Advanced settings view, select the Get rid of duplicate rows from output, Discard the trailing empty strings, and Trim resulting values check boxes.

  8. In the tLogRow component, select the Print values in the cells of table radio button.

Saving and executing the Job

  1. Press Ctrl+S to save your Job.

  2. Click Run on the Run tab or press F6 to execute the Job.

    The list is tidied up: duplicate tags, leading and trailing whitespace, and trailing empty strings are removed, and the result is displayed in a table on the console.
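
The combined effect of the three options selected in the Advanced settings view can be pictured with a short Java sketch. This is an illustration only, not the code that Talend Studio generates: the class name NormalizeSketch and the method normalize are hypothetical, the separator is assumed to be a literal comma, and duplicates are assumed to be removed across the whole output, as the result of this scenario suggests.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;
import java.util.regex.Pattern;

// Hypothetical illustration of the scenario; not Talend-generated code.
class NormalizeSketch {

    // Splits every input row on the separator and returns one value per output row.
    static List<String> normalize(List<String> rows, String separator) {
        // A LinkedHashSet drops repeated values while keeping first-seen order
        // (stands in for "Get rid of duplicate rows from output").
        Set<String> seen = new LinkedHashSet<>();
        for (String row : rows) {
            // String.split with the default limit omits trailing empty strings
            // (stands in for "Discard the trailing empty strings").
            for (String item : row.split(Pattern.quote(separator))) {
                // trim() removes leading and trailing whitespace
                // (stands in for "Trim resulting values").
                seen.add(item.trim());
            }
        }
        return new ArrayList<>(seen);
    }

    public static void main(String[] args) {
        // A few rows taken from the sample input of this scenario.
        List<String> rows = Arrays.asList(
                "ldap,",
                "  db2, jdbc driver,",
                "grid computing,  talend architecture  ,",
                "tmap,,",
                "tmap,",
                "database,java,sybase,",
                "database,informix,java");
        for (String tag : normalize(rows, ",")) {
            System.out.println("|" + tag + "|");
        }
    }
}
```

The pipe characters printed around each value only make it easy to check that no surrounding whitespace remains; they do not reproduce the table layout of tLogRow.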