Scenario: Using a pivot column to aggregate data - 6.3

Talend Open Studio for Big Data Components Reference Guide

The following scenario describes a Job that aggregates data from a delimited input file, using a defined pivot column.

Dropping and linking components

  1. Drop the following components from the Palette to the design workspace: tFileInputDelimited and tPivotToColumnsDelimited.

  2. Link the two components using a Row > Main connection.

Configuring the components

Set the input component

  1. Double-click the tFileInputDelimited component to open its Basic settings view.

  2. Browse to the input file to fill out the File Name field.

    The input file contains three columns: ID, Question, and the corresponding Answer.

  3. Define the Row separator and the Field separator. In this example, they are a carriage return and a semicolon, respectively.

  4. Because the file contains a header line, set the Header field accordingly.

  5. Set the schema describing the three columns: ID, Questions, Answers.
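
As an illustration of what these input settings amount to, the following Python sketch parses semicolon-delimited text, skips one header line, and maps each row onto the three schema columns. It is only a sketch under stated assumptions: the sample data and the read_rows helper are hypothetical, and this is not the code that Talend Studio generates.

```python
import csv

# Hypothetical sample data: the actual input file is not shown in this scenario.
SAMPLE = "ID;Questions;Answers\n1;Q1;yes\n1;Q2;no\n2;Q1;no\n"

# Column names taken from the schema defined in the last step.
SCHEMA = ["ID", "Questions", "Answers"]

def read_rows(text, field_sep=";", header_lines=1):
    """Split text into rows, skip the header lines, and map fields to schema columns."""
    lines = text.splitlines()[header_lines:]
    reader = csv.reader(lines, delimiter=field_sep)
    # Skip empty rows; every remaining row becomes a dict keyed by column name.
    return [dict(zip(SCHEMA, row)) for row in reader if row]

rows = read_rows(SAMPLE)
```

Here header_lines plays the role of the Header setting and field_sep the role of the Field separator.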

Set the output component

  1. Double-click the tPivotToColumnsDelimited component to open its Basic settings view.

  2. In the Pivot column field, select the pivot column from the input schema. This is often the column with the most duplicate values (pivot aggregation values).

  3. In the Aggregation column field, select the column from the input schema that should be aggregated.

  4. In the Aggregation function field, select the function to be used when duplicates are found.

  5. In the Group by table, add an Input column to be used for grouping the aggregated data.

  6. In the File Name field, browse to the output file path. In the Row separator and Field separator fields, set the separators for the aggregated output rows and data.
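
The effect of these settings can be sketched in Python as follows: each distinct value of the pivot column becomes an output column, rows are grouped by the Group by column, and the aggregation function resolves duplicates. Everything here is an assumption made for illustration: the sample rows, the pivot, last, and write_delimited helpers, the choice of a "last value" function, and the empty string used for missing cells. The component's actual behavior may differ.

```python
# Hypothetical input rows; the last row duplicates the (ID=2, Q1) pair.
ROWS = [
    {"ID": "1", "Questions": "Q1", "Answers": "yes"},
    {"ID": "1", "Questions": "Q2", "Answers": "no"},
    {"ID": "2", "Questions": "Q1", "Answers": "no"},
    {"ID": "2", "Questions": "Q1", "Answers": "yes"},
]

def last(values):
    """Illustrative aggregation function: keep the last value seen."""
    return values[-1]

def pivot(rows, group_by, pivot_col, agg_col, agg_func):
    """Turn the distinct values of pivot_col into columns, one output row per group."""
    # Distinct pivot values in first-seen order become the new column names.
    pivot_values = list(dict.fromkeys(r[pivot_col] for r in rows))
    # Collect the values to aggregate for every (group, pivot value) pair.
    cells = {}
    for r in rows:
        cells.setdefault(r[group_by], {}).setdefault(r[pivot_col], []).append(r[agg_col])
    header = [group_by] + pivot_values
    table = [
        # Assumption: a group with no value for a pivot column gets an empty cell.
        [key] + [agg_func(by_pivot[p]) if p in by_pivot else "" for p in pivot_values]
        for key, by_pivot in cells.items()
    ]
    return header, table

def write_delimited(header, table, field_sep=";", row_sep="\n"):
    """Render the pivoted table using the chosen field and row separators."""
    return "".join(field_sep.join(row) + row_sep for row in [header] + table)

header, table = pivot(ROWS, "ID", "Questions", "Answers", last)
output = write_delimited(header, table)
```

With this sample, header is ["ID", "Q1", "Q2"], and the row for ID 2 keeps "yes" for Q1 because last resolves the duplicate pair.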

Saving and executing the Job

  1. Press Ctrl+S to save your Job.

  2. Press F6 or click Run on the Run tab to execute the Job.

    The output file shows the newly aggregated data.