Configuring the data transformation for Pig - 6.5

Talend Open Studio for Big Data Getting Started Guide

Author: Talend Documentation Team
Version: 6.5
Product: Talend Open Studio for Big Data
Task: Design and Development, Installation and Upgrade
Platform: Talend Studio

The tPigMap component is configured to join the movie data and the director data.

Once the movie data and the director data are loaded into the Job, configure the tPigMap component to join them and produce the expected output.

Procedure

  1. Double-click tPigMap to open its Map Editor view.
  2. Drop the movieID, title, releaseYear and url columns from the left side onto each of the output flow tables.

    On the input side (left side) of the Map Editor, each of the two tables represents one of the input flows: the upper one is the main flow and the lower one is the lookup flow.

    On the output side (right side), the two tables represent the output flows that you named out1 and reject when you linked tPigMap to tPigStoreResult in Dropping and linking components.

  3. On the input side, drop the directorID column from the main flow table to the Expr.key column of the ID row in the lookup flow table.

    This way, the join key between the main flow and the lookup flow is defined.

  4. Drop the directorID column from the main flow table to the reject table on the output side and drop the Name column from the lookup flow table to the out1 table.

    The configuration in the previous two steps describes how the columns of the input data are mapped to the columns of the output data flow.

    From the Schema editor view in the lower part of the editor, you can see the schemas on both sides have been automatically completed.

  5. On the out1 output flow table, click the button to display the editing field for the filter expression.
  6. Enter row1.directorID is not null

    This allows tPigMap to output only the movie records whose directorID field is not empty. A record with an empty directorID field is filtered out of the out1 flow.

  7. On the reject output flow table, click the button to open the settings panel.
  8. In the Catch Output Reject row, select true to output the records with empty directorID fields in the reject flow.
  9. Click Apply, then click OK to validate these changes and accept the propagation prompted by the pop-up dialog box.

Results

The transformation is now configured to complete the movie records with the names of their directors and to write the movie records that do not contain any director data into a separate data flow.
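
To make the effect of this configuration easier to picture, the following is a minimal Python sketch of the same logic. It is not the code that the Studio generates, only an illustration under stated assumptions: the sample rows are invented, the function name map_movies is hypothetical, and a movie whose directorID has no match in the lookup flow is assumed to keep an empty Name (left outer join behavior), which this section does not state explicitly.

```python
# Hypothetical sample rows; the real data is loaded earlier in the Job.
movies = [
    {"movieID": 1, "title": "Movie A", "releaseYear": 2001,
     "url": "http://example.com/a", "directorID": 10},
    {"movieID": 2, "title": "Movie B", "releaseYear": 2005,
     "url": "http://example.com/b", "directorID": None},
]
directors = [{"ID": 10, "Name": "Director X"}]

# Columns dropped onto both output tables in step 2.
SHARED_COLUMNS = ("movieID", "title", "releaseYear", "url")


def map_movies(movies, directors):
    """Illustrative stand-in for the tPigMap configuration (hypothetical helper)."""
    # Step 3: the join key is main.directorID = lookup.ID.
    name_by_id = {d["ID"]: d["Name"] for d in directors}

    out1, reject = [], []
    for row in movies:
        shared = {col: row[col] for col in SHARED_COLUMNS}
        if row["directorID"] is not None:
            # Step 6: filter "row1.directorID is not null" on out1.
            # Assumption: an unmatched directorID yields an empty Name.
            out1.append({**shared, "Name": name_by_id.get(row["directorID"])})
        else:
            # Step 8: Catch Output Reject collects the filtered-out records.
            reject.append({**shared, "directorID": row["directorID"]})
    return out1, reject


out1, reject = map_movies(movies, directors)
```

With these sample rows, the first movie goes to out1 with the director name attached, and the second movie, which has no directorID, goes to reject.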