Configuring the input data for Pig - 6.5

Talend Open Studio for Big Data Getting Started Guide

author: Talend Documentation Team
EnrichVersion: 6.5
EnrichProdName: Talend Open Studio for Big Data
task: Design and Development, Installation and Upgrade
EnrichPlatform: Talend Studio

Two tPigLoad components are configured to load data from HDFS into the Job.

Before you begin

  • The source files, movies.csv and directors.txt, have been uploaded into HDFS as explained in Uploading files to HDFS.

  • The metadata of the movies.csv file has been set up in the HDFS folder under the Hadoop cluster node in the Repository.

    If you have not done so, see Preparing file metadata to create the metadata.

Procedure

  1. Expand the Hadoop cluster node under the Metadata node in the Repository, then expand the my_cdh Hadoop connection node and its child node to display the movies schema metadata node you have set up under the HDFS folder, as explained in Preparing file metadata.
  2. Drop this schema metadata node onto the movie tPigLoad component in the workspace of the Job.
  3. Double-click the movie tPigLoad component to open its Component view.

    This tPigLoad has automatically reused the HDFS configuration and the movie metadata from the Repository to define the related parameters in its Basic settings view.

  4. From the Load function drop-down list, select PigStorage to use the PigStorage function, a built-in function from Pig, to load the movie data as a structured text file. For further information about the PigStorage function of Pig, see PigStorage.
  5. From the Hadoop connection node called my_cdh in the Repository, drop the cdh_hdfs HDFS connection node under the HDFS folder onto the tPigLoad component labelled director in the workspace of the Job.

    This applies the configuration of the HDFS connection you have created in the Repository to the HDFS-related settings of the current tPigLoad component.

  6. Double-click the director tPigLoad component to open its Component view.

    This tPigLoad has automatically reused the HDFS configuration from the Repository to define the related parameters in its Basic settings view.

  7. Click the [...] button next to Edit schema to open the schema editor.
  8. Click the [+] button twice to add two rows and in the Column column, rename them to ID and Name, respectively.
  9. Click OK to validate these changes and accept the propagation prompted by the pop-up dialog box.
  10. From the Load function drop-down list, select PigStorage to use the PigStorage function.
  11. In the Input file URI field, enter the directory where the director data is stored. As explained in Uploading files to HDFS, this data has been written to /user/ychen/input_data/directors.txt.
  12. Click the Field separator field to open the Edit parameter using repository dialog box, in which you can update the field separator.

    You need to change this field separator because this tPigLoad is reusing the default semicolon (;) separator you defined for the HDFS metadata, while the director data actually uses a comma (,) as its separator.

  13. Select Change to built-in property and click OK to validate your choice.

    The Field separator field becomes editable.

  14. Enter a comma within double quotation marks: ",".
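Behind the graphical configuration, each tPigLoad component corresponds to a Pig Latin LOAD statement using the PigStorage function. The following sketch illustrates what the two components set up above roughly amount to; the movies.csv path, the semicolon separator for the movie data, and the column types of the director schema are assumptions based on this tutorial, not values generated by the Studio:

```pig
-- Rough Pig Latin equivalent of the two tPigLoad components.
-- The movies.csv path is assumed to sit next to directors.txt in HDFS.
movie = LOAD '/user/ychen/input_data/movies.csv'
        USING PigStorage(';');  -- semicolon separator from the HDFS metadata

-- The director data uses a comma separator (steps 12-14) and the two
-- columns, ID and Name, defined in the schema editor (step 8); the
-- types shown here are assumptions.
director = LOAD '/user/ychen/input_data/directors.txt'
           USING PigStorage(',') AS (ID: int, Name: chararray);
```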

Results

The tPigLoad components are now configured to load the movie data and the director data into the Job.