Two tPigLoad components are configured to load data from HDFS into the
Job.
Procedure
-
Expand the Hadoop cluster node under the Metadata node in the Repository, then the my_cdh Hadoop connection node and its child node, to display the
movies schema metadata node you have set up under the HDFS folder, as explained in
Preparing file metadata.
-
Drop this schema metadata node onto the movie
tPigLoad component in the workspace of the
Job.
-
Double-click the movie
tPigLoad component to open its Component view.
This tPigLoad has automatically reused
the HDFS configuration and the movie metadata from the Repository to define the related parameters in
its Basic settings view.
-
From the Load function drop-down list, select PigStorage to use the
PigStorage function, a built-in Pig function, to load the movie data as a
structured text file. A Pig Latin sketch of the resulting load appears at the end of this section. For further information about the PigStorage function of
Pig, see PigStorage.
-
From the Hadoop connection node called my_cdh in the Repository,
drop the cdh_hdfs HDFS connection node under
the HDFS folder onto the tPigLoad component labelled director in the workspace of the Job.
This applies the configuration of the HDFS connection you
have created in the Repository to the
HDFS-related settings of the current tPigLoad component.
-
Double-click the director
tPigLoad component to open its Component view.
This tPigLoad has
automatically reused the HDFS configuration from the Repository to define the related parameters in its
Basic settings view.
-
Click the [...] button next to
Edit schema to open the schema
editor.
-
Click the [+] button twice to
add two rows, and in the Column column,
rename them to ID and Name, respectively.
-
Click OK to validate these
changes and accept the propagation prompted by the pop-up dialog box.
-
From the Load function
drop-down list, select PigStorage to use the
PigStorage function.
-
In the Input file URI field,
enter the directory where the director data is stored. As
explained in Uploading files to HDFS, this data has been written to /user/ychen/input_data/directors.txt.
-
Click the Field separator
field to open the Edit parameter using
repository dialog box and update the field separator.
You need to change this field separator because this tPigLoad is reusing the default separator, a
semicolon (;), that you defined for the HDFS metadata, while the director
data actually uses a comma (,) as its separator. The second Pig Latin sketch at the end of this section shows the resulting load.
-
Select Change to built-in
property and click OK to
validate your choice.
The Field separator
field becomes editable.
-
Enter a comma within double quotation marks.
Results
The tPigLoad components are now
configured to load the movie data and the director data into the Job.
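For reference, the movie tPigLoad configuration roughly corresponds to a Pig Latin LOAD statement that uses PigStorage with the semicolon separator inherited from the HDFS metadata. The sketch below is illustrative only: the input path and the column list are assumptions, because the actual values come from the movie metadata you defined in Preparing file metadata.

movie = LOAD '/user/ychen/input_data/movies.csv'  -- assumed path, for illustration only
    USING PigStorage(';')                         -- semicolon: the default separator defined in the HDFS metadata
    AS (movieID: int, title: chararray);          -- assumed columns; the real schema comes from the Repository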
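Similarly, the director tPigLoad roughly corresponds to the following LOAD statement. The path, the comma separator, and the ID and Name columns are those set in the procedure above; the column types are assumptions, because the schema editor step does not specify them.

director = LOAD '/user/ychen/input_data/directors.txt'
    USING PigStorage(',')                         -- comma: the separator the director data actually uses
    AS (ID: int, Name: chararray);                -- columns defined in the schema editor; types assumed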