The Pig components to be used are orchestrated in the Job workspace to compose a
Pig process for data transformation.
Procedure
-
In the Job workspace, enter the name of each component to be used and select
that component from the list that appears. In this scenario, the components are
two tPigLoad components, a tPigMap component, and two tPigStoreResult components.
-
The two tPigLoad
components are used to load the movie data and the director data,
respectively, from HDFS into the data flow of the current Job.
-
The tPigMap component
is used to transform the input data.
-
The tPigStoreResult
components write the results into given directories in HDFS. A rough Pig Latin equivalent of this data flow is sketched at the end of this procedure.
-
Double-click the label of one of the tPigLoad components to make this label editable, then enter
movie to change the label of this
tPigLoad.
-
Do the same to label the other tPigLoad component director.
-
Right-click the tPigLoad
component that is labelled movie, then from
the contextual menu, select Row > Pig
combine and click tPigMap to
connect this tPigLoad to the tPigMap component. This is the Main link through
which the movie data is sent to tPigMap.
-
Do the same to connect the director
tPigLoad component to tPigMap using the Row >
Pig combine link. This is the Lookup link through which the director data is sent to
tPigMap as lookup data.
-
Do the same to connect the tPigMap component to one of the tPigStoreResult components using the Row > Pig
combine link, then in the pop-up wizard, name this link
out1 and click OK to validate this change.
-
Repeat these operations to connect the tPigMap component to the other tPigStoreResult component using the Row
> Pig combine link and name this link reject.
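For reference, the data flow composed above corresponds, at a high level, to a Pig Latin script along the following lines. This is a minimal sketch only: the actual script is generated by the Studio from the component settings, and the HDFS paths, schemas, delimiters, and the left-outer-join-plus-filter logic shown here are assumptions used for illustration.

  -- Hypothetical HDFS paths, schemas, and delimiter; adjust them to your data.
  movie    = LOAD '/user/talend/input/movies' USING PigStorage(';')
             AS (id:int, title:chararray, release_year:int, director_id:int);
  director = LOAD '/user/talend/input/directors' USING PigStorage(';')
             AS (id:int, name:chararray);
  -- A left outer join keeps movies with no matching director so they can be rejected.
  joined = JOIN movie BY director_id LEFT OUTER, director BY id;
  out1   = FILTER joined BY director::id IS NOT NULL;
  reject = FILTER joined BY director::id IS NULL;
  STORE out1   INTO '/user/talend/output/out'    USING PigStorage(';');
  STORE reject INTO '/user/talend/output/reject' USING PigStorage(';');

In this sketch, the two LOAD statements play the role of the tPigLoad components, the join and filters approximate the out1 and reject outputs of tPigMap, and the STORE statements correspond to the two tPigStoreResult components.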
Results
Now the whole Job looks as follows in the workspace: