Joining movie and director information using a MapReduce Job

Talend Real-time Big Data Platform Getting Started Guide

This scenario demonstrates:

  1. How to create a Talend MapReduce Job. See Creating the MapReduce Job for details.

  2. How to drop and link the components to be used in a MapReduce Job. See Dropping and linking MapReduce components for details.

  3. How to configure the input components using the related metadata from the Repository. See Configuring the input data for details.

  4. How to configure the transformation to join the input data. See Configuring the data transformation for details.

  5. How to write the transformed data to HDFS. See Writing the output to HDFS for details.

Creating the MapReduce Job

A Talend MapReduce Job allows you to access and use the Talend MapReduce components to visually design MapReduce programs to read, transform or write data.

Prerequisites:

  • You have launched your Talend Studio and opened the Integration perspective.

Proceed as follows to create the MapReduce Job:

  1. In the Repository tree view, expand the Job Designs node, right-click the Big Data Batch node and select Create folder from the contextual menu.

  2. In the [New Folder] wizard, name your Job folder getting_started and click Finish to create your folder.

  3. Right-click the getting_started folder and select Create folder again.

  4. In the [New Folder] wizard, name the new folder mapreduce and click Finish to create it.

  5. Right-click the mapreduce folder and select Create Big Data Batch Job.

  6. In the [New Big Data Batch Job] wizard, select MapReduce from the Framework drop-down list.

  7. Enter a name for this MapReduce Job and other useful information.

    For example, enter aggregate_movie_director_mr in the Name field.

  8. Click Finish to create your Job.

    An empty Job is opened in the Studio.

The MapReduce component Palette is now available in the Studio. You can start to design the Job by leveraging this Palette and the Metadata node in the Repository.

Dropping and linking MapReduce components

You orchestrate the MapReduce components in the Job workspace in order to design a data transformation process that runs in the MapReduce framework.

Prerequisites:

  • You have launched your Talend Studio and opened the Integration perspective.

  • An empty Job has been created as described in Creating the MapReduce Job and is open in the workspace.

Proceed as follows to add and connect the components:

  1. In the Job, enter the name of the component to be used and select this component from the list that appears. In this scenario, the components are a tHDFSInput component, a tFileInputDelimited component, a tMap component, a tHDFSOutput component and a tFileOutputDelimited component.

    • The tHDFSInput and the tFileInputDelimited components are used to load the movie data and the director data, respectively, from HDFS into the data flow of the current Job.

    • The tMap component is used to transform the input data.

    • The tHDFSOutput and tFileOutputDelimited components write the results to the specified directories in HDFS.

  2. Double-click the label of the tHDFSInput component to make it editable, then enter movie to relabel this component.

  3. Do the same to change the label of the tFileInputDelimited component to director.

  4. Right-click the tHDFSInput component labelled movie, select Row > Main from the contextual menu, then click the tMap component to connect the two.

    This is the main link through which the movie data is sent to tMap.

  5. Do the same to connect the director tFileInputDelimited component to tMap using the Row > Main link.

    This becomes the Lookup link through which the director data is sent to tMap as lookup data.

  6. Do the same to connect the tMap component to tHDFSOutput using the Row > Main link, then, in the pop-up wizard, name this link out1 and click OK to validate the change.

  7. Repeat these operations to connect the tMap component to the tFileOutputDelimited component using the Row > Main link and name this link reject.

The whole Job is now laid out in the workspace.

Configuring the input data

The tHDFSInput and tFileInputDelimited components are configured to load data from HDFS into the Job.

Prerequisites:

  • The source files, movies.csv and directors.txt, have been uploaded to HDFS as explained in Uploading files to HDFS.

  • The metadata of the movies.csv file has been set up in the HDFS folder under the Hadoop cluster node in the Repository.

    If you have not done so, see Preparing file metadata to create the metadata.

Once the Job has been created and all the MapReduce components have been added and linked, you need to configure the input components to properly read data from HDFS.

  1. In the Repository, expand the Hadoop cluster node under the Metadata node, then the my_cdh Hadoop connection node and its child node, to display the movies schema metadata node you have set up under the HDFS folder as explained in Preparing file metadata.

  2. Drop this schema metadata node onto the movie tHDFSInput component in the workspace of the Job.

  3. Double-click the movie tHDFSInput component to open its Component view.

    This tHDFSInput has automatically reused the HDFS configuration and the movie metadata from the Repository to define the related parameters in its Basic settings view.

  4. Double-click the director tFileInputDelimited component to open its Component view.

  5. Click the [...] button next to Edit schema to open the schema editor.

  6. Click the [+] button twice to add two rows and in the Column column, rename them to ID and Name, respectively.

  7. Click OK to validate these changes and accept the propagation prompted by the pop-up dialog box.

  8. In the Folder/File field, enter or browse to the directory where the director data is stored. As explained in Uploading files to HDFS, this data has been written to /user/ychen/input_data/directors.txt.

  9. In the Field separator field, enter a comma (,), as this is the separator used by the director data (a sample of this format is shown after this procedure).

The input components are now configured to load the movie data and the director data into the Job.
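For reference, each record of the director data is expected to hold an ID and a Name separated by a comma, which is why the schema and the field separator are defined as above. The following lines are purely illustrative and do not reproduce the actual dataset:

    1,Director One
    2,Director Two
    3,Director Three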

Configuring the data transformation

The tMap component is configured to join the movie data and the director data.

Once the movie data and the director data are loaded into the Job, you need to configure the tMap component to join them to produce the output you expect.

  1. Double-click tMap to open its Map Editor view.

  2. Drop the movieID column, the title column, the releaseYear column and the url column from the left side onto each of the output flow tables.

    On the input side (left side) of the Map Editor, each of the two tables represents one of the input flows: the upper table for the main flow and the lower one for the lookup flow.

    On the output side (right side), the two tables represent the output flows that you named out1 and reject when you linked tMap to tHDFSOutput and tFileOutputDelimited in Dropping and linking MapReduce components.

  3. On the input side, drop the directorID column from the main flow table to the Expr.key column of the ID row in the lookup flow table.

    This way, the join key between the main flow and the lookup flow is defined.

  4. Drop the directorID column from the main flow table to the reject table on the output side and drop the Name column from the lookup flow table to the out1 table.

    The configuration in the previous two steps defines how the columns of the input data are mapped to the columns of the output data flows.

    From the Schema editor view in the lower part of the editor, you can see the schemas on both sides have been automatically completed.

  5. On the lookup flow table, click the button to display the settings panel for the join operation.

  6. In the Join model row, click the Value column and click the [...] button that is displayed.

    The [Options] window is displayed.

  7. Select Inner join in order to output only the records that contain join keys that exist in both the main flow and lookup flow.

  8. On the reject output flow table, click the button to open the settings panel.

  9. In the Catch Lookup inner join reject row, select true to output the records that are rejected by the inner join performed on the input side.

  10. Click Apply, then click OK to validate these changes and accept the propagation prompted by the pop-up dialog box.

The transformation is now configured to complete the movie data with the names of their directors and write the movie records that do not contain any director data into a separate data flow.
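To make the join easier to picture, here is a minimal standalone Java sketch of the logic that tMap applies in this Job: an inner join of the movie records against a director lookup keyed on directorID, with the non-matching records caught on the reject flow. The class name and the sample values are illustrative assumptions only; the actual Job reads its data from HDFS and runs the equivalent logic as generated MapReduce code.

    import java.util.ArrayList;
    import java.util.HashMap;
    import java.util.List;
    import java.util.Map;

    // Illustrative sketch of the tMap inner join with reject handling.
    public class JoinSketch {

        public static void main(String[] args) {
            // Lookup flow: directorID -> director Name (built from the director data).
            Map<Integer, String> directors = new HashMap<>();
            directors.put(1, "Director One");
            directors.put(2, "Director Two");

            // Main flow: movie records as {movieID, title, releaseYear, url, directorID}.
            List<String[]> movies = new ArrayList<>();
            movies.add(new String[] {"1", "Movie A", "1999", "http://example.com/a", "1"});
            movies.add(new String[] {"2", "Movie B", "2003", "http://example.com/b", "9"});

            for (String[] m : movies) {
                int directorID = Integer.parseInt(m[4]);
                String name = directors.get(directorID);
                if (name != null) {
                    // Inner join matched: sent to the out1 flow (movie columns plus the director Name).
                    System.out.println("out1:   " + m[0] + ";" + m[1] + ";" + m[2] + ";" + m[3] + ";" + name);
                } else {
                    // Catch Lookup inner join reject: sent to the reject flow (movie columns plus directorID).
                    System.out.println("reject: " + m[0] + ";" + m[1] + ";" + m[2] + ";" + m[3] + ";" + m[4]);
                }
            }
        }
    }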

Writing the output to HDFS

Two output components are configured to write the expected movie data and the rejected movie data to different directories in HDFS.

Prerequisites:

  • You have ensured that the client machine on which the Talend Jobs are executed can recognize the host names of the nodes of the Hadoop cluster to be used. For this purpose, add the IP address/hostname mapping entries for the services of that Hadoop cluster in the hosts file of the client machine.

    For example, if the host name of the Hadoop Namenode server is talend-cdh550.weave.local and its IP address is 192.168.x.x, the mapping entry reads 192.168.x.x talend-cdh550.weave.local.

  • The Hadoop cluster to be used has been properly configured and is running.

  • The administrator of the cluster has granted read/write rights and permissions to the username to be used to access the related data and directories in HDFS.

After the movie data and the director data have been transformed by tMap, you need to configure the two output components to write the output into HDFS.

  1. Double-click the tHDFSOutput component, which receives the out1 link.

    Its Basic settings view is opened in the lower part of the Studio.

  2. In the Folder field, enter or browse to the directory to which the result should be written. In this scenario, it is /user/ychen/output_data/mapreduce/out, which receives the records that contain the names of the movie directors.

  3. Select Overwrite from the Action drop-down list.

    This way, the target directory is overwritten if it exists.

  4. Select the Merge result to single file check box in order to merge the part files typically generated by MapReduce into a single file.

    The Merge file path field is displayed.

  5. In the Merge file path field, enter or browse to the file into which you want the part files to be merged.

    In this scenario, this file is /user/ychen/output_data/mapreduce/out/merged.

  6. Repeat the same operations to configure the tFileOutputDelimited component, which receives the reject link, but set the directory in the Folder field to /user/ychen/output_data/mapreduce/reject and leave the Merge result to single file check box cleared.

  7. In the Run view, click the Hadoop configuration tab to verify that the Hadoop connection metadata has been properly imported from the Repository.

    You always need to use this Hadoop configuration tab to define the connection to a given Hadoop distribution for the whole MapReduce Job; this connection is effective on a per-Job basis.

  8. Press F6 to run the Job.

    The Run view is automatically opened in the lower part of the Studio and shows the execution progress of this Job.

    The Job itself also shows the progress graphically.

Once done, you can check, for example in the web console of your HDFS system, that the output has been written in HDFS.

A merged file has also been created.
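As a rough guide, and assuming the default file naming of your Hadoop distribution, the result in HDFS should resemble the following layout (the part file name is shown for illustration only and varies with the Hadoop version):

    /user/ychen/output_data/mapreduce/out/merged          (single file merged from the out1 results)
    /user/ychen/output_data/mapreduce/reject/part-00000   (rejected records; no merge was requested)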