How a Talend MapReduce Job works - 6.3

Talend Data Fabric Studio User Guide


In Talend Studio, you design a MapReduce Job using the dedicated MapReduce components and configure the connection to the Hadoop cluster to be used. At runtime, this configuration allows the Studio to invoke the client API provided by Hadoop (the API package is org.apache.hadoop.mapred), submit the MapReduce Job to the ResourceManager service of that cluster, and copy the related Job resources to the cluster's distributed file system. The Hadoop cluster then handles the rest of the execution, such as initializing the Job, generating the Job ID, and sending the execution progress information and the result back to the Studio.
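As a rough illustration of what happens at submission time, the following sketch drives the classic org.apache.hadoop.mapred client API directly. This is not the code the Studio generates: the hostnames, HDFS paths, and the omission of explicit mapper/reducer classes are all placeholder assumptions.

```java
// Hedged sketch of a job submission through the classic
// org.apache.hadoop.mapred client API named above. Hostnames,
// paths, and other values are placeholders, not Studio output.
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.RunningJob;

public class SubmitSketch {
    public static void main(String[] args) throws Exception {
        JobConf conf = new JobConf(SubmitSketch.class);
        // Connection details normally come from the Studio's cluster configuration.
        conf.set("fs.defaultFS", "hdfs://namenode-host:8020");
        conf.set("mapreduce.framework.name", "yarn");
        conf.set("yarn.resourcemanager.address", "resourcemanager-host:8032");
        conf.setJobName("LOCALPROJECT_wordcount_0.1_tHDFSInput_1");
        // Generated code would also set the mapper and reducer classes here.
        conf.setOutputKeyClass(Text.class);
        conf.setOutputValueClass(IntWritable.class);
        FileInputFormat.setInputPaths(conf, new Path("/user/talend/input"));
        FileOutputFormat.setOutputPath(conf, new Path("/user/talend/output"));
        // Submission copies the job JAR and resources to the distributed file
        // system and hands the job to the ResourceManager; runJob then polls
        // the cluster and reports progress back to the client.
        RunningJob job = JobClient.runJob(conf);
        System.out.println("Job ID: " + job.getID());
    }
}
```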

Note that a Talend MapReduce Job is not equivalent to a MapReduce job as described in Apache's MapReduce documentation. A Talend MapReduce Job generates one or several MapReduce programs (jobs in Apache's sense), depending on how you design the Talend Job in the workspace of the Studio. When you create the Job, the progress bars that appear with the MapReduce components dropped in the workspace indicate how the MapReduce programs are generated and the execution progress of each map or reduce computation. The following image presents an example of a Talend MapReduce Job:

The progress bars below the components indicate when and how many map or reduce programs will be run during execution. At runtime, they also show the execution progress of each of these programs.

The execution information of a Talend MapReduce Job is logged by the JobHistory service of the Hadoop cluster being used, so you can consult the web console of this service for that information. The name of the Job in the console is automatically constructed as ProjectName_JobName_JobVersion_FirstComponentName_ComponentID, for example, LOCALPROJECT_wordcount_0.1_tHDFSInput_1.
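To make the naming pattern concrete, here is a small self-contained sketch that assembles a console name from its parts. The class and method names are illustrative, not part of Talend or Hadoop.

```java
// Minimal illustration of the
// ProjectName_JobName_JobVersion_FirstComponentName_ComponentID
// pattern described above. Names here are illustrative only.
public class JobNameSketch {
    static String consoleName(String project, String job, String version,
                              String firstComponent, int componentId) {
        return project + "_" + job + "_" + version + "_"
                + firstComponent + "_" + componentId;
    }

    public static void main(String[] args) {
        // Reproduces the example from the documentation.
        System.out.println(consoleName("LOCALPROJECT", "wordcount", "0.1",
                "tHDFSInput", 1));
        // prints LOCALPROJECT_wordcount_0.1_tHDFSInput_1
    }
}
```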