Amazon EMR - Big Data Batch Jobs - 7.3

Talend Big Data
Talend Big Data Platform
Talend Data Fabric
Talend Real-Time Big Data Platform
Talend Studio


This article shows how to run a Talend Big Data Batch Job on an Amazon EMR cluster. It uses the following Amazon services:
  • Amazon EC2
  • Amazon EMR
  • Amazon RDS

A common requirement is to supplement data in a MySQL database with data from another source, such as a file hosted on HDFS or a file on the local file system.
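Conceptually, this enrichment is a lookup join: each database row is completed with the matching record from the file. A minimal Python sketch of that logic, using hypothetical field names (customer_id, name, state) purely for illustration:

```python
# Illustrative sketch of the enrichment the Job performs.
# Field names are hypothetical, not taken from the article.

# Rows as they might come from the MySQL table.
db_rows = [
    {"customer_id": 1, "name": "Alice"},
    {"customer_id": 2, "name": "Bob"},
]

# Lookup data as it might come from the HDFS (or local) file,
# keyed on the same identifier.
file_lookup = {
    1: {"state": "CA"},
    2: {"state": "NY"},
}

# Enrich each database row with the matching file record;
# rows with no match are passed through unchanged.
enriched = [
    {**row, **file_lookup.get(row["customer_id"], {})}
    for row in db_rows
]
```

In the actual Job, this join is expressed with Talend components rather than hand-written code, but the data flow is the same.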

Create a Big Data Batch Job


  1. In the Repository, right-click Job designs and click Create Big Data Batch Job.
  2. In the wizard, enter a Name, and optionally a Purpose and Description.
  3. To define the Job as a Spark Batch Job, select Spark Batch in the Framework list.

    You can then design your Job as you would any other Talend Job.

Configure a MapReduce framework (deprecated)


  1. To configure your Job to run on your Amazon EMR cluster, open the Run view.
  2. In the Hadoop Configuration tab, reuse your cluster connection metadata: in the Property Type list, select Repository, then browse the Repository to find your Amazon EMR cluster metadata.
  3. Once the Job is designed and configured, you can run it.

    You can follow the execution of your Job in the Designer or in the Console.

    Talend Studio allows you to convert Jobs from one framework to another.

    Because Spark allows faster in-memory processing, you may be interested in the article Converting a MapReduce Job to a Spark Job.
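
    For reference, the cluster metadata stored in the Repository typically resolves to standard Hadoop connection properties such as the ones below. This is a sketch only: the host names are placeholders, and your EMR master node's DNS name will differ.

    ```
    # HDFS NameNode endpoint on the EMR master node (8020 is the EMR default)
    fs.defaultFS=hdfs://ec2-xx-xx-xx-xx.compute-1.amazonaws.com:8020
    # YARN ResourceManager endpoint (8032 is the default ResourceManager port)
    yarn.resourcemanager.address=ec2-xx-xx-xx-xx.compute-1.amazonaws.com:8032
    ```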