Amazon EMR - Big Data Batch Jobs

EnrichVersion
6.4
6.3
6.2
6.1
6.0
5.6
EnrichProdName
Talend Real-Time Big Data Platform
Talend Data Fabric
Talend Big Data Platform
Talend Big Data
task
Design and Development > Designing Jobs > Job Frameworks > Spark Batch
Design and Development > Designing Jobs > Job Frameworks > MapReduce
Design and Development > Designing Jobs > Hadoop distributions > Amazon EMR
EnrichPlatform
Talend Studio


This article shows how to run a Talend Big Data Batch Job using the MapReduce framework on an Amazon EMR cluster. It uses the following Amazon services:
  • Amazon EC2
  • Amazon EMR
  • Amazon RDS

A common requirement is to supplement data in a MySQL database with data from another source, such as a file hosted on HDFS, or a file on the local file system.
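Logically, such a Job joins each database row with the matching file record. The following sketch illustrates that enrichment with hypothetical sample data; the table schema, the `id;city` file layout, and the `enrich` helper are all assumptions for illustration, not Talend-generated code.

```python
# Illustrative data only: a stand-in for rows from a MySQL table (e.g. on
# Amazon RDS) and for lines of a delimited file on HDFS or the local
# file system. Assumed file layout: id;city
customers = [
    {"id": 1, "name": "Alice"},
    {"id": 2, "name": "Bob"},
]
file_lines = ["1;Paris", "2;Berlin"]

def enrich(rows, lines):
    """Supplement each database row with the matching file record, joined on id."""
    cities = {}
    for line in lines:
        key, city = line.split(";")
        cities[int(key)] = city
    return [dict(row, city=cities.get(row["id"])) for row in rows]

enriched = enrich(customers, file_lines)
```

A missing id in the file simply yields `city=None` for that row; in a real Job you would decide whether to reject or keep such rows.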

Create a Big Data Batch Job using the MapReduce framework

Procedure

  1. In the Repository, right-click Job designs and click Create Big Data Batch Job.
  2. In the wizard, enter a Name, Purpose, and Description, as for any standard Job.
  3. To define the Job as a MapReduce job, in the Framework list, select MapReduce.

    Select Spark here instead if you want to create a Big Data Batch Job using the Spark framework.

    You can then design your Job as you would any other Talend Job.
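Under the MapReduce framework, the designed flow is executed as map, shuffle, and reduce phases on the cluster. As a conceptual illustration only (this is not Talend's generated code, and the record layouts are assumed), the enrichment scenario described earlier could run as a reduce-side join:

```python
from collections import defaultdict

# Mappers tag each record with its source and emit (key, tagged value).
def map_db(row):            # mapper for the database side
    return (row["id"], ("db", row["name"]))

def map_file(line):         # mapper for the file side; assumed layout "id;city"
    key, city = line.split(";")
    return (int(key), ("file", city))

def run_join(db_rows, lines):
    # Shuffle phase: group every mapper output by its key.
    groups = defaultdict(list)
    for key, value in [map_db(r) for r in db_rows] + [map_file(l) for l in lines]:
        groups[key].append(value)
    # Reduce phase: merge the values sharing a key into one enriched record.
    joined = []
    for key in sorted(groups):
        record = {"id": key}
        for tag, value in groups[key]:
            record["name" if tag == "db" else "city"] = value
        joined.append(record)
    return joined

result = run_join([{"id": 1, "name": "Alice"}], ["1;Paris"])
```

On a real cluster the shuffle happens across nodes, but the grouping-by-key logic is the same.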

Configure a Big Data Batch Job using the MapReduce framework

Procedure

  1. To configure your Job to run on your Amazon EMR cluster, open the Run view.
  2. In the Hadoop Configuration tab, reuse your cluster connection metadata: in the Property Type list, select Repository, then browse the Repository to select your Amazon EMR cluster metadata.
  3. Once the Job is designed and configured, run it.

    You can follow the execution of your Job in the Designer or in the Console.

    Talend Studio also allows you to convert Jobs from one framework to another.

    Because Spark offers faster in-memory processing, you may be interested in the article Converting a MapReduce Job to a Spark Job.