Amazon EMR - Big Data Batch Jobs
- Amazon EC2
- Amazon EMR
- Amazon RDS
A common requirement is to supplement data in a MySQL database with data from another source, such as a file hosted on HDFS, or a file on the local file system.
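To make this enrichment pattern concrete outside of the Studio, here is a minimal sketch in plain Python. It is not what Talend generates. An in-memory SQLite database stands in for the MySQL database and an in-memory CSV string stands in for the file on HDFS or the local file system. The table name `customers`, the columns `id`, `name` and `country`, and the helper `enrich` are all hypothetical.

```python
import csv
import io
import sqlite3

# Assumption: an in-memory SQLite database stands in for the MySQL database.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT)")
conn.executemany(
    "INSERT INTO customers (id, name) VALUES (?, ?)",
    [(1, "Alice"), (2, "Bob"), (3, "Carol")],
)

# Assumption: this string stands in for a file hosted on HDFS or stored locally.
csv_text = "id,country\n1,FR\n3,US\n"


def enrich(connection, csv_file):
    """Left-join database rows with lookup values read from a CSV file."""
    # Build an id -> country lookup from the supplementary file.
    lookup = {int(row["id"]): row["country"] for row in csv.DictReader(csv_file)}
    rows = connection.execute("SELECT id, name FROM customers ORDER BY id")
    # Keep every database row; rows without a match get None for country.
    return [(cid, name, lookup.get(cid)) for cid, name in rows]


enriched = enrich(conn, io.StringIO(csv_text))
for record in enriched:
    print(record)
```

Every database row is kept, and rows with no match in the file carry an empty value, which is the behavior of a left outer join. In a Talend Job, the same logic would be designed graphically rather than written by hand.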
Create a Big Data Batch Job
- In the Repository, right-click Job Designs, then click Create Big Data Batch Job.
- In the wizard, enter the Name, Purpose, and Description as usual.
To define the Job as a Spark Batch Job, select Spark Batch in the Framework list.
You can then design your Job like any other Talend Job.
Configure a MapReduce framework (deprecated)
- To configure your Job to run on your Amazon EMR cluster, open the Run view.
In the Hadoop Configuration tab, use your cluster connection metadata: in the Property Type list, select Repository, then browse the Repository to find your Amazon EMR cluster metadata.
Once designed and configured, you can run your Job.
You can follow the execution of your Job in the Designer or in the Console.
Talend Studio allows you to convert Jobs from one framework to another.
Because Spark allows faster in-memory processing, you may be interested in the article Converting a MapReduce Job to a Spark Job.