Convert a MapReduce Job to a Spark Job
This example uses Talend Big Data Real Time Platform 6.1
Talend Studio allows you to convert Jobs from one framework to another. Because Spark offers faster, in-memory processing, we will convert a MapReduce Job to a Spark Job.
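The speed difference comes from where intermediate results live. As a conceptual sketch (plain Python, not Talend-generated code; the data and stages are invented for illustration), a MapReduce-style pipeline writes each stage's output to storage, while a Spark-style pipeline keeps it in memory:

```python
import json
import os
import tempfile

# Conceptual illustration only (not Talend or Spark code): contrast a pipeline
# that round-trips intermediate results through disk with one that keeps them
# in memory.

data = list(range(1_000))

# "MapReduce-style": every stage writes its result to storage, and the next
# stage reads it back.
with tempfile.TemporaryDirectory() as tmp:
    stage1 = os.path.join(tmp, "stage1.json")
    with open(stage1, "w") as f:
        json.dump([x * 2 for x in data], f)   # map stage -> disk
    with open(stage1) as f:
        mr_result = sum(json.load(f))         # reduce stage <- disk

# "Spark-style": the intermediate result stays in memory between stages.
intermediate = [x * 2 for x in data]          # kept in memory
spark_style_result = sum(intermediate)

assert mr_result == spark_style_result
```

Both pipelines compute the same answer; the in-memory variant simply avoids the serialization and I/O between stages, which is the source of Spark's speed advantage.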
- In the Repository, right-click your Job and click Duplicate.
- Name your new Job. Then, in the Job Type list, keep the Big Data Batch option, and in the Framework list, select Spark:
Note: Using the same procedure, you can duplicate your Job as a Standard, Big Data Batch, or Big Data Streaming Job. Depending on the Job type, you can then select the framework of your choice.
- Click OK. Your Job will appear in the Repository. Double-click the Job to open it in the Designer:
- A tHDFSConfiguration component has been automatically added. Spark is not tied to a particular file system, so the file system used for storage must be defined with a dedicated component such as tHDFSConfiguration or tS3Configuration.
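In plain Spark terms, such a component supplies the file-system settings the Job will use. A hypothetical spark-defaults.conf fragment (the property names are standard Hadoop/Spark settings; the host names and credentials are placeholders, not values from this example) might look like:

```
# Default file system for the Job (what tHDFSConfiguration provides):
spark.hadoop.fs.defaultFS        hdfs://namenode:8020

# For S3 storage (what tS3Configuration provides), s3a settings instead:
spark.hadoop.fs.s3a.access.key   <access-key>
spark.hadoop.fs.s3a.secret.key   <secret-key>
```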
- To see the cluster connection metadata, double-click the tHDFSConfiguration component.
- As for Big Data Batch – MapReduce Jobs, the connection to the cluster is configured in the Run view. In the Run view, click the Spark Configuration tab to see the cluster connection information retrieved from the Repository:
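Under the hood, the fields in the Spark Configuration tab correspond to standard Spark submission properties. A hypothetical fragment (standard Spark/YARN property names; the host names are placeholders) could be:

```
spark.master                                yarn
spark.submit.deployMode                     client
spark.hadoop.yarn.resourcemanager.address   resourcemanager:8032
spark.hadoop.fs.defaultFS                   hdfs://namenode:8020
```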
- Run your Job and follow the execution in the Designer or in the Console, in the same way as for Big Data Batch – MapReduce Jobs:
When converting a Job from one type to another, or from one framework to another, make sure that all components have been loaded successfully before running the Job. Note that the Palette and the list of available components change depending on the Job type and the framework used.