Defining Amazon EMR connection parameters with Spark Universal - Cloud - 8.0

Talend Studio User Guide

Version
Cloud
8.0
Language
English
Product
Talend Big Data
Talend Big Data Platform
Talend Cloud
Talend Data Fabric
Talend Data Integration
Talend Data Management Platform
Talend Data Services Platform
Talend ESB
Talend MDM Platform
Talend Real-Time Big Data Platform
Module
Talend Studio
Content
Design and Development
Last publication date
2024-02-29
Available in...

Big Data

Big Data Platform

Cloud Big Data

Cloud Big Data Platform

Cloud Data Fabric

Data Fabric

Real-Time Big Data Platform

When you run your Spark Jobs on a YARN cluster using the Amazon EMR distribution, you need to distribute the libraries manually, because Amazon EMR does not provide the same classpath on the main and subordinate nodes.

About this task

Complete the following actions using a command prompt to distribute the libraries between main and subordinate nodes.

Procedure

  1. Upload the PEM file to the cluster:
    
    scp -i username_EC2.pem username_EC2.pem hadoop@<mainNode>:/home/hadoop
  2. Confirm that the PEM file has the correct permissions:
    ssh -i username_EC2.pem hadoop@<mainNode>
    ls -al
    The correct permissions must be as follows:
     -r--------  1 username username    1674 Apr 11 16:26  username_EC2.pem
  3. Optional: If the PEM file does not have the correct permissions, change the permissions as follows:
    
    chmod a-rwx username_EC2.pem
    chmod  u+r username_EC2.pem
  4. Go to your Amazon EMR instance, and find the hostnames of the subordinate nodes.
  5. Copy the JAR files from the main node to the subordinate nodes:
    scp -i /home/hadoop/username_EC2.pem /usr/lib/spark/jars/*.jar hadoop@<subordinateNode>:/home/hadoop
  6. Connect to each subordinate node from the main node:
    ssh -i /home/hadoop/username_EC2.pem hadoop@<subordinateNode>
  7. Move the JAR files into the Spark classpath:
    sudo mv /home/hadoop/*.jar /usr/lib/spark/jars
  8. Open Talend Studio and then open your Spark Job.
  9. Click the Run view beneath the design workspace, then click the Spark configuration view.
  10. In the Advanced properties table, add the spark.hadoop.dfs.client.use.datanode.hostname property and set its value to true. This setting makes the HDFS client connect to the data nodes by hostname rather than by internal IP address.
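The copy and move steps above (steps 5 through 7) must be repeated for every subordinate node. They can be sketched as a small shell helper run from the main node; the function name, the example hostnames, and the PEM path below are illustrative, not part of the product:

```shell
# Sketch only: copy the Spark JARs from the main node to each subordinate
# node given as an argument, then move them into the Spark classpath there.
distribute_jars() {
    pem="$1"
    shift
    for node in "$@"; do
        # Copy the JARs into the remote hadoop home directory...
        scp -i "$pem" /usr/lib/spark/jars/*.jar "hadoop@$node:/home/hadoop"
        # ...then move them into place with root privileges.
        ssh -i "$pem" "hadoop@$node" 'sudo mv /home/hadoop/*.jar /usr/lib/spark/jars'
    done
}
```

Example invocation, with your own PEM file and subordinate node hostnames: distribute_jars /home/hadoop/username_EC2.pem subordinate-host-1 subordinate-host-2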

Results

Your Spark Job is correctly configured to run in YARN cluster mode with Amazon EMR distribution.