Setting up the Remote Engine Gen2 for Amazon EMR - Cloud

Talend Remote Engine Gen2 Quick Start Guide


For the Remote Engine Gen2 AMI to work with Amazon EMR, the Livy instance running within your Remote Engine Gen2 needs the Hadoop configuration files from the target EMR cluster.

Note: EMR run profiles are only available upon customer request. Contact Talend Support for more information.

Before you begin

  • You have an EMR cluster that is running in the same VPC as your Remote Engine Gen2. For more information on how to create your security group and VPC, see creating-the-remote-engine-gen2-using-aws-cloudformation_c.html.
  • The Remote Engine Gen2 needs access to both the main and secondary instances of the EMR cluster. You can either configure the security groups of the EMR instances to give full access to the security group of the engine, or open only the YARN service ports listed in the Amazon documentation.
  • You must use the root user to submit the pipelines through the Livy server.
  • The following impersonation parameters must be defined on the cluster side, in the core-site.xml file:
    <property>
      <name>hadoop.proxyuser.root.groups</name>
      <value>*</value>
    </property>
    <property>
      <name>hadoop.proxyuser.root.hosts</name>
      <value>*</value>
    </property>
    • You can either configure the cluster when creating it, as described in the Amazon documentation, with the following parameters:
      [
        {
          "Classification": "core-site",
          "Properties": {
            "hadoop.proxyuser.root.hosts": "*",
            "hadoop.proxyuser.root.groups": "*"
          }
        }
      ]
    • or you can add these parameters after the cluster has started; in this case, you need to restart the cluster as described in the Amazon documentation for the changes to take effect.
  • Port information: make sure the Remote Engine Gen2 application allows connections to port 9005 from the EMR instances.
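As an illustration, the port 9005 rule can be added with the AWS CLI. The security group IDs below are placeholders; replace them with the groups of your own engine and EMR instances:

```shell
# Hypothetical IDs -- replace with the security groups from your own VPC.
ENGINE_SG="sg-0123456789abcdef0"   # Remote Engine Gen2 security group
EMR_SG="sg-0fedcba9876543210"      # EMR instances' security group

# Allow inbound TCP 9005 on the engine from the EMR security group.
# (Requires the AWS CLI and valid credentials; shown here for illustration.)
if command -v aws >/dev/null 2>&1; then
  aws ec2 authorize-security-group-ingress \
    --group-id "$ENGINE_SG" \
    --protocol tcp --port 9005 \
    --source-group "$EMR_SG" || true
fi
```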

Procedure

  1. Copy the following configuration files from /etc/hadoop/conf to the /opt/talend/data/etc/hadoop folder in the Remote Engine Gen2 client instance:
    • hdfs-site.xml
    • mapred-site.xml
    • yarn-site.xml
    • core-site.xml
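Assuming SSH access to the EMR main node as the hadoop user, one way to pull these files is with scp. The host name below is a placeholder; the loop prints the commands so you can review them before running them on the engine host:

```shell
# Placeholder host; replace with the address of your EMR main node.
EMR_MAIN_NODE="hadoop@emr-main-node"
DEST=/opt/talend/data/etc/hadoop
FILES="hdfs-site.xml mapred-site.xml yarn-site.xml core-site.xml"

# Build one scp command per configuration file; review, then run them.
CMDS=$(for f in $FILES; do
  echo "scp ${EMR_MAIN_NODE}:/etc/hadoop/conf/${f} ${DEST}/"
done)
echo "$CMDS"
```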
  2. On the Remote Engine Gen2 side, edit the following files to match the Talend Pipeline Designer requirements:
    • core-site.xml:

      Set the property io.compression.codecs to the value
      org.apache.hadoop.io.compress.GzipCodec,org.apache.hadoop.io.compress.DefaultCodec,org.apache.hadoop.io.compress.BZip2Codec,org.apache.hadoop.io.compress.SnappyCodec
    • hdfs-site.xml:

      Add the property dfs.client.use.datanode.hostname with the value true

    • yarn-site.xml:

      Set the property yarn.timeline-service.enabled to the value false
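A minimal sketch of one such edit, using GNU sed to append a property before the closing configuration tag. It is demonstrated here on a throwaway copy; on the engine, point HDFS_SITE at /opt/talend/data/etc/hadoop/hdfs-site.xml instead:

```shell
# Throwaway stand-in for the real hdfs-site.xml on the engine.
HDFS_SITE=$(mktemp)
cat > "$HDFS_SITE" <<'EOF'
<configuration>
</configuration>
EOF

# Insert the property just before the closing </configuration> tag.
sed -i 's|</configuration>|  <property>\n    <name>dfs.client.use.datanode.hostname</name>\n    <value>true</value>\n  </property>\n</configuration>|' "$HDFS_SITE"
cat "$HDFS_SITE"
```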

  3. On the EMR cluster side, create the /user/talend folder in HDFS:
    hadoop fs -mkdir -p /user/talend
    hadoop fs -chown -R talend:talend /user/talend
    Note: You need to connect to the main node with SSH as the hadoop user, which has the rights to execute these commands.
  4. On the EMR cluster side, create two folders in the location of your choice and give them root:root ownership.

    Example

    hadoop fs -mkdir -p /talend/deps/pdesigner
    hadoop fs -mkdir -p /talend/deps/runtime
    hadoop fs -chown -R root:root /talend/deps
    Note: You must give the root user ownership of these folders so that the libraries can be uploaded by the emr-init.sh script.
  5. Back on the Remote Engine Gen2 side, export the following environment variables:
    • HDFS_DSS_DEPENDENCIES_PATH to /talend/deps/pdesigner
    • HDFS_RUNTIME_DEPENDENCIES_PATH to /talend/deps/runtime
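For example, in the shell session that will run the emr-init.sh script:

```shell
# The paths must match the HDFS folders created on the EMR cluster earlier.
export HDFS_DSS_DEPENDENCIES_PATH=/talend/deps/pdesigner
export HDFS_RUNTIME_DEPENDENCIES_PATH=/talend/deps/runtime
```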
  6. Execute the script emr-init.sh located in /opt/talend/emr.
  7. Restart Livy by executing this command:
    cd /opt/talend && docker-compose restart livy

What to do next

  • Connect to Talend Cloud Management Console, go to the Engines page and create a Big Data Run Profile.
  • Define the following mandatory properties in the Run profile (the values may differ depending on your setup):
    spark.yarn.archive=hdfs:///talend/deps/runtime/spark-runtime.zip
    spark.dss.dependencies.path=hdfs:///talend/deps/pdesigner
  • Link the Run profile to the Remote Engine Gen2 that you previously set up.