
Configuring and running your Spark Job with CDP Public Cloud Data Hub on AWS

Talend Studio allows you to deploy and execute your Spark Streaming and Spark Batch Jobs on a remote Talend JobServer with a CDP Public Cloud Data Hub on AWS instance.

Before you begin

Make sure:

Procedure

  1. Connect to your Cloudera Management Console, go to the Data Hub Clusters tab, and then open the Hardware tab.
  2. Make sure you have a gateway host available under the Gateway section. If no gateway is available, you must create a new one.
  3. Download Talend JobServer and install it on the gateway host.
  4. Connect to your AWS Management Console and, in the VPC Management Console, make sure that the ports you set up for Talend JobServer are open in both the Inbound rules and Outbound rules tabs.
  5. Connect to Cloudera Manager and, from the Clusters tab, download all the configuration files from your cluster, then unzip them all into the same path on your local machine.
  6. Connect to Talend Studio and manually set up the Hadoop connection using the Import configuration from local files option. For more information, see the third step in Setting up the Hadoop connection.
    Note:
    • You do not have to select any Cloudera version in the drop-down list. Because Talend Studio uses the configuration files from the CDP Public Cloud cluster, it uses the runtime version defined in those files.
    • You must enable SSL and Kerberos.
  7. Run your Job on the Talend JobServer. For more information, see Running a Job remotely.
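
Steps 4 and 5 above can be checked from your local machine before you configure Talend Studio. The following is a minimal sketch, not part of Talend or Cloudera tooling: the function names `extract_configs` and `port_is_open` are hypothetical helpers, and all paths, host names, and port numbers are assumptions you would replace with your own values. The port check only confirms that a TCP connection from your machine to the gateway succeeds, so it exercises the inbound rules, not the outbound ones.

```python
import socket
import zipfile
from pathlib import Path


def extract_configs(zip_dir, target_dir):
    """Unzip every *.zip archive found in zip_dir into one target directory.

    Returns the sorted file names of all *.xml files found under target_dir,
    so you can confirm that files such as core-site.xml were extracted.
    """
    target = Path(target_dir)
    target.mkdir(parents=True, exist_ok=True)
    for archive in sorted(Path(zip_dir).glob("*.zip")):
        with zipfile.ZipFile(archive) as zf:
            zf.extractall(target)
    return sorted(p.name for p in target.rglob("*.xml"))


def port_is_open(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and DNS resolution failures.
        return False
```

For example, you could call `extract_configs("downloads", "cdp-conf")` on the folder holding the archives downloaded from Cloudera Manager, and then call `port_is_open(gateway_host, port)` for each port you configured for Talend JobServer in the security group. A `False` result suggests checking the inbound rules, or whether JobServer is running on the gateway.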

Results

You are now able to use a CDP Public Cloud Data Hub on AWS instance with Talend Studio.
