Use a Qubole distribution in Talend Studio
- Talend Studio V7.0 or later.
- A Talend JobServer V7.0 or later (same version as the Studio).
- The Qubole distribution zip file downloaded from Talend Exchange.
Configure the JobServer
Configure a Talend JobServer so that you can run your Job remotely on it.
A Talend JobServer is available only in a subscription-based Talend solution. If you are using a community solution, skip this section.
Create the JobServer:
In Studio V7.0, the default Java version for a JobServer is 1.8.
- Create an EC2 instance in the VPC where your Qubole cluster is running.
- Upload the JobServer to this instance and run it.
In the Studio, define this JobServer as a remote server. To do this, open the Preferences wizard, then in the Talend > Run/Debug > Remote view, click the [+] button to add lines and edit them to point to your JobServer. In the Host name column, enter the location of your EC2 instance for the JobServer, for example ec2-54-161-123-198.compute-1.amazonaws.com. In the Password column, leave the default values.
- Click Apply and then OK to validate the configuration.
The JobServer is now ready to be used to run your Job remotely.
Define the Qubole connection
- In the Repository tree view, right-click Hadoop cluster under the Metadata node to open the contextual menu.
- Select Create Hadoop cluster.
- In the first step of the wizard, enter the descriptive information about the connection to be created, such as the name of this connection and its purpose.
- Click Next to open the [Hadoop Configuration Import Wizard].
- Select Enter manually Hadoop services and click Finish.
- From the Distribution drop-down list, select Custom.
- Click the [...] button to import the Qubole zip file.
- Click OK to validate the import.
- Enter the connection information in the corresponding fields.
Any example values are for demonstration purposes only. You need to enter the actual service locations of your Qubole cluster.
- Click Finish to validate the creation. This new connection appears under the Hadoop Cluster node in Repository.
- Right-click this connection and, in the contextual menu, select Create HDFS.
- Follow the wizard to create the connection to the HDFS service of your Qubole cluster. The connection parameters are inherited from the parent Qubole connection, so keep them as they are unless you need to make further changes.
- At the last step of the wizard, click Check to verify the connection to the HDFS service. A message should appear to say that the connection is successful.
- Click Finish to validate the creation. This HDFS connection is displayed under the Qubole connection you previously defined in the Repository.
The Qubole connection is now ready to be used, for example, in a Talend Job.
Use the Qubole connection in a Spark Job
Use the Qubole connection previously defined in the Repository in a Talend Job.
In this example, a Job for Apache Spark is used. This type of Job is available only in a subscription-based Talend solution with Big Data. If you are using a community version of the Studio, you can create a Standard Job (a traditional Data Integration Job), for example with the HDFS components, to use this Qubole connection.
- Create a Spark Job and keep it open in the workspace of your Studio.
- Drop the Qubole HDFS connection from the Repository to this Spark Job. A component list pops up.
- Select the tHDFSConfiguration component.
If prompted, click OK to accept the update of the Hadoop configuration.
This will use your Qubole connection metadata to fill the fields in the Spark Configuration tab in the Run view of your Job.
- In the Spark Configuration tab, select the Spark version of your Qubole cluster. This version must be 2.0 or higher.
- Depending on how Spark has been implemented in your Qubole cluster, select YARN client or YARN cluster from the Spark mode drop-down list.
If you are not sure of the mode to be selected, contact the administrator of your cluster.
If you are using a JobServer, do the following to use it for the Job.
- If you are using the YARN client mode, select the Define the driver hostname or IP address check box and, in the field that is displayed, enter the JobServer EC2 DNS address.
- If you are using the YARN cluster mode, leave the Define the driver hostname or IP address check box clear.
- In the Target Exec tab of the Run view, select this JobServer.
The Qubole-related configuration for your Job is now done. Once you finish developing your Job, you can run it.
Qubole support matrix
The following table presents the supported and unsupported items of the Qubole configuration zip.
The term "supported" means Talend went through a complete QA validation process.
On the Qubole cluster side
On the Studio side
Known issue: SPARK_HOME not found
When running a Spark Job, you may encounter the following issue.
[ERROR]: org.apache.spark.SparkContext - Error initializing SparkContext. java.util.NoSuchElementException: key not found: SPARK_HOME
To resolve this issue, install a Spark client on the machine where the Job is executed. This machine is typically the one where the JobServer is installed.
- Stop the JobServer.
If you directly use the Studio to run your Job, stop the Studio.
- Download the supported Spark version from the Apache Spark website. In this example, download Spark release 2.0.2 (Nov 14 2016), package type Pre-built for Apache Hadoop 2.6.
- Upload the zip file to the JobServer machine and unzip it to the directory of your choice, for example to /tmp/spark-2.0.2-bin-hadoop2.6.
If you directly use the Studio to run your Job, perform these operations on the machine where your Studio is installed.
- Export the SPARK_HOME environment variable using the export command. If you have unzipped the downloaded Spark zip to /tmp/spark-2.0.2-bin-hadoop2.6, the command to be used is:
export SPARK_HOME=/tmp/spark-2.0.2-bin-hadoop2.6
- Restart the JobServer.
If you directly use the Studio to run your Job, restart the Studio.
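As a minimal sketch, assuming Spark was unpacked to /tmp/spark-2.0.2-bin-hadoop2.6 (the example path used above), the fix amounts to:

```shell
# Point SPARK_HOME at the unpacked Spark client. The path below is the
# example used in this article; adjust it to your actual install location.
export SPARK_HOME=/tmp/spark-2.0.2-bin-hadoop2.6

# An export only lasts for the current shell session. To make it persist
# across restarts, append it to the profile of the account that runs the
# JobServer (~/.bashrc here is an assumption; use your shell's profile file):
grep -q 'SPARK_HOME=' ~/.bashrc 2>/dev/null || \
  echo 'export SPARK_HOME=/tmp/spark-2.0.2-bin-hadoop2.6' >> ~/.bashrc
```

Restart the JobServer (or the Studio) from a session where SPARK_HOME is set, so that the process that launches the Spark driver inherits the variable.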