Use a Qubole distribution in Talend Studio

author: Talend Documentation Team
EnrichVersion: 7.0
EnrichProdName: Talend Big Data, Talend Big Data Platform, Talend Data Fabric, Talend Open Studio for Big Data, Talend Real-Time Big Data Platform
task: Design and Development > Designing Jobs > Hadoop distributions
EnrichPlatform: Talend Studio

Use a Qubole distribution in Talend Studio

This article describes how to import a Qubole configuration ZIP file into the Studio so that you can work with a Qubole cluster.
Prerequisites:
  • A Talend Studio from V7.0 onwards.

  • A Talend JobServer from V7.0 onwards (same version as the Studio).

  • The Qubole distribution zip file downloaded from Talend Exchange.

Configure the JobServer

Configure the JobServer so that you can run your Job remotely on it.

A Talend JobServer is available only in a subscription-based Talend solution. If you are using a community solution, skip this section.

Procedure

  1. Set up the JobServer (a command-line sketch is given after step 3 below).
    1. Create an EC2 instance in the VPC where your Qubole cluster is running.
    2. Upload the JobServer to this instance and run it.
    In Studio V7.0, the default Java version for the JobServer is 1.8.
  2. In the Studio, define this JobServer as a remote server. To do this, open Window > Preferences, then in the Preferences wizard open Talend > Run/Debug > Remote, click the [+] button to add a line, and edit it with the connection details of your JobServer. For example, ec2-54-161-123-198.compute-1.amazonaws.com is the DNS address of the EC2 instance hosting the JobServer used in this article. In the Password column, leave the default values.

  3. Click Apply and then OK to validate the configuration.
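
The following is a minimal command-line sketch of step 1, assuming a Linux EC2 instance with Java 8 already installed and the JobServer downloaded as a ZIP archive. The archive name, user, and start script name below are placeholders that may differ in your JobServer version; the DNS address is the example used in this article.

  # Copy the JobServer archive to the EC2 instance running in the Qubole VPC
  # (replace the key file, archive name, and DNS address with your own values).
  scp -i my-key.pem Talend-JobServer-V7.0.1.zip ec2-user@ec2-54-161-123-198.compute-1.amazonaws.com:~/

  # Connect to the instance, extract the archive, and start the JobServer.
  ssh -i my-key.pem ec2-user@ec2-54-161-123-198.compute-1.amazonaws.com
  unzip Talend-JobServer-V7.0.1.zip -d jobserver && cd jobserver
  java -version      # should report 1.8, the default Java version expected by the JobServer
  ./start_rs.sh      # start script; the exact name and location can vary with the JobServer version

Also make sure the security group of the instance allows the Studio machine to reach the JobServer ports you configure in step 2.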

Results

The JobServer is now ready to be used to run your Job remotely.

Define the Qubole connection

Define the connection to your Qubole cluster in the Studio. This connection can be reused by different Talend items.

Procedure

  1. In the Repository tree view, right-click Hadoop cluster under the Metadata node to open the contextual menu.
  2. Select Create Hadoop cluster.
  3. In the first step of the wizard, enter the descriptive information about the connection to be created, such as the name of this connection and its purpose.
  4. Click Next to open [Hadoop Configuration Import Wizard].
  5. Select Enter manually Hadoop services and click Finish.
  6. From the Distribution drop-down list, select Custom.
  7. Click the [...] button to import the Qubole zip file (QuboleExchange.zip).
  8. Click OK to validate the import.
  9. Enter the connection information in the corresponding fields.

    Enter the actual service locations of your Qubole cluster; any example values given in this article are for demonstration purposes only.

  10. Click Finish to validate the creation. This new connection appears under the Hadoop Cluster node in Repository.
  11. Right-click this connection and, in the contextual menu, select Create HDFS.
  12. Follow the wizard to create the connection to the HDFS service of your Qubole cluster. The connection parameters are inherited from the parent Qubole connection, so keep them as they are unless you need to make further changes.
  13. At the last step of the wizard, click Check to verify the connection to the HDFS service. A message should appear saying that the connection is successful. (An optional command-line check of the HDFS service is sketched after step 14.)
  14. Click Finish to validate the creation. This HDFS connection is displayed under the Qubole connection you previously defined in the Repository.
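
If you want to double-check the HDFS service outside the Studio, you can list the root of the file system from any machine that has a Hadoop client installed and network access to the cluster. This is optional and not part of the wizard; the NameNode host and port below are placeholders, so use the same values you entered in the connection wizard.

  # Optional sanity check of the HDFS service of the Qubole cluster.
  # Replace the host and port with the NameNode URI of your own cluster.
  hadoop fs -ls hdfs://<namenode-host>:8020/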

Results

The Qubole connection is now ready to be used, for example, in a Talend Job.

Use the Qubole connection in a Spark Job

Use the Qubole connection previously defined in the Repository in a Talend Job.

In this example, a Job for Apache Spark is used. This type of Job is available only in a subscription-based Talend solution with Big Data. If you are using a community version of the Studio, you can create a Standard Job (a traditional Data Integration Job), for example with the HDFS components, to use this Qubole connection.

Procedure

  1. Create a Spark Job and keep it open in the workspace of your Studio.
  2. Drop the Qubole HDFS connection from the Repository to this Spark Job. A component list pops up.
  3. Select the tHDFSConfiguration component.
  4. If prompted, click OK to accept the update of the Hadoop configuration.

    This will use your Qubole connection metadata to fill the fields in the Spark Configuration tab in the Run view of your Job.

  5. In the Spark Configuration tab, select the Spark version of your Qubole cluster. This version must be 2.0 or higher.
  6. Depending on how Spark has been implemented in your Qubole cluster, select YARN client or YARN cluster from the Spark mode drop-down list.

    If you are not sure of the mode to be selected, contact the administrator of your cluster.

  7. If you are using a JobServer, do the following to use it for the Job (a reachability check for the YARN client mode is sketched after this list).
    1. If you are using the YARN client mode, select the Define the driver hostname or IP address check box and, in the field that is displayed, enter the DNS address of the JobServer EC2 instance.
    2. If you are using the YARN cluster mode, leave the Define the driver hostname or IP address check box cleared.
    3. In the Target Exec tab of the Run view, select this JobServer.
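
In the YARN client mode, the Spark driver runs on the JobServer machine, so the nodes of the Qubole cluster must be able to resolve and reach that machine. The following is a minimal sanity check, assuming you can open a shell on a cluster node; ICMP may be blocked by your security groups, in which case the ping step is inconclusive.

  # Run on a node of the Qubole cluster; replace the DNS address with that of your JobServer EC2 instance.
  getent hosts ec2-54-161-123-198.compute-1.amazonaws.com   # name resolution
  ping -c 3 ec2-54-161-123-198.compute-1.amazonaws.com      # basic reachability (may be blocked)

If the driver host cannot be reached, review the VPC routing and the security group rules before running the Job.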

Results

The Qubole-related configuration for your Job is now complete. Once you finish developing your Job, you can run it.

Qubole support matrix

The following table presents the supported and unsupported items of the Qubole configuration zip.

The term "supported" means Talend went through a complete QA validation process.

 

On the Qubole cluster side

Supported:

  • Spark V2.0.2
  • Hadoop cluster: 2.6.0
  • Java 8
  • Hive V2.1.1

Unsupported:

  • Hive on Spark
  • Pig
  • HCatalog
  • HBase

On the Studio side

Supported:

  • Java 8
  • Components in a Standard Job: HDFS, Hive
  • MapReduce Jobs
  • Spark Jobs

Unsupported:

  • Hive on Spark
  • Pig
  • HCatalog
  • HBase
  • Redshift
  • DynamoDB
  • Hive Parquet file format in a Standard Job

Known issue: SPARK_HOME not found

When running a Spark Job, you may encounter the following issue.

[ERROR]: org.apache.spark.SparkContext - Error initializing SparkContext.
java.util.NoSuchElementException: key not found: SPARK_HOME

To resolve this issue, install a Spark client on the machine where the Job is executed. This machine is typically the one on which the JobServer is installed.

Procedure

  1. Stop the JobServer.

    If you directly use the Studio to run your Job, stop the Studio.

  2. Download the supported Spark version from Apache Spark.
    In this example, download
    Spark release = 2.0.2 (Nov 14 2016)
    package type = Pre-built for Apache Hadoop 2.6
  3. Upload the downloaded archive to the JobServer machine and extract it to the directory of your choice, for example /tmp/spark-2.0.2-bin-hadoop2.6.

    If you directly use the Studio to run your Job, perform these operations on the machine where your Studio is installed.

  4. Export the SPARK_HOME environment variable:
    If you have extracted the downloaded Spark archive to /tmp/spark-2.0.2-bin-hadoop2.6, the command to use is
    export SPARK_HOME=/tmp/spark-2.0.2-bin-hadoop2.6
    Export the variable in the same session (or in the startup script) that launches the JobServer; otherwise the JobServer will not see it. (The full command sequence is sketched at the end of this procedure.)
  5. Restart the JobServer.

    If you directly use the Studio to run your Job, restart the Studio.
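
For reference, the whole procedure can be scripted as follows on the JobServer machine. This is a sketch: the download URL points to the Apache archive for Spark 2.0.2 pre-built for Hadoop 2.6 and was valid at the time of writing, and the JobServer installation directory and script names are placeholders that depend on your installation.

  # Stop the JobServer (the script name may differ in your version).
  <JobServer installation directory>/stop_rs.sh

  # Download and extract the supported Spark release.
  cd /tmp
  wget https://archive.apache.org/dist/spark/spark-2.0.2/spark-2.0.2-bin-hadoop2.6.tgz
  tar -xzf spark-2.0.2-bin-hadoop2.6.tgz

  # Export SPARK_HOME in the same shell session that restarts the JobServer.
  export SPARK_HOME=/tmp/spark-2.0.2-bin-hadoop2.6

  # Restart the JobServer so that it picks up SPARK_HOME.
  <JobServer installation directory>/start_rs.sh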