Defining the AWS Qubole connection parameters for Spark Jobs - 7.3

Version: 7.3
Language: English
Product: Talend Big Data, Talend Big Data Platform, Talend Data Fabric, Talend Real-Time Big Data Platform
Module: Talend Studio
Content: Design and Development > Designing Jobs > Job Frameworks > Spark Streaming
Last publication date: 2024-02-21

Complete the Qubole connection configuration in the Spark configuration tab of the Run view of your Job. This configuration is effective on a per-Job basis.

Qubole is supported only in the traditional data integration framework (the Standard framework) and in the Spark frameworks.

The information in this section is only for users who have subscribed to Talend Data Fabric or to any Talend product with Big Data.

Before you begin

  • You have properly set up your Qubole cluster on AWS. For further information about how to do this, see Getting Started with Qubole on AWS from the Qubole documentation.
  • Ensure that the Qubole account to be used has the proper IAM role that is allowed to read/write to the S3 bucket to be used. For further details, contact the administrator of your Qubole system or see Cross-account IAM Role for QDS from the Qubole documentation.
  • Ensure that the AWS account to be used has the proper read/write permissions to the S3 bucket to be used. For this purpose, contact the administrator of your AWS system.

Procedure

  1. Enter the basic configuration information:
    Use local timezone: Select this check box to let Spark use the local timezone provided by the system (see the timezone sketch after this procedure).
    Note:
    • If you clear this check box, Spark uses the UTC timezone.
    • Some components also have the Use local timezone for date check box. If you clear that check box in the component, the component inherits the timezone from the Spark configuration.
    Use dataset API in migrated components: Select this check box to let the components use the Dataset (DS) API instead of the Resilient Distributed Dataset (RDD) API (see the Dataset sketch after this procedure):
    • If you select the check box, the components inside the Job run with DS, which improves performance.
    • If you clear the check box, the components inside the Job run with RDD, which means the Job remains unchanged and backward compatibility is preserved.
    Important: If your Job contains tDeltaLakeInput and tDeltaLakeOutput components, you must select this check box.
    Note: Jobs created in 7.3 use DS by default, while Jobs imported from 7.3 or earlier use RDD by default. However, not all components have been migrated from RDD to DS, so it is recommended to clear this check box to avoid errors.
    Use timestamp for dataset components: Select this check box to use java.sql.Timestamp for dates.
    Note: If you leave this check box clear, java.sql.Timestamp or java.sql.Date is used, depending on the pattern.
  2. Enter the basic connection information to Qubole:

    Connection configuration

    • Click the ... button next to the API Token field to enter the authentication token generated for the Qubole user account to be used. For further information about how to obtain this token, see Manage Qubole account from the Qubole documentation.

      This token allows you to specify the user account you want to use to access Qubole. Your Job automatically uses the rights and permissions granted to this user account in Qubole.

    • Select the Cluster label check box and enter the name of the Qubole cluster to be used. If you leave this check box clear, the default cluster is used.

      If you need details about your default cluster, ask the administrator of your Qubole service. You can also refer to the Qubole documentation for more information about configuring a default Qubole cluster.

    • Select the Change API endpoint check box and select the region to be used. If you leave this check box clear, the default region is used.

      For further information about the Qubole Endpoints supported on QDS-on-AWS, see Supported Qubole Endpoints on Different Cloud Providers.

  3. Configure the connection to the S3 file system to be used to temporarily store the dependencies of your Job so that your Qubole cluster has access to these dependencies.
    This configuration is used for your Job dependencies only (see the S3 sketch after this procedure). Use a tS3Configuration component in your Job to write your actual business data to S3 with Qubole. Without tS3Configuration, this business data is written to the Qubole HDFS system and destroyed once you shut down your cluster.
    • Access key and Secret key: enter the authentication information required to connect to the Amazon S3 bucket to be used.

      To enter the password, click the [...] button next to the password field, and then in the pop-up dialog box enter the password between double quotes and click OK to save the settings.

    • Bucket name: enter the name of the bucket in which you want to store the dependencies of your Job. This bucket must already exist on S3.
    • Temporary resource folder: enter the directory in which you want to store the dependencies of your Job. For example, enter temp_resources to write the dependencies in the /temp_resources folder in the bucket.

      If this folder already exists at runtime, its contents are overwritten by the upcoming dependencies; otherwise, this folder is automatically created.

    • Region: specify the AWS region by selecting a region name from the list. For more information about the AWS Region, see Regions and Endpoints.
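
The Use local timezone option in step 1 determines which timezone Spark applies when handling date and time values. The following sketch is plain Spark Java code, not code generated by Talend Studio; it only illustrates how the standard spark.sql.session.timeZone property produces the same effect when you work against the Spark API directly. The application name, the local[*] master, and the exact property mapping used by the Studio are assumptions made for illustration.

import org.apache.spark.sql.SparkSession;

public class TimezoneSketch {
    public static void main(String[] args) {
        // Build a Spark session; "local[*]" is used here only for illustration.
        SparkSession spark = SparkSession.builder()
                .appName("timezone-sketch")
                .master("local[*]")
                // Equivalent of clearing "Use local timezone": force UTC for date/time handling.
                .config("spark.sql.session.timeZone", "UTC")
                .getOrCreate();

        // Equivalent of selecting "Use local timezone": use the JVM's default timezone instead.
        // spark.conf().set("spark.sql.session.timeZone", java.util.TimeZone.getDefault().getID());

        spark.sql("SELECT current_timestamp() AS now").show(false);
        spark.stop();
    }
}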
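
The Use dataset API in migrated components option decides whether components run on the RDD API or on the Dataset API. The sketch below is not Studio-generated code; it simply contrasts the two Spark APIs in plain Java to make the performance and compatibility trade-off concrete. The sample values, application name, and local[*] master are illustrative assumptions.

import java.util.Arrays;
import java.util.List;

import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.api.java.function.FilterFunction;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Encoders;
import org.apache.spark.sql.SparkSession;

public class RddVersusDatasetSketch {
    public static void main(String[] args) {
        SparkSession spark = SparkSession.builder()
                .appName("rdd-vs-dataset-sketch")
                .master("local[*]")
                .getOrCreate();

        List<Integer> values = Arrays.asList(1, 2, 3, 4, 5);

        // RDD API (check box cleared): low-level, untyped transformations,
        // not optimized by the Catalyst engine.
        JavaSparkContext jsc = new JavaSparkContext(spark.sparkContext());
        JavaRDD<Integer> evenRdd = jsc.parallelize(values).filter(v -> v % 2 == 0);
        System.out.println("RDD result: " + evenRdd.collect());

        // Dataset API (check box selected): typed, declarative operations that
        // Spark can optimize, which is where the performance gain comes from.
        Dataset<Integer> evenDs = spark.createDataset(values, Encoders.INT())
                .filter((FilterFunction<Integer>) v -> v % 2 == 0);
        System.out.println("Dataset result: " + evenDs.collectAsList());

        spark.stop();
    }
}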
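
When the Job is submitted, the S3 fields described in step 3 end up as S3 connection settings for the underlying file system layer. The sketch below sets the standard Hadoop s3a properties by hand, assuming the s3a connector is used and that the hadoop-aws module is on the classpath; the credentials, endpoint, bucket, and folder names are placeholders, and the configuration actually generated by the Studio may differ.

import org.apache.spark.sql.SparkSession;

public class S3DependencySketch {
    public static void main(String[] args) {
        // Placeholder credentials; replace with your own values.
        String accessKey = "MY_ACCESS_KEY";
        String secretKey = "MY_SECRET_KEY";

        SparkSession spark = SparkSession.builder()
                .appName("s3-dependency-sketch")
                .master("local[*]")
                // Standard Hadoop s3a credentials, the counterpart of the
                // Access key and Secret key fields.
                .config("spark.hadoop.fs.s3a.access.key", accessKey)
                .config("spark.hadoop.fs.s3a.secret.key", secretKey)
                // Region-specific endpoint, the counterpart of the Region list;
                // "s3.eu-west-1.amazonaws.com" is only an example.
                .config("spark.hadoop.fs.s3a.endpoint", "s3.eu-west-1.amazonaws.com")
                .getOrCreate();

        // Write a small test dataset under the temporary resource folder of the bucket
        // (bucket and folder names are hypothetical).
        spark.range(10)
                .write()
                .mode("overwrite")
                .parquet("s3a://my-bucket/temp_resources/connection_check");

        spark.stop();
    }
}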

Results

  • After the connection is configured, you can optionally tune Spark performance by following the Spark performance tuning process described in the Talend documentation.
  • If you need the Job to be resilient to failure, select the Activate checkpointing check box to enable the Spark checkpointing operation. In the field that is displayed, enter the directory in the cluster's file system in which Spark stores the context data of the computation, such as its metadata and generated RDDs (see the checkpointing sketch below).

    For more information about the Spark checkpointing operation, see the official Spark documentation.
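
For reference, outside the Studio the Spark checkpointing operation amounts to pointing the streaming context at a reliable directory and letting Spark persist the computation's metadata and generated RDDs there. The sketch below uses the JavaStreamingContext API with a hypothetical checkpoint path and a placeholder socket source; in a Talend Job, the directory you enter in the Spark configuration tab plays the same role.

import org.apache.spark.SparkConf;
import org.apache.spark.streaming.Durations;
import org.apache.spark.streaming.api.java.JavaStreamingContext;

public class CheckpointSketch {
    public static void main(String[] args) throws InterruptedException {
        // Hypothetical checkpoint location; in practice this should be a reliable,
        // cluster-visible file system such as HDFS or S3.
        final String checkpointDir = "hdfs:///user/talend/checkpoints/my_streaming_job";

        // getOrCreate either restores the context from existing checkpoint data
        // or builds a fresh one with the supplied factory function.
        JavaStreamingContext jssc = JavaStreamingContext.getOrCreate(checkpointDir, () -> {
            // local[2] so the receiver and the processing each get a thread; illustrative only.
            SparkConf conf = new SparkConf()
                    .setAppName("checkpoint-sketch")
                    .setMaster("local[2]");
            JavaStreamingContext context = new JavaStreamingContext(conf, Durations.seconds(10));

            // Placeholder source and output so the context has a computation to checkpoint;
            // replace with your real streaming logic.
            context.socketTextStream("localhost", 9999).print();

            // Enable checkpointing of metadata and generated RDDs.
            context.checkpoint(checkpointDir);
            return context;
        });

        jssc.start();
        jssc.awaitTermination();
    }
}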