Defining the Cloudera Altus connection parameters - 7.3

Spark Batch


Complete the Altus connection configuration in the Spark configuration tab of the Run view of your Job. This configuration is effective on a per-Job basis.

Only the YARN cluster mode is available for this type of cluster.

The information in this section is only for users who have subscribed to Talend Data Fabric or to any Talend product with Big Data.

Before you begin

Prerequisites:

The Cloudera Altus client (Altus CLI) must be installed on the machine on which your Job is executed.

Procedure

  1. In the Spark configuration tab of the Run view of your Job, enter the basic connection information to Cloudera Altus.
    Use local timezone

    Select this check box to let Spark use the local timezone provided by the system. A minimal sketch after this step illustrates the Spark property this setting corresponds to.

    Note:
    • If you clear this check box, Spark uses the UTC timezone.
    • Some components also have a Use local timezone for date check box. If you clear that check box in the component, it inherits the timezone from the Spark configuration.

    Use dataset API in migrated components

    Select this check box to let the components use the Dataset (DS) API instead of the Resilient Distributed Dataset (RDD) API:
    • If you select the check box, the components inside the Job run with DS, which improves performance.
    • If you clear the check box, the components inside the Job run with RDD, which means the Job remains unchanged; this ensures backward compatibility.

    Important: If your Job contains tDeltaLakeInput and tDeltaLakeOutput components, you must select this check box.

    Note: Jobs newly created in 7.3 use DS by default, and Jobs imported from 7.3 or earlier use RDD by default. However, because not all components have been migrated from RDD to DS, it is recommended to clear this check box to avoid errors.

    Use timestamp for dataset components

    Select this check box to use java.sql.Timestamp for dates.

    Note: If you leave this check box clear, java.sql.Timestamp or java.sql.Date can be used depending on the pattern.

    Force Cloudera Altus credentials

    Select this check box to provide the credentials with your Job.

    If you want to provide the credentials separately, for example manually with the altus configure command in your terminal, clear this check box.

    Path to Cloudera Altus CLI

    Enter the path to the Cloudera Altus client, which must have been installed and activated on the machine on which your Job is executed. In production environments, this machine is typically a Talend JobServer.
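
    As a point of reference, here is a minimal sketch, written against the plain Spark API rather than Talend's generated code, of the standard Spark property that governs the session timezone described above: spark.sql.session.timeZone. The class name and sample timestamp are illustrative, and running it assumes a Spark distribution on the classpath.

    import org.apache.spark.sql.Dataset;
    import org.apache.spark.sql.Row;
    import org.apache.spark.sql.SparkSession;

    public class SessionTimezoneSketch {
        public static void main(String[] args) {
            SparkSession spark = SparkSession.builder()
                    .appName("SessionTimezoneSketch") // illustrative name
                    .master("local[*]")
                    // Clearing "Use local timezone" corresponds to forcing UTC;
                    // selecting it corresponds to the JVM default zone, e.g.
                    // .config("spark.sql.session.timeZone", java.util.TimeZone.getDefault().getID())
                    .config("spark.sql.session.timeZone", "UTC")
                    .getOrCreate();

            // The session timezone governs how timestamps are parsed and displayed.
            Dataset<Row> df = spark.sql("SELECT to_timestamp('2024-02-21 10:00:00') AS parsed");
            df.show(false);

            spark.stop();
        }
    }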

  2. Configure the virtual Cloudera cluster to be used.

    Use an existing Cloudera Altus cluster

    Select this check box to use a Cloudera Altus cluster already existing in your Cloud service. Otherwise, leave this check box clear to allow the Job to create a cluster on the fly.

    With this check box selected, only the Cluster name parameter is relevant; the other cluster configuration parameters are hidden.

    Cluster name

    Enter the name of the cluster to be used.

    Environment

    Enter the name of the Cloudera Altus environment that describes the resources allocated to the given cluster.

    If you do not know which environment to select, contact your Cloudera Altus administrator.

    Delete cluster after execution

    Select this check box if you want to remove the given cluster after the execution of your Job.

    Override with a JSON configuration

    Select this check box to configure the cluster manually by editing JSON code in the Custom JSON field that appears. A hypothetical sketch of such a payload follows this step.

    Instance type

    Select the instance type for the instances in the cluster. All nodes that are deployed in this cluster use the same instance type.

    Worker node

    Enter the number of worker nodes to be created for the cluster.

    For details about the allowed number of worker nodes, see the documentation of Cloudera Altus.

    Cloudera Manager username and Cloudera Manager password

    Enter the authentication information for your Cloudera Manager service.

    SSH private key

    Browse to, or enter the path to, the SSH private key to upload and register it in the region specified in the Cloudera Altus environment.

    The Data Engineering service of Cloudera Altus uses this private key to access and configure instances of the cluster to be used.

    Custom bootstrap script

    If you want to create the cluster with a bootstrap script of your own, browse to, or enter the path to, this script in the Custom bootstrap script field.

    For an example of an Altus bootstrap script, see Install a custom Python environment when creating a cluster from the Cloudera documentation.
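
    For illustration only, a Custom JSON override might look like the sketch below. The field names are assumptions inferred from the parameters described in this step, not the authoritative schema; refer to the Cloudera Altus documentation for the exact structure, and replace the angle-bracket placeholders with your own values.

    {
      "clusterName": "<CLUSTER_NAME>",
      "environmentName": "<ALTUS_ENVIRONMENT>",
      "serviceType": "SPARK",
      "instanceType": "<INSTANCE_TYPE>",
      "workersGroupSize": <WORKER_NODE_COUNT>,
      "clouderaManagerUsername": "<CM_USERNAME>",
      "clouderaManagerPassword": "<CM_PASSWORD>"
    }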

  3. From the Cloud provider list, select the Cloud service that runs your Cloudera Altus cluster.
    • If your cloud provider is AWS, select AWS and define the Amazon S3 directory in which you store your Job dependencies.

      AWS

      • Access key and Secret key: enter the authentication information required to connect to the Amazon S3 bucket to be used.

        To enter the secret key, click the [...] button next to the field, enter the value between double quotes in the pop-up dialog box, and click OK to save the settings.

      • Specify the AWS region by selecting a region name from the list or by entering a region between double quotation marks, for example "us-east-1". For more information about AWS regions, see Regions and Endpoints.

      • S3 bucket name: enter the name of the bucket to be used to store the dependencies of your Job. This bucket must already exist.

      • S3 storage path: enter the directory in which you want to store the dependencies of your Job in this given bucket, for example, altus/jobjar. This directory is created if it does not exist at runtime.

      The Amazon S3 location you specify here is used to store your Job dependencies only. To connect to the S3 system that hosts your actual data, use a tS3Configuration component in your Job, as illustrated in the sketch after this step.

    • If your cloud provider is Azure, select Azure to store your Job dependencies in your Azure Data Lake Storage.

      1. In your Azure portal, assign the Read/Write/Execute permissions to the Azure application to be used by the Job to access your Azure Data Lake Storage. For details about how to assign permissions, see Azure documentation: Assign the Azure AD application to the Azure Data Lake Storage account file or folder.

        Without appropriate permissions, your Job dependencies cannot be transferred to your Azure Data Lake Storage.

      2. In your Altus console, identify the Data Lake Storage AAD Group Name used by your Altus environment in the Instance Settings section.

      3. In your Azure portal, assign the Read/Write/Execute permissions to this AAD group using the same procedure explained in Azure documentation: Assign the Azure AD application to the Azure Data Lake Storage account file or folder.

        Without appropriate permissions, your Job dependencies cannot be transferred to your Azure Data Lake Storage.

      4. In the Spark configuration tab, configure the connection to your Azure Data Lake Storage.

        Azure (technical preview)

        • ADLS account FQDN:

          Enter the address, without the scheme part, of the Azure Data Lake Storage account to be used, for example ychendls.azuredatalakestore.net.

          This account must already exist in your Azure portal.

        • Azure App ID and Azure App key:

          In the Client ID and the Client key fields, enter, respectively, the authentication ID and the authentication key generated when you registered the application that your Job uses to access Azure Data Lake Storage.

          This application must be the one to which you assigned permissions to access your Azure Data Lake Storage in the previous step.

        • Token endpoint:

          In the Token endpoint field, copy-paste the OAuth 2.0 token endpoint, which you can obtain from the Endpoints list on the App registrations page of your Azure portal.

      The Azure Data Lake Storage you specify here is used to store your Job dependencies only. To connect to the Azure system that hosts your actual data, use a tAzureFSConfiguration component in your Job; see the sketch after this step.
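
    For orientation, the sketch below shows, in plain Spark code rather than Talend's generated code, the kind of Hadoop-level properties that the storage fields above correspond to. All angle-bracket values are placeholders, the class name is illustrative, and running it assumes the relevant Hadoop connector libraries (hadoop-aws, hadoop-azure-datalake) are on the classpath; in a Talend Job, you would configure the actual data connection through tS3Configuration or tAzureFSConfiguration instead.

    import org.apache.spark.sql.SparkSession;

    public class StorageCredentialsSketch {
        public static void main(String[] args) {
            SparkSession spark = SparkSession.builder()
                    .appName("StorageCredentialsSketch") // illustrative name
                    .master("local[*]")
                    // Amazon S3 (s3a connector): counterparts of Access key / Secret key.
                    .config("spark.hadoop.fs.s3a.access.key", "<ACCESS_KEY>")
                    .config("spark.hadoop.fs.s3a.secret.key", "<SECRET_KEY>")
                    // ADLS Gen1 OAuth: counterparts of Azure App ID, Azure App key, and Token endpoint.
                    .config("spark.hadoop.dfs.adls.oauth2.access.token.provider.type", "ClientCredential")
                    .config("spark.hadoop.dfs.adls.oauth2.client.id", "<AZURE_APP_ID>")
                    .config("spark.hadoop.dfs.adls.oauth2.credential", "<AZURE_APP_KEY>")
                    .config("spark.hadoop.dfs.adls.oauth2.refresh.url", "<TOKEN_ENDPOINT>")
                    .getOrCreate();

            // With the properties above, actual data (as opposed to Job dependencies)
            // could be read directly, for example:
            // spark.read().text("s3a://<BUCKET>/<PATH>").show();
            // spark.read().text("adl://<ACCOUNT>.azuredatalakestore.net/<PATH>").show();

            spark.stop();
        }
    }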

  4. Select the Wait for the Job to complete check box to make your Studio or, if you use Talend JobServer, your Job JVM keep monitoring the Job until its execution is over. Selecting this check box sets the spark.yarn.submit.waitAppCompletion property to true, as the sketch below illustrates. While it is generally useful to select this check box when running a Spark Batch Job, it makes more sense to leave it clear when running a Spark Streaming Job.
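
    A minimal sketch of what this check box amounts to at the Spark level; the class name is illustrative:

    import org.apache.spark.SparkConf;

    public class WaitAppCompletionSketch {
        public static void main(String[] args) {
            // Selecting "Wait for the Job to complete" sets the property to "true":
            // the submitting JVM keeps polling YARN until the application finishes.
            SparkConf conf = new SparkConf()
                    .setAppName("WaitAppCompletionSketch") // illustrative name
                    .set("spark.yarn.submit.waitAppCompletion", "true");

            // Clearing the check box corresponds to "false": the submitter returns
            // as soon as YARN accepts the application, which suits long-running
            // Spark Streaming Jobs.
            System.out.println(conf.get("spark.yarn.submit.waitAppCompletion"));
        }
    }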

Results