Defining the Cloudera Altus connection parameters (technical preview)

Spark Batch

author
Talend Documentation Team
EnrichVersion
6.5
EnrichProdName
Talend Big Data
Talend Data Fabric
Talend Real-Time Big Data Platform
Talend Big Data Platform
task
Design and Development > Designing Jobs > Job Frameworks > Spark Batch
EnrichPlatform
Talend Studio

Complete the Altus connection configuration in the Spark configuration tab of the Run view of your Job. This configuration is effective on a per-Job basis.

Only the Yarn cluster mode is available for this type of cluster.

The information in this section applies only to users who have subscribed to Talend Data Fabric or to any Talend product with Big Data. It does not apply to Talend Open Studio for Big Data users.

Before you begin

The Cloudera Altus client, Altus CLI, must be installed on the machine on which your Job is executed.
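
As a quick sanity check of this prerequisite, the following minimal sketch looks for the Altus CLI on the PATH of the current machine. It assumes the executable is named altus, as in the altus configure command mentioned in the procedure; the function name find_altus_cli is hypothetical and not part of any Talend or Cloudera tooling.

```python
import shutil


def find_altus_cli(executable="altus"):
    """Return the absolute path of the Altus CLI, or None if it is not on PATH.

    The default executable name "altus" is an assumption based on the
    `altus configure` command; adjust it if your installation differs.
    """
    return shutil.which(executable)


if __name__ == "__main__":
    path = find_altus_cli()
    if path is None:
        print("Altus CLI not found on PATH; install it before running the Job.")
    else:
        # This is the kind of value expected in the "Path to Cloudera Altus CLI" field.
        print(f"Altus CLI found at: {path}")
```

Run this on the machine that executes the Job (typically a Talend Jobserver in production), because that is where the client must be available.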

Procedure

  1. In the Spark configuration tab of the Run view of your Job, enter the basic connection information for Cloudera Altus.

    Force Cloudera Altus credentials

    Select this check box to provide the credentials with your Job.

    If you want to provide the credentials separately, for example manually using the command altus configure in your terminal, clear this check box.

    Path to Cloudera Altus CLI

    Enter the path to the Cloudera Altus client, which must have been installed and activated on the machine on which your Job is executed. In production environments, this machine is typically a Talend Jobserver.

  2. Configure the virtual Cloudera cluster to be used.

    Use an existing Cloudera Altus cluster

    Select this check box to use a Cloudera Altus cluster that already exists in your Cloud service. Otherwise, leave this check box clear to allow the Job to create a cluster on the fly.

    With this check box selected, only the Cluster name parameter is needed and the other parameters for the cluster configuration are hidden.

    Cluster name

    Enter the name of the cluster to be used.

    Environment

    Enter the name of the Cloudera Altus environment to be used to describe the resources allocated to the given cluster.

    If you do not know which environment to select, contact your Cloudera Altus administrator.

    Delete cluster after execution

    Select this check box if you want to remove the given cluster after the execution of your Job.

    Override with a JSON configuration

    Select this check box to manually edit JSON code in the Custom JSON field that is displayed to configure the cluster.

    Instance type

    Select the instance type for the instances in the cluster. All nodes that are deployed in this cluster use the same instance type.

    Worker node

    Enter the number of worker nodes to be created for the cluster.

    For details about the allowed number of worker nodes, see the documentation of Cloudera Altus.

    Cloudera Manager username and Cloudera Manager password

    Enter the authentication information for your Cloudera Manager service.

    SSH private key

    Browse to, or enter the path to, the SSH private key in order to upload and register it in the region specified in the Cloudera Altus environment.

    The Data Engineering service of Cloudera Altus uses this private key to access and configure instances of the cluster to be used.

  3. From the Cloud provider list, select the Cloud service that runs your Cloudera Altus cluster. Currently, only AWS is available.

    AWS

    • Access key and Secret key: enter the authentication information required to connect to the Amazon S3 bucket to be used.

      To enter the secret key, click the [...] button next to the secret key field, and then in the pop-up dialog box enter the secret key between double quotes and click OK to save the settings.

    • Specify the AWS region by selecting a region name from the list or by entering a region name between double quotation marks (for example, "us-east-1"). For more information about AWS regions, see Regions and Endpoints.

    • S3 bucket name: enter the name of the bucket to be used to store the dependencies of your Job. This bucket must already exist.

    • S3 storage path: enter the directory in which you want to store the dependencies of your Job in this bucket, for example, altus/jobjar. If this directory does not exist, it is created at runtime.

    The Amazon S3 bucket you specify here is used to store your Job dependencies only. To connect to the S3 system which hosts your actual data, use a tS3Configuration component in your Job.
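
To illustrate how the S3 bucket name and S3 storage path values relate, the following minimal sketch combines them into a single location and shows the double-quoted form used for manually entered values such as the region. The helper names dependency_location and quoted are hypothetical, and the s3:// scheme is an assumption for illustration only; the Studio may resolve the location differently. The bucket name check follows the general Amazon S3 naming rules (3 to 63 characters; lowercase letters, digits, dots, and hyphens; starting and ending with a letter or digit).

```python
import re

# General-purpose S3 bucket naming rule (simplified): 3-63 characters,
# lowercase letters, digits, dots and hyphens, starting and ending with
# a letter or digit.
_BUCKET_RE = re.compile(r"[a-z0-9][a-z0-9.-]{1,61}[a-z0-9]")


def dependency_location(bucket, storage_path):
    """Combine a bucket name and a storage path into one location string.

    The "s3://" scheme is an assumption used for illustration.
    """
    if not _BUCKET_RE.fullmatch(bucket):
        raise ValueError(f"Invalid S3 bucket name: {bucket!r}")
    # Drop leading and trailing slashes so the parts join cleanly.
    prefix = storage_path.strip("/")
    return f"s3://{bucket}/{prefix}" if prefix else f"s3://{bucket}"


def quoted(value):
    """Wrap a value in double quotation marks, as expected when typing it in a field."""
    return '"' + value + '"'


if __name__ == "__main__":
    # "my-talend-deps" is a placeholder bucket name; altus/jobjar is the
    # example path from the field description.
    print(dependency_location("my-talend-deps", "altus/jobjar"))
    print(quoted("us-east-1"))
```

The bucket must already exist, whereas the directory given as the storage path is created at runtime if needed, so only the bucket name is validated in this sketch.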

Results

After the connection is configured, you can optionally tune the Spark performance by following the process explained in: