Defining Kubernetes connection parameters with Spark Universal - Cloud - 8.0

Talend Data Fabric Studio User Guide

About this task

Complete the Spark Universal connection configuration with Kubernetes on Spark 3.1.x in the Spark configuration tab of the Run view of your Spark Job. This configuration is effective on a per-Job basis.

The information in this section applies only to users who have subscribed to Talend Data Fabric or to any Talend product with Big Data. It does not apply to Talend Open Studio for Big Data users.

Procedure

  1. Click the Run view beneath the design workspace, then click the Spark configuration view.
  2. Select Built-in from the Property type drop-down list.
    If you have already set up the connection parameters in the Repository as explained in Centralizing a Hadoop connection, you can easily reuse them. To do this, select Repository from the Property type drop-down list, then click the […] button to open the Repository Content dialog box and select the Hadoop connection to be used.
    Tip: Setting up the connection in the Repository allows you to avoid configuring that connection each time you need it in the Spark configuration view of your Jobs. The fields are automatically filled.
  3. Select Universal from the Distribution drop-down list, Spark 3.1.x from the Version drop-down list and Kubernetes from the Runtime mode/environment drop-down list.
  4. Complete the Kubernetes configuration parameters:
    Kubernetes master: Enter the API server address in the following format: k8s://https://<k8s-apiserver-host>:<k8s-apiserver-port>. You can retrieve the address by running the kubectl config view --minify command in your command-line interface.
    Number of executor instances: Enter the number of executors to be used for the Job execution.
    Use registry secret: Enter the password to access the Docker image, if needed.
    Docker image: Enter the name of the Docker image to be used for the execution.
    Namespace: Enter the namespace of the Docker cluster.
    Service account: Enter the name of the service account to be used. The service account must have sufficient rights on the Kubernetes cluster.
    Cloud storage: Select the cloud provider you want to use from the drop-down list and enter the information and credentials in the corresponding fields.
    Cloud storage > S3: Set the following parameters to connect to S3:
    • Bucket
    • Path to folder
    • Credentials type
    • Access key
    • Secret key
    Cloud storage > Blob: Set the following parameters to connect to Azure Blob Storage:
    • Path to folder
    • Storage account
    • Container name
    • Secret key
    Cloud storage > Adls gen 2: Set the following parameters to connect to ADLS Gen 2:
    • Path to folder
    • Storage account
    • Credentials type
    • Container name
    • Secret key
  5. Enter the basic Configuration information:
    Use local timezone: Select this check box to let Spark use the local timezone provided by the system.
    Note:
    • If you clear this check box, Spark uses the UTC timezone.
    • Some components also have the Use local timezone for date check box. If you clear that check box in a component, the component inherits the timezone from the Spark configuration.
    Use dataset API in migrated components: Select this check box to let the components use the Dataset (DS) API instead of the Resilient Distributed Dataset (RDD) API:
    • If you select the check box, the components inside the Job run with DS, which improves performance.
    • If you clear the check box, the components inside the Job run with RDD, which means the Job remains unchanged. This ensures backward compatibility.
    Important: If your Job contains tDeltaLakeInput and tDeltaLakeOutput components, you must select this check box.
    Note: By default, Jobs newly created in 7.3 or later use DS, and Jobs imported from 7.3 or earlier use RDD. However, not all components have been migrated from RDD to DS, so it is recommended to clear the check box by default to avoid errors.
    Use timestamp for dataset components: Select this check box to use java.sql.Timestamp for dates.
    Note: If you leave this check box cleared, either java.sql.Timestamp or java.sql.Date is used, depending on the pattern.
  6. In the Advanced properties table, add any Spark properties you need to override the defaults used by the Studio.
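
The Kubernetes master value in step 4 must follow the k8s://https://<k8s-apiserver-host>:<k8s-apiserver-port> format, while kubectl config view --minify reports the API server as a plain https:// URL in its server field. The following Python sketch shows one way to convert and check the value before entering it in the Studio. The function name to_kubernetes_master and the sample address are hypothetical and are not part of Talend Studio or kubectl.

```python
from urllib.parse import urlsplit


def to_kubernetes_master(server_url: str) -> str:
    """Convert an API server URL (https://host:port) to the
    k8s://https://host:port form expected by the Kubernetes master field.

    Hypothetical helper for illustration only.
    """
    parts = urlsplit(server_url.strip())
    # The documented format requires https and an explicit port.
    if parts.scheme != "https":
        raise ValueError(f"expected an https URL, got: {server_url!r}")
    if not parts.hostname or parts.port is None:
        raise ValueError(f"host and port are both required: {server_url!r}")
    if parts.username is not None:
        raise ValueError("credentials must not be embedded in the URL")
    # netloc is 'host:port'; any trailing path is dropped.
    return f"k8s://https://{parts.netloc}"


if __name__ == "__main__":
    # Sample address only; use the server value reported for your own cluster.
    print(to_kubernetes_master("https://192.168.49.2:8443"))
```

A URL without a port, or one that uses http, raises ValueError instead of producing a value that does not match the documented format.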

Results

The connection details to the Kubernetes cluster are complete, and you are ready to schedule executions of your Job or to run it immediately from this cluster.
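
For readers who want to relate the fields in the procedure to open-source Spark, the Python sketch below collects the equivalent standard Spark-on-Kubernetes property names into a dictionary. The property names are those documented by Apache Spark; whether the Studio sets exactly these properties internally is an assumption that this guide does not confirm. The function name spark_k8s_properties and all sample values are hypothetical.

```python
def spark_k8s_properties(master: str, executors: int, image: str,
                         namespace: str, service_account: str) -> dict:
    """Build a dict of standard Spark properties that correspond to the
    Kubernetes fields of the Spark configuration view.

    Assumption: the field-to-property mapping is illustrative only.
    """
    if not master.startswith("k8s://"):
        raise ValueError("the master must start with k8s://")
    if executors < 1:
        raise ValueError("at least one executor instance is required")
    return {
        "spark.master": master,                       # Kubernetes master
        "spark.executor.instances": str(executors),   # Number of executor instances
        "spark.kubernetes.container.image": image,    # Docker image
        "spark.kubernetes.namespace": namespace,      # Namespace
        # Service account used by the driver pod
        "spark.kubernetes.authenticate.driver.serviceAccountName": service_account,
    }


if __name__ == "__main__":
    # All values below are placeholders, not defaults of any product.
    props = spark_k8s_properties(
        master="k8s://https://192.168.49.2:8443",
        executors=2,
        image="example-registry/spark-job:latest",
        namespace="spark-jobs",
        service_account="spark",
    )
    for key in sorted(props):
        print(f"{key}={props[key]}")
```

A key/value listing of this kind can serve as a checklist when you add overrides to the Advanced properties table.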