Defining the Azure Synapse Analytics connection parameters - Cloud - 8.0

Spark Batch

Version: Cloud 8.0
Language: English
Product: Talend Big Data, Talend Big Data Platform, Talend Data Fabric, Talend Real-Time Big Data Platform
Module: Talend Studio
Content: Design and Development > Designing Jobs > Job Frameworks > Spark Batch
Last publication date: 2024-02-20

Complete the Azure Synapse Analytics connection configuration in the Spark configuration tab of the Run view of your Spark Batch Job running on Spark 3.1.x. This configuration is effective on a per-Job basis.

Only the Yarn cluster mode is available for this type of cluster.

The information in this section is only for users who have subscribed to Talend Data Fabric or to any Talend product with Big Data.

Important: Spark Pools is the only service of Azure Synapse Analytics supported for Spark Jobs in Talend Studio.

Before you begin

You must already have a Synapse workspace and an Apache Spark pool set up. For more information, see Creating a Synapse workspace and Create a new serverless Apache Spark pool using the Azure portal in the official Microsoft documentation.

Procedure

  1. Enter the basic configuration information to connect to Azure Synapse:
    Endpoint: Enter the Development endpoint from your Azure Synapse account. You can find it in the Overview section of your Azure Synapse workspace.
    Authorization token: Enter the token generated for your Azure Synapse account.
    Note: To generate a token, run the following command:
    curl -X POST -H "Content-Type: application/x-www-form-urlencoded" -d 'client_id=<YourClientID>&scope=https://dev.azuresynapse.net/.default&client_secret=<YourClientSecret>&grant_type=client_credentials' 'https://login.microsoftonline.com/<YourTenantID>/oauth2/v2.0/token'

    You can retrieve your Client ID, Client Secret and Tenant ID from your Azure Portal.

    Authentication to Azure Synapse is performed through an Azure Active Directory application. For more information on how to register an application in Azure Active Directory, see Use the portal to create an Azure AD application and service principal that can access resources in the official Microsoft documentation.

    Important: The token is valid for only one hour. After one hour, generate a new token; otherwise, you could get an error (401 - Not authorized).
    Apache Spark pools: Enter, in double quotation marks, the name of the Apache Spark pool to be used.
    Note: On Azure Synapse workspace side, make sure that:
    • the Autoscale option in Basic settings and the Automatic pausing option in Additional settings are enabled when creating an Apache Spark pool
    • the selected Apache Spark version is set to 3.0 (preview)
    Poll interval when retrieving Job status (in ms): Enter, without quotation marks, the time interval (in milliseconds) at which Talend Studio asks Spark for the status of your Job.

    The default value is 3000, meaning 3 seconds.

    Maximum number of consecutive statuses missing: Enter the maximum number of times Talend Studio retries to get a status when there is no status response.

    The default value is 10.
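For the Authorization token field, the curl command in the note and the one-hour validity can be sketched in Python as follows. This is a minimal illustration, not something Talend Studio provides: the function names `build_token_request` and `token_needs_refresh` and the five-minute safety margin are assumptions introduced here. The sketch only builds the request and does no network call.

```python
import time
from urllib.parse import urlencode

TOKEN_LIFETIME_S = 3600  # the token is valid for one hour, per the note in step 1


def build_token_request(tenant_id, client_id, client_secret):
    """Return (url, form_body) equivalent to the curl command in step 1."""
    url = f"https://login.microsoftonline.com/{tenant_id}/oauth2/v2.0/token"
    body = urlencode({
        "client_id": client_id,
        "scope": "https://dev.azuresynapse.net/.default",
        "client_secret": client_secret,
        "grant_type": "client_credentials",
    })
    return url, body


def token_needs_refresh(issued_at, now=None, margin_s=300):
    """True once the token is within margin_s seconds of the one-hour limit.

    margin_s is an assumed safety margin, not a documented value.
    """
    now = time.time() if now is None else now
    return now - issued_at >= TOKEN_LIFETIME_S - margin_s
```

The body would be sent as a POST with the `Content-Type: application/x-www-form-urlencoded` header, as in the curl command. Checking `token_needs_refresh` before each run is one way to avoid the 401 error mentioned above.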

  2. Enter the basic storage information of Azure Synapse:
    Authentication method: Select the authentication mode to be used from the drop-down list:
    • Secret Key
    • Azure Active Directory
    Storage: Select the storage to be used from the drop-down list.

    ADLS Gen2 is the default storage for an Azure Synapse Analytics workspace. If you are using Azure Active Directory authentication, make sure the application is linked to ADLS Gen2 and has been granted the Storage Blob Data Contributor role.

    Hostname: Enter the Primary ADLS Gen2 account from your Azure Synapse account. You can find it in the Overview section of your Azure Synapse workspace.
    Container: Enter the Primary ADLS Gen2 file storage from your Azure Synapse account. You can find it in the Overview section of your Azure Synapse workspace.
    Username: Enter the storage account name linked to your Azure Synapse workspace.

    This property is only available when you select Secret Key from the Authentication method drop-down list.

    Password: Enter the access keys linked to your Azure Synapse workspace.

    For more information about how to retrieve your access keys, see View account access keys from the official Microsoft documentation.

    This property is only available when you select Secret Key from the Authentication method drop-down list.

    Directory ID: Enter the directory ID linked to your Azure Active Directory application. You can retrieve your ID from the Azure Active Directory > Overview tab of your Azure portal.

    This property is only available when you select Azure Active Directory from the Authentication method drop-down list.

    Application ID: Enter the application ID linked to your Azure Active Directory application. You can retrieve your ID from the Azure Active Directory > Overview tab of your Azure portal.

    This property is only available when you select Azure Active Directory from the Authentication method drop-down list.

    Use certificate to authenticate: Select this check box to authenticate to your Azure Active Directory application using a certificate, and then enter the location in which the certificate is stored in the Path to certificate field.

    Make sure you upload the certificate in the Certificates & secrets > Certificates section of your Azure Active Directory application. For more information about certificates, see the official Microsoft documentation.

    This property is only available when you select Azure Active Directory from the Authentication method drop-down list.

    Client key: Enter the client key linked to your Azure Active Directory application. You can generate the client key from the Certificates & secrets tab of your Azure portal.

    This property is only available when you select Azure Active Directory from the Authentication method drop-down list and clear the Use certificate to authenticate check box.

    Deployment Blob: Enter the location in which you want to store the current Job and its dependent libraries in your storage.
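The availability rules in step 2 can be summarized as a small lookup. The sketch below is illustrative only: the helper names `required_storage_fields` and `missing_fields` are hypothetical, and it assumes that Storage, Hostname, Container and Deployment Blob apply to both authentication methods, which the text above implies but does not state outright.

```python
def required_storage_fields(auth_method, use_certificate=False):
    """List the step 2 fields to fill in for a given authentication method."""
    # Assumed to apply whichever authentication method is selected.
    common = ["Storage", "Hostname", "Container", "Deployment Blob"]
    if auth_method == "Secret Key":
        return common + ["Username", "Password"]
    if auth_method == "Azure Active Directory":
        # Client key is only shown when the certificate check box is cleared.
        credential = "Path to certificate" if use_certificate else "Client key"
        return common + ["Directory ID", "Application ID", credential]
    raise ValueError(f"Unknown authentication method: {auth_method!r}")


def missing_fields(settings, auth_method, use_certificate=False):
    """Return the required fields that are absent or empty in settings."""
    return [name
            for name in required_storage_fields(auth_method, use_certificate)
            if not settings.get(name)]
```

For example, calling `missing_fields` with a dictionary that lacks a Password entry and the Secret Key method returns a list containing only "Password".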
  3. Enter the basic configuration information:
    Use local timezone: Select this check box to let Spark use the local time zone provided by the system.
    Note:
    • If you clear this check box, Spark uses the UTC time zone.
    • Some components also have the Use local timezone for date check box. If you clear the check box in a component, the component inherits the time zone from the Spark configuration.
    Use dataset API in migrated components: Select this check box to let the components use the Dataset (DS) API instead of the Resilient Distributed Dataset (RDD) API:
    • If you select the check box, the components inside the Spark Batch Job run with DS, which improves performance.
    • If you clear the check box, the components inside the Spark Batch Job run with RDD, which means the Job remains unchanged. This ensures backward compatibility.

    This check box is selected by default. However, if you import a Job from version 7.3 or earlier, the check box is cleared because those Jobs run with RDD.

    Important: If your Spark Batch Job contains tDeltaLakeInput and tDeltaLakeOutput components, you must select this check box.
    Use timestamp for dataset components: Select this check box to use java.sql.Timestamp for dates.
    Note: If you leave this check box clear, java.sql.Timestamp or java.sql.Date can be used depending on the pattern.
    Batch size (ms): Enter the time interval at the end of which the Spark Streaming Job reviews the source data to identify changes and processes the new micro batches.
    Define a streaming timeout (ms): Select this check box and, in the field that is displayed, enter the time frame at the end of which the Spark Streaming Job automatically stops running.
    Note: If you are using Windows 10, it is recommended to set a reasonable timeout to prevent Windows Service Wrapper from having issues when sending termination signals from Java applications. If you encounter such an issue, you can also cancel the Job manually from your Azure Synapse workspace.
  4. Select the Set tuning properties check box to define the tuning parameters, by following the process explained in Tuning Spark for Apache Spark Batch Jobs.
    Important: You must define the tuning parameters; otherwise, you could get an error (400 - Bad request).
  5. In the Spark "scratch" directory field, enter the local directory in which Talend Studio stores temporary files, such as the JAR files to be transferred. If you launch the Job on Windows, the default disk is C:. As a result, if you leave /tmp in this field, the directory is C:/tmp.
  6. Select the Wait for the Job to complete check box to make Talend Studio or, if you use Talend JobServer, your Job JVM keep monitoring the Job until its execution is over. Selecting this check box sets the spark.yarn.submit.waitAppCompletion property to true. While it is generally useful to select this check box when running a Spark Batch Job, it makes more sense to keep it cleared when running a Spark Streaming Job.
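The monitoring described in step 6, combined with the Poll interval and Maximum number of consecutive statuses missing settings from step 1, can be pictured with the following sketch. It is not Talend Studio's implementation. `wait_for_job` and `fetch_status` are hypothetical names, the terminal state names are assumed, and stopping once the consecutive-miss count reaches the maximum is one possible reading of that setting.

```python
import time


def wait_for_job(fetch_status, poll_interval_ms=3000, max_missing=10,
                 sleep=time.sleep):
    """Poll fetch_status() until it returns a terminal state.

    fetch_status is a hypothetical callable that returns a state string,
    or None when no status response was received.
    """
    terminal_states = {"success", "dead", "killed", "error"}  # assumed names
    missing = 0
    while True:
        status = fetch_status()
        if status is None:
            missing += 1
            if missing >= max_missing:
                raise TimeoutError(
                    f"No status after {missing} consecutive attempts")
        else:
            missing = 0  # any response resets the consecutive counter
            if status in terminal_states:
                return status
        sleep(poll_interval_ms / 1000)  # 3000 ms is 3 seconds
```

The `sleep` parameter exists only so that the loop can be exercised without real waiting, for example by passing a function that records the requested delays.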

Results

You can retrieve the Job results in your Azure Synapse workspace using the Livy ID generated when running your Job.