Defining the Azure Synapse Analytics connection parameters - 7.3

Spark Batch

Version
7.3
Language
English (United States)
Product
Talend Big Data
Talend Big Data Platform
Talend Data Fabric
Talend Open Studio for Big Data
Talend Real-Time Big Data Platform
Module
Talend Studio

Complete the Azure Synapse Analytics connection configuration in the Spark configuration tab of the Run view of your Job. This configuration is effective on a per-Job basis.

Only the Yarn cluster mode is available for this type of cluster.

The information in this section applies only to users who have subscribed to Talend Data Fabric or to any Talend product with Big Data. It does not apply to Talend Open Studio for Big Data users.

Important: Spark Pools is the only service of Azure Synapse Analytics supported for Spark Jobs in Talend Studio.

Before you begin

You must already have a Synapse workspace and an Apache Spark pool set up. For more information, see Creating a Synapse workspace and Create a new serverless Apache Spark pool using the Azure portal in the official Microsoft documentation.

Procedure

  1. Enter the basic connection information to Azure Synapse:
    Endpoint Enter the Development endpoint from your Azure Synapse account. You can find it in the Overview section of your Azure Synapse workspace.
    Authorization token Enter the token generated for your Azure Synapse account.
    Note: To generate a token, run the following command:
    curl -X POST -H "Content-Type: application/x-www-form-urlencoded" -d 'client_id=<YourClientID>&scope=https://dev.azuresynapse.net/.default&client_secret=<YourClientSecret>&grant_type=client_credentials' 'https://login.microsoftonline.com/<YourTenantID>/oauth2/v2.0/token'

    You can retrieve your Client ID, Client Secret and Tenant ID from your Azure Portal.

    Authentication to Azure Synapse is performed through an Azure Active Directory application. For more information on how to register an application in Azure Active Directory, see Use the portal to create an Azure AD application and service principal that can access resources in the official Microsoft documentation.

    Important: The token is valid for one hour only. After one hour, you must generate a new token; otherwise, you could get an error (401 - Not authorized).
    Apache Spark pools Enter, in double quotation marks, the name of the Apache Spark Pool to be used.
    Note: On the Azure Synapse workspace side, make sure that:
    • the Autoscale option in Basic settings and the Automatic pausing option in Additional settings are enabled when creating an Apache Spark pool
    • the selected Apache Spark version is set to 3.0 (preview)
    Poll interval when retrieving Job status (in ms) Enter, without the quotation marks, the time interval (in milliseconds) at the end of which you want the Studio to ask Spark for the status of your Job.

    The default value is 3000, meaning 3 seconds.

    Maximum number of consecutive statuses missing Enter the maximum number of times the Studio should retry to get a status when there is no status response.

    The default value is 10.
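The token request from the Note above can also be prepared programmatically. The following is a minimal Python sketch that mirrors the curl command, not an official Talend or Microsoft client. The function names `build_token_request` and `token_needs_refresh`, and the five-minute safety margin, are assumptions made for illustration. Only the URL, the form fields, and the one-hour validity come from this section.

```python
import time
from urllib.parse import urlencode

# The section states that the token is valid for one hour.
TOKEN_LIFETIME_S = 3600


def build_token_request(tenant_id, client_id, client_secret):
    """Return (url, form_body) equivalent to the curl command in the Note."""
    url = f"https://login.microsoftonline.com/{tenant_id}/oauth2/v2.0/token"
    body = urlencode({
        "client_id": client_id,
        "scope": "https://dev.azuresynapse.net/.default",
        "client_secret": client_secret,
        "grant_type": "client_credentials",
    })
    return url, body


def token_needs_refresh(issued_at, now=None, margin_s=300):
    """Return True once the token is within margin_s seconds of the one-hour limit.

    margin_s is a hypothetical safety margin, not a documented setting.
    """
    now = time.time() if now is None else now
    return now - issued_at >= TOKEN_LIFETIME_S - margin_s
```

To send the request, the body can be posted with `urllib.request.Request(url, data=body.encode(), method="POST")`. Checking `token_needs_refresh` before launching a Job is one way to avoid the 401 error described above.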

  2. Enter the basic storage information of Azure Synapse:
    Storage Select the storage to be used in the drop-down list.

    ADLS Gen2 is the default storage for an Azure Synapse Analytics workspace.

    Hostname Enter the Primary ADLS Gen2 account from your Azure Synapse account. You can find it in the Overview section of your Azure Synapse workspace.
    Container Enter the Primary ADLS Gen2 file storage from your Azure Synapse account. You can find it in the Overview section of your Azure Synapse workspace.
    Username Enter the storage account name linked to your Azure Synapse workspace.
    Password Enter the access keys linked to your Azure Synapse workspace.

    For more information about how to retrieve your access keys, see View account access keys in the official Microsoft documentation.

    Deployment Blob Enter the location in which you want to store the current Job and its dependent libraries in your storage.
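To see how the Hostname, Container, and Deployment Blob values relate to one another, the sketch below composes them into an `abfss://` URI, the scheme used by the Hadoop ABFS driver for ADLS Gen2. Whether the Studio builds exactly this URI internally is an assumption. The function name `deployment_uri` and the sample values are hypothetical.

```python
def deployment_uri(container, hostname, deployment_blob):
    """Compose an abfss:// URI from the Container, Hostname and Deployment Blob values.

    Assumes hostname has the form <account>.dfs.core.windows.net.
    """
    # Normalize the path so that leading or trailing slashes do not produce "//".
    path = deployment_blob.strip("/")
    return f"abfss://{container}@{hostname}/{path}"


# Example with made-up values:
print(deployment_uri("jobs", "myaccount.dfs.core.windows.net", "/talend/deploy/"))
```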
  3. Enter the basic configuration information:
    Use local timezone Select this check box to let Spark use the local timezone provided by the system.
    Note:
    • If you clear this check box, Spark uses the UTC timezone.
    • Some components also have the Use local timezone for date check box. If you clear the check box in a component, the component inherits the timezone from the Spark configuration.
    Use dataset API in migrated components Select this check box to let the components use Dataset (DS) API instead of Resilient Distributed Dataset (RDD) API:
    • If you select the check box, the components inside the Spark Batch Job run with DS which improves performance.
    • If you clear the check box, the components inside the Spark Batch Job run with RDD, which means the Job remains unchanged. This ensures backward compatibility.
    Important: If your Spark Batch Job contains tDeltaLakeInput and tDeltaLakeOutput components, you must select this check box.
    Note: By default, newly created Spark Batch Jobs in 7.3 use DS, and Jobs imported from 7.3 or earlier use RDD. However, not all components have been migrated from RDD to DS, so it is recommended to clear the check box to avoid errors.
    Use timestamp for dataset components Select this check box to use java.sql.Timestamp for dates.
    Note: If you leave this check box clear, java.sql.Timestamp or java.sql.Date can be used depending on the pattern.
    Batch size (ms) Enter the time interval at the end of which the Spark Streaming Job reviews the source data to identify changes and processes the new micro batches.
    Define a streaming timeout (ms) Select this check box and in the field that is displayed, enter the time frame at the end of which the Spark Streaming Job automatically stops running.
    Note: If you are using Windows 10, it is recommended to set a reasonable timeout to prevent Windows Service Wrapper from encountering issues when sending termination signals from Java applications. If you face such an issue, you can also manually cancel the Job from your Azure Synapse workspace.
  4. Select the Set tuning properties check box to define the tuning parameters, by following the process explained in:
    Important: You must define the tuning parameters; otherwise, you could get an error (400 - Bad request).
  5. In the Spark "scratch" directory field, enter the local directory in which the Studio stores temporary files, such as the jar files to be transferred. If you launch the Job on Windows, the default disk is C:. So if you leave /tmp in this field, this directory is C:/tmp.
  6. Select the Wait for the Job to complete check box to make your Studio or, if you use Talend Jobserver, your Job JVM keep monitoring the Job until the execution of the Job is over. By selecting this check box, you actually set the spark.yarn.submit.waitAppCompletion property to true. While it is generally useful to select this check box when running a Spark Batch Job, it makes more sense to keep this check box clear when running a Spark Streaming Job.
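When the Job is monitored until completion, the poll interval and the maximum number of consecutive missing statuses from step 1 control how the status is checked. The following Python sketch illustrates one plausible reading of those two settings. It is not the Studio's actual implementation. The `fetch_status` callable, the set of terminal state names, and the choice to give up once the number of consecutive missing statuses exceeds the maximum are all assumptions.

```python
import time


def wait_for_job(fetch_status, poll_interval_ms=3000, max_missing=10, sleep=time.sleep):
    """Poll fetch_status() until a terminal state is reached.

    fetch_status is a hypothetical callable that returns a state string,
    or None when no status response was received.
    """
    # Illustrative terminal states; the real state names may differ.
    terminal = {"success", "dead", "killed", "error"}
    missing = 0
    while True:
        status = fetch_status()
        if status is None:
            missing += 1
            if missing > max_missing:
                raise TimeoutError(f"no status after {max_missing} consecutive retries")
        else:
            # Any valid response resets the consecutive-miss counter.
            missing = 0
            if status in terminal:
                return status
        # The interval is configured in milliseconds; time.sleep expects seconds.
        sleep(poll_interval_ms / 1000)
```

With the default values, this sketch waits 3 seconds between polls and tolerates up to 10 consecutive missing statuses before raising an error. Passing a custom `sleep` makes the loop easy to exercise without real delays.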

Results

You can retrieve the Job results on your Azure Synapse workspace with the Livy ID generated when running your Job.
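If you prefer to look up a Job outside the workspace UI, the Livy ID can be combined with the Development endpoint and the Spark pool name to form a REST URL. The sketch below only builds that URL. It assumes the Synapse Spark batch REST path `livyApi/versions/<version>/sparkPools/<pool>/batches/<id>` and the API version `2019-11-01-preview`, both of which you should verify against the Microsoft REST API reference. The function name `livy_batch_url` and the sample values are hypothetical.

```python
def livy_batch_url(endpoint, spark_pool, livy_id, api_version="2019-11-01-preview"):
    """Build the URL of a Spark batch job from the Development endpoint and the Livy ID.

    The path layout and default api_version are assumptions to check against
    the official REST API reference before use.
    """
    base = endpoint.rstrip("/")
    return f"{base}/livyApi/versions/{api_version}/sparkPools/{spark_pool}/batches/{livy_id}"


# Example with made-up values:
print(livy_batch_url("https://myworkspace.dev.azuresynapse.net", "mypool", 42))
```

A GET request to this URL would need the same kind of bearer token as the one entered in the Authorization token field.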