Selecting the Spark mode - 7.2

Talend Data Fabric Getting Started Guide

author: Talend Documentation Team
EnrichVersion: 7.2
EnrichProdName: Talend Data Fabric
task: Data Quality and Preparation > Cleansing data; Data Quality and Preparation > Profiling data; Design and Development; Installation and Upgrade
EnrichPlatform: Talend Administration Center; Talend DQ Portal; Talend Installer; Talend Runtime; Talend Studio

Depending on the Spark cluster to be used, select a Spark mode for your Job.

The Spark documentation provides an exhaustive list of Spark properties and their default values on its Spark Configuration page. A Spark Job designed in the Studio uses this default configuration, except for the properties you explicitly define in the Spark Configuration tab or in the components used in your Job.
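
The override behavior described above can be pictured as a simple merge. The following is a minimal sketch in Python, not the Studio's actual implementation: the function name effective_config and the job_settings values are hypothetical, and only a small subset of Spark's documented defaults is shown.

```python
# Illustrative subset of the defaults listed in the Spark documentation.
SPARK_DEFAULTS = {
    "spark.executor.memory": "1g",
    "spark.driver.memory": "1g",
    "spark.serializer": "org.apache.spark.serializer.JavaSerializer",
}


def effective_config(defaults, explicit):
    """Return the defaults, with every explicitly defined property overriding them."""
    merged = dict(defaults)
    merged.update(explicit)
    return merged


# Hypothetical value a user might enter in the Spark Configuration tab.
job_settings = {"spark.executor.memory": "4g"}

config = effective_config(SPARK_DEFAULTS, job_settings)
print(config["spark.executor.memory"])  # value defined for the Job
print(config["spark.driver.memory"])    # falls back to the Spark default
```

Any property that is not listed in job_settings keeps its default value, which is the behavior the paragraph above describes.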

Procedure

  1. Click Run to open the Run view, then click the Spark Configuration tab to configure the Spark connection.
  2. Select the Use local mode check box to test your Job locally.

    In local mode, the Studio builds the Spark environment within itself on the fly to run the Job. Each processor of the local machine is used as a Spark worker to perform the computations.

    In this mode, your local file system is used. Therefore, if your Job contains configuration components such as tS3Configuration or tHDFSConfiguration that provide connection information to a remote file system, deactivate them.

    You can launch your Job without any further configuration.
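
For readers familiar with plain Spark, local mode corresponds to a master URL of the form local[N], where local[*] asks Spark to use one worker thread per logical core. The sketch below only builds such a URL. The helper name local_master_url is hypothetical, and whether the Studio passes exactly this value to Spark is an assumption, not something this guide states.

```python
import os


def local_master_url(workers=None):
    """Build a Spark local-mode master URL.

    With no argument, return "local[*]" (one worker thread per logical core).
    With a positive integer, pin the number of worker threads instead.
    """
    if workers is None:
        return "local[*]"
    if workers < 1:
        raise ValueError("workers must be a positive integer")
    return f"local[{workers}]"


print(local_master_url())                     # use every available core
print(local_master_url(os.cpu_count() or 1))  # same idea, with an explicit count
```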

  3. Clear the Use local mode check box to display the list of available Hadoop distributions, and from this list select the distribution that corresponds to your Spark cluster.
    This distribution could be:
    • Databricks

    • Qubole

    • Amazon EMR

      For this distribution, Talend supports:
      • Yarn client

      • Yarn cluster

    • Cloudera

      For this distribution, Talend supports:
      • Standalone

      • Yarn client

      • Yarn cluster

    • Google Cloud Dataproc

      For this distribution, Talend supports:
      • Yarn client

    • Hortonworks

      For this distribution, Talend supports:
      • Yarn client

      • Yarn cluster

    • MapR

      For this distribution, Talend supports:
      • Standalone

      • Yarn client

      • Yarn cluster

    • Microsoft HD Insight

      For this distribution, Talend supports:
      • Yarn cluster

    • Cloudera Altus

      For this distribution, Talend supports:
      • Yarn cluster

        Your Altus cluster should run on one of the following cloud providers:
        • Azure

          The support for Altus on Azure is a technical preview feature.

        • AWS
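
The support list above can be captured as a small lookup table, which may be handy when checking a planned deployment against it. This is a sketch only: the names SUPPORTED_MODES and is_supported are hypothetical, and the empty lists for Databricks and Qubole simply reflect that the list above names no modes for them.

```python
# Modes named in the list above, keyed by distribution.
SUPPORTED_MODES = {
    "Databricks": [],  # no modes named in the list above
    "Qubole": [],      # no modes named in the list above
    "Amazon EMR": ["Yarn client", "Yarn cluster"],
    "Cloudera": ["Standalone", "Yarn client", "Yarn cluster"],
    "Google Cloud Dataproc": ["Yarn client"],
    "Hortonworks": ["Yarn client", "Yarn cluster"],
    "MapR": ["Standalone", "Yarn client", "Yarn cluster"],
    "Microsoft HD Insight": ["Yarn cluster"],
    "Cloudera Altus": ["Yarn cluster"],
}


def is_supported(distribution, mode):
    """Return True if the list above names `mode` for `distribution`."""
    if distribution not in SUPPORTED_MODES:
        raise KeyError(f"{distribution!r} is not in the list")
    return mode in SUPPORTED_MODES[distribution]


print(is_supported("Cloudera", "Standalone"))
print(is_supported("Hortonworks", "Standalone"))
```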

    Because a Job relies on Avro to move data among its components, it is recommended to set your cluster to use Kryo to handle the Avro types. This not only helps avoid a known Avro issue but also brings inherent performance gains. The Spark property to set in your cluster is:
    spark.serializer org.apache.spark.serializer.KryoSerializer
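
To check whether a cluster already has this property set, you could scan its spark-defaults.conf file. The following is a minimal sketch that assumes the common whitespace-separated "key value" layout with # comments. The helper names and the sample content are hypothetical, and the location of the file depends on your cluster.

```python
KRYO = "org.apache.spark.serializer.KryoSerializer"


def parse_spark_defaults(text):
    """Parse whitespace-separated 'key value' lines, skipping blanks and # comments."""
    props = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue
        parts = line.split(None, 1)
        if len(parts) == 2:
            props[parts[0]] = parts[1].strip()
    return props


def uses_kryo(text):
    """Return True if spark.serializer is set to the Kryo serializer."""
    return parse_spark_defaults(text).get("spark.serializer") == KRYO


# Hypothetical file content, for illustration only.
sample = """\
# Example cluster configuration
spark.master      yarn
spark.serializer  org.apache.spark.serializer.KryoSerializer
"""
print(uses_kryo(sample))
```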

    If you cannot find your distribution in this drop-down list, the distribution you want to connect to is not officially supported by Talend. In this situation, select Custom, then select the Spark version of the cluster to connect to and click the [+] button to display a dialog box in which you can do either of the following:

    1. Select Import from existing version to import an officially supported distribution as base and then add other required jar files which the base distribution does not provide.

    2. Select Import from zip to import the configuration zip for the custom distribution to be used. This zip file should contain the libraries of the different Hadoop/Spark elements and the index file of these libraries.

      In Talend Exchange, members of the Talend community have shared some ready-to-use configuration zip files, which you can download from the Hadoop configuration list and use directly in your connection. However, because the different Hadoop-related projects keep evolving, you might not find the configuration zip corresponding to your distribution in this list. In that case, it is recommended to use the Import from existing version option to take an existing distribution as a base and add the jars required by your distribution.

      Note that custom versions are not officially supported by Talend. Talend and its community provide you with the opportunity to connect to custom versions from the Studio but cannot guarantee that the configuration of whichever version you choose will be easy. As such, you should only attempt to set up such a connection if you have sufficient Hadoop and Spark experience to handle any issues on your own.

    For a step-by-step example of how to connect to a custom distribution and share this connection, see Hortonworks.