Big Data
Big Data Platform
Cloud Big Data
Cloud Big Data Platform
Cloud Data Fabric
Data Fabric
Real-Time Big Data Platform
About this task
Talend Studio connects to a Yarn cluster to run the Job from this cluster.
Complete the Spark Universal connection configuration in Yarn cluster mode on either Spark 2.4.x, 3.0.x or 3.1.x in the Spark configuration tab of the Run view of your Spark Job. This configuration is effective on a per-Job basis.
Procedure
- Click the Run view beneath the design workspace, then click the Spark configuration view.
- Select Built-in from the Property type drop-down list.
If you have already set up the connection parameters in the Repository as explained in Centralizing a Hadoop connection, you can reuse them. To do this, select Repository from the Property type drop-down list, then click the […] button to open the Repository Content dialog box and select the Hadoop connection to be used.
Tip: Setting up the connection in the Repository allows you to avoid configuring that connection each time you need it in the Spark configuration view of your Jobs. The fields are automatically filled.
- Select Universal from the Distribution drop-down list, any Spark version from the Version drop-down list and Yarn cluster from the Runtime mode/environment drop-down list.
- Specify the path to the Hadoop configuration JAR file that provides the connection parameters of the Yarn cluster you want to use. The JAR file contains all the necessary information to establish the connection with all the *-site.xml files of the cluster.
The JAR file must include the following XML files:
  - hdfs-site.xml
  - core-site.xml
  - yarn-site.xml
  - mapred-site.xml
If you use Hive or HBase components, the JAR file must also include the corresponding XML files:
  - hive-site.xml
  - hbase-site.xml
- If you need to launch your Spark Job from Windows, specify where the winutils.exe program to be used is stored:
  - If you know where to find your winutils.exe file and you want to use it, select the Define the Hadoop home directory check box and enter the directory where your winutils.exe is stored.
  - Otherwise, leave the Define the Hadoop home directory check box clear. The Studio then generates one by itself and automatically uses it for this Job.
- Enter the basic configuration information:
Use local timezone: Select this check box to let Spark use the local timezone provided by the system.
Note:
  - If you clear this check box, Spark uses the UTC timezone.
  - Some components also have the Use local timezone for date check box. If you clear the check box in the component, it inherits the timezone from the Spark configuration.
Use dataset API in migrated components: Select this check box to let the components use the Dataset (DS) API instead of the Resilient Distributed Dataset (RDD) API:
  - If you select the check box, the components inside the Job run with DS, which improves performance.
  - If you clear the check box, the components inside the Job run with RDD, which means the Job remains unchanged. This ensures backward compatibility.
Important: If your Job contains tDeltaLakeInput and tDeltaLakeOutput components, you must select this check box.
Note: Newly created Jobs in 7.3 or later use DS, and Jobs imported from 7.3 or earlier use RDD by default. However, not all components have been migrated from RDD to DS, so it is recommended to keep the check box cleared by default to avoid errors.
Use timestamp for dataset components: Select this check box to use java.sql.Timestamp for dates.
Note: If you leave this check box clear, java.sql.Timestamp or java.sql.Date can be used depending on the pattern.
- In the Spark "scratch" directory field, enter the directory in which the Studio stores, in the local system, temporary files such as the JAR files to be transferred. If you launch the Job on Windows, the default disk is C:. So if you leave /tmp in this field, this directory is C:/tmp.
- If you need the Job to be resilient to failure, select the Activate checkpointing check box to enable the Spark checkpointing operation. In the Checkpoint directory field, enter the directory in which Spark stores, in the file system of the cluster, the context data of the computations such as the metadata and the generated RDDs of this computation.
- In the Advanced properties table, add any Spark properties you need to use to override their default counterparts used by the Studio.
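Before pointing the Spark configuration at a Hadoop configuration JAR, it can help to confirm that the archive contains the XML files listed in the procedure. The following is a minimal sketch and not part of Talend Studio: the function name `missing_site_files` and its parameters are hypothetical, and it assumes the JAR is a standard ZIP archive. It compares base names only, so it does not verify where the files sit inside the archive.

```python
import zipfile

# Files the procedure lists as mandatory in the Hadoop configuration JAR.
REQUIRED = {"hdfs-site.xml", "core-site.xml", "yarn-site.xml", "mapred-site.xml"}
# Extra files, expected only when the Job uses Hive or HBase components.
HIVE_FILE = "hive-site.xml"
HBASE_FILE = "hbase-site.xml"


def missing_site_files(jar_path, uses_hive=False, uses_hbase=False):
    """Return the sorted names of expected *-site.xml files absent from the JAR.

    Hypothetical helper for illustration; it is not a Talend or Hadoop API.
    """
    expected = set(REQUIRED)
    if uses_hive:
        expected.add(HIVE_FILE)
    if uses_hbase:
        expected.add(HBASE_FILE)
    with zipfile.ZipFile(jar_path) as jar:
        # Keep only the base name of each entry (assumption: the location of
        # the files inside the archive is not checked here).
        present = {name.rsplit("/", 1)[-1] for name in jar.namelist()}
    return sorted(expected - present)
```

For example, calling `missing_site_files("hadoop-conf.jar", uses_hive=True)` on a hypothetical archive named `hadoop-conf.jar` returns an empty list when all five expected files are present, and otherwise the names of the files still to be added.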