Speeding up Job execution with Apache Spark on Yarn
Every time a Spark Job is launched, its dependencies are automatically transferred to the Yarn cluster in which the Job is executed. Uploading these dependencies manually avoids this time-consuming transfer and thus shortens the execution time of the Spark Job.
The procedure explained in this article applies only to Talend Jobs running on Spark 2.0 onwards. If you are using a Spark version prior to 2.0, see Upload the assembly file.
Uploading the dependencies and specifying the path
Use a tool such as PuTTY or an SSH client to copy the Spark dependencies to the HDFS system of the Yarn cluster to be used to execute your Spark Jobs.
By default, the dependencies to be uploaded are stored in spark_installation/jars.
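The upload can be sketched with the Hadoop file system shell. These commands must run against a live cluster; the local Spark path and the HDFS target directory below are assumptions, so adjust them to your environment:

```shell
# Assumed local Spark installation and HDFS target directory -- adjust as needed.
SPARK_HOME=/usr/lib/spark
HDFS_TARGET=/usr/lib/spark/jars

# Create the target directory on HDFS, then upload every dependency jar.
hdfs dfs -mkdir -p "$HDFS_TARGET"
hdfs dfs -put "$SPARK_HOME"/jars/*.jar "$HDFS_TARGET"/

# Verify that the jars are now present on HDFS.
hdfs dfs -ls "$HDFS_TARGET"
```

Because the jars then live on HDFS, Yarn distributes them from there instead of re-uploading them from the client machine at every Job launch.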
- In the Studio, open the Job you want to run.
- Double-click Run to open its view.
- Click the Spark configuration tab.
- In the Advanced properties table, click the plus symbol (+) to add a row.
- In the Property column, enter spark.yarn.jars in double quotation marks. This parameter specifies the jar files to be used by your Spark Job, as well as their paths in your cluster.
- In the Value column, enter the names of the dependency files and their directories in double quotation marks. For example, if these dependency files have been uploaded to the /usr/lib/spark/jars directory, enter "hdfs://clustername:clusterport/usr/lib/spark/jars/*.jar".
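Outside the Studio, the same property can be passed directly to spark-submit as a sketch of what the Spark configuration tab produces. The cluster name, port, and Job jar name below are placeholders, not values from this article:

```shell
# Placeholder cluster address and Job jar -- replace with your own values.
spark-submit \
  --master yarn \
  --conf spark.yarn.jars="hdfs://clustername:clusterport/usr/lib/spark/jars/*.jar" \
  your_spark_job.jar
```

With spark.yarn.jars set, Spark skips the step of packaging and uploading its runtime jars from the local installation, which is the transfer this article aims to eliminate.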