Speeding up Job execution with Apache Spark on Yarn

EnrichVersion
6.5
EnrichProdName
Talend Real-Time Big Data Platform
Talend Data Fabric
Talend Big Data
Talend Big Data Platform
task
Design and Development > Designing Jobs > Hadoop distributions
Design and Development > Designing Jobs > Job Frameworks > Spark Batch
Design and Development > Designing Jobs > Job Frameworks > Spark Streaming
EnrichPlatform
Talend Studio

Speeding up Job execution with Apache Spark on Yarn

Every time a Spark Job is launched, its dependencies are automatically transferred to the Yarn cluster in which the Job is executed. Uploading these dependencies manually avoids this time-consuming transfer and thus shortens the execution time of the Spark Job.

The procedure explained in this article applies only to Talend Jobs running on Spark 2.0 onwards. If you are using a Spark version prior to 2.0, see Upload the assembly file.

Uploading the dependencies and specifying the path

Upload the Spark dependencies to the HDFS system of the Yarn cluster and point the spark.yarn.jars property to them.

Procedure

  1. Use a tool such as PuTTY or an SSH client to copy the Spark dependencies to the HDFS system of the Yarn cluster to be used to execute your Spark Jobs, as shown in the first sketch after this procedure.
    By default, the dependencies to be uploaded are stored in spark_installation/jars.
  2. In the Studio, open the Job you want to run.
  3. Double-click Run to open its view.
  4. Click the Spark Configuration tab.
  5. In the Advanced properties table, click the plus button (+) to add a row.
  6. In the Property column, enter spark.yarn.jars, in double quotation marks. This property lists the jar files to be used by your Spark Job and the paths to them in your cluster.
  7. In the Value column, enter the names of the dependency files and their directory, in double quotation marks. For example, if these dependency files have been uploaded to the /usr/lib/spark/jars directory, enter "/usr/lib/spark/jars/*". The resulting row is shown in the second sketch after this procedure.
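
As an illustration of step 1, here is a minimal sketch of the upload, assuming you are logged in to a node of the cluster and that the HDFS target directory is /usr/lib/spark/jars; both the spark_installation placeholder and the target path are examples to adapt to your environment:

    # Create the target directory on HDFS (path is an example).
    hdfs dfs -mkdir -p /usr/lib/spark/jars
    # Copy every jar shipped with the local Spark installation to HDFS.
    hdfs dfs -put spark_installation/jars/*.jar /usr/lib/spark/jars/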
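
As an illustration of steps 5 to 7, and assuming the jars were uploaded to /usr/lib/spark/jars as in the sketch above, the resulting row in the Advanced properties table would read as follows; the double quotation marks are part of the values entered in the Studio:

    Property: "spark.yarn.jars"
    Value:    "/usr/lib/spark/jars/*"

Depending on how the default filesystem is configured on your cluster, you may need to make the HDFS scheme explicit in the value, for example "hdfs:///usr/lib/spark/jars/*", so that Spark does not look for the jars on the local filesystem.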