Tuning Spark for Apache Spark Batch Jobs - 6.5

Spark Batch

author
Talend Documentation Team
EnrichVersion
6.5
EnrichProdName
Talend Big Data
Talend Big Data Platform
Talend Data Fabric
Talend Real-Time Big Data Platform
task
Design and Development > Designing Jobs > Job Frameworks > Spark Batch
EnrichPlatform
Talend Studio

You can define tuning parameters in the Spark configuration tab of the Run view of your Spark Job to obtain better performance of the Job, if the default values of these parameters do not produce sufficient performance.

Generally speaking, Spark performs better with a small number of large tasks than with a large number of small tasks.

The information in this section is only for users who have subscribed to Talend Data Fabric or to any Talend product with Big Data; it is not applicable to Talend Open Studio for Big Data users.

Procedure

  1. Select the Set Tuning properties check box to optimize the allocation of the resources to be used to run this Job. These properties are not mandatory for the Job to run successfully, but they are useful when Spark is bottlenecked by a resource issue in the cluster, such as CPU, bandwidth, or memory.
  2. Calculate the initial resource allocation as the starting point for tuning.
    A generic formula for this calculation is:
    • Number of executors = (Total cores of the cluster) / 2

    • Number of cores per executor = 2

    • Memory per executor = (up to the total memory of the cluster) / (Number of executors)
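
    The formula above can be sketched as follows. The cluster figures used in the example call (40 cores, 160 GB of memory) are hypothetical; substitute the actual capacity of your cluster.

```python
def initial_allocation(total_cores, total_memory_gb):
    """Starting point for Spark tuning, per the generic formula above.

    Returns (number of executors, cores per executor,
    memory per executor in GB).
    """
    num_executors = total_cores // 2       # Total cores of the cluster / 2
    cores_per_executor = 2                 # Fixed by the formula
    memory_per_executor_gb = total_memory_gb / num_executors
    return num_executors, cores_per_executor, memory_per_executor_gb

# Example: a hypothetical cluster with 40 cores and 160 GB of memory.
print(initial_allocation(40, 160))  # (20, 2, 8.0)
```

    Revise these initial figures in the following step until the Job reaches satisfactory performance.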

  3. Define each parameter and, if needed, revise them until you obtain satisfactory performance.

    Spark Standalone mode

    • Driver memory and Driver core: enter the allocation size of memory and the number of cores to be used by the driver of the current Job.

    • Executor memory: enter the allocation size of memory to be used by each Spark executor.

    • Core per executor: select this check box and in the displayed field, enter the number of cores to be used by each executor. If you leave this check box clear, the default allocation defined by Spark is used; in Standalone mode, for example, all available cores are used by one single executor.

    • Set Web UI port: if you need to change the default port of the Spark Web UI, select this check box and enter the port number you want to use.

    • Broadcast factory: select the broadcast implementation to be used to cache variables on each worker machine.

    • Customize Spark serializer: if you need to import an external Spark serializer, select this check box and in the field that is displayed, enter the fully qualified class name of the serializer to be used.
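
    The Standalone-mode fields above correspond to standard Spark configuration properties. The following sketch maps each field to its property (property names per the Spark documentation; the values are example figures only, not recommendations):

```python
# Hypothetical example values; each entry is annotated with the
# Spark configuration tab field it corresponds to.
standalone_conf = {
    "spark.driver.memory": "2g",    # Driver memory
    "spark.driver.cores": "1",      # Driver core
    "spark.executor.memory": "8g",  # Executor memory
    "spark.executor.cores": "2",    # Core per executor
    "spark.ui.port": "4041",        # Set Web UI port
    # Broadcast factory
    "spark.broadcast.factory":
        "org.apache.spark.broadcast.TorrentBroadcastFactory",
    # Customize Spark serializer
    "spark.serializer": "org.apache.spark.serializer.KryoSerializer",
}

for key, value in sorted(standalone_conf.items()):
    print(f"{key}={value}")
```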

    Spark Yarn client mode

    • Executor memory: enter the allocation size of memory to be used by each Spark executor.

    • Set executor memory overhead: select this check box and in the field that is displayed, enter the amount of off-heap memory (in MB) to be allocated per executor. This corresponds to the spark.yarn.executor.memoryOverhead property.

    • Core per executor: select this check box and in the displayed field, enter the number of cores to be used by each executor. If you leave this check box clear, the default allocation defined by Spark is used; in Yarn mode, for example, each executor uses one core by default.

    • Yarn resource allocation: select how you want Yarn to allocate resources among executors.
      • Auto: you let Yarn use its default number of executors, which is 2.

      • Fixed: you need to enter the number of executors to be used in the Num executors field that is displayed.

      • Dynamic: Yarn adapts the number of executors to suit the workload. You need to define the scale of this dynamic allocation by entering the initial number of executors in the Initial executors field, the minimum number of executors in the Min executors field, and the maximum number of executors in the Max executors field.

    • Set Web UI port: if you need to change the default port of the Spark Web UI, select this check box and enter the port number you want to use.

    • Broadcast factory: select the broadcast implementation to be used to cache variables on each worker machine.

    • Customize Spark serializer: if you need to import an external Spark serializer, select this check box and in the field that is displayed, enter the fully qualified class name of the serializer to be used.
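
    The Yarn-client-mode fields above likewise correspond to standard Spark configuration properties. The following sketch shows the mapping for the Dynamic resource allocation case (property names per the Spark documentation; the values are example figures only, not recommendations):

```python
# Hypothetical example values; each entry is annotated with the
# Spark configuration tab field it corresponds to.
yarn_client_conf = {
    "spark.executor.memory": "8g",                  # Executor memory
    "spark.yarn.executor.memoryOverhead": "1024",   # Overhead, in MB
    "spark.executor.cores": "2",                    # Core per executor
    # Yarn resource allocation set to Dynamic:
    "spark.dynamicAllocation.enabled": "true",
    "spark.dynamicAllocation.initialExecutors": "2",   # Initial executors
    "spark.dynamicAllocation.minExecutors": "1",       # Min executors
    "spark.dynamicAllocation.maxExecutors": "10",      # Max executors
    # With Fixed allocation, you would instead set a single property,
    # for example: "spark.executor.instances": "5"   # Num executors
}

for key, value in sorted(yarn_client_conf.items()):
    print(f"{key}={value}")
```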