Tuning Spark for Apache Spark Batch Jobs - 7.3

Spark Batch

Version
7.3
Language
English
Product
Talend Big Data
Talend Big Data Platform
Talend Data Fabric
Talend Real-Time Big Data Platform
Module
Talend Studio
Content
Design and Development > Designing Jobs > Job Frameworks > Spark Batch
Last publication date
2024-02-21

You can define tuning parameters in the Spark configuration tab of the Run view of your Spark Job to obtain better performance from the Job, if the default values of these parameters do not produce sufficient performance.

Generally speaking, Spark performs better with a small number of large tasks than with a large number of small tasks.

The information in this section is only for users who have subscribed to Talend Data Fabric or to any Talend product with Big Data.

Procedure

  1. Select the Set Tuning properties check box to optimize the allocation of the resources used to run this Job. These properties are not mandatory for the Job to run successfully, but they are useful when Spark is bottlenecked by a resource issue in the cluster, such as CPU, bandwidth, or memory.
  2. Calculate the initial resource allocation to use as the starting point for the tuning.
    A generic formula for this calculation is as follows (a worked example is shown after the list):
    • Number of executors = (Total cores of the cluster) / 2

    • Number of cores per executor = 2

    • Memory per executor = up to (Total memory of the cluster) / (Number of executors)
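
    For illustration, the following sketch applies this formula to a hypothetical cluster with 40 cores and 160 GB of memory in total; the figures are placeholders, not recommendations for your cluster.

      // Minimal sketch of the starting-point formula above, applied to a
      // hypothetical cluster (40 cores, 160 GB of memory in total).
      val totalCores = 40                                     // total cores in the cluster (assumed)
      val totalMemoryGb = 160                                 // total memory in the cluster, in GB (assumed)
      val numExecutors = totalCores / 2                       // 40 / 2 = 20 executors
      val coresPerExecutor = 2                                // fixed starting value
      val memoryPerExecutorGb = totalMemoryGb / numExecutors  // up to 160 / 20 = 8 GB per executor
      println(s"$numExecutors executors, $coresPerExecutor cores and $memoryPerExecutorGb GB each")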

  3. Define each parameter and, if needed, revise them until you obtain satisfactory performance.
    The following lists provide the exhaustive set of tuning properties for each mode. The actual properties available in the Spark configuration tab may vary depending on the distribution you are using. A short configuration sketch follows each mode's list to show the standard Spark properties the fields roughly correspond to.

    Spark Standalone mode

    • Driver memory and Driver core: enter the allocation size of memory and the number of cores to be used by the driver of the current Job.

    • Executor memory: enter the allocation size of memory to be used by each Spark executor.

    • Core per executor: select this check box and, in the displayed field, enter the number of cores to be used by each executor. If you leave this check box clear, the default allocation defined by Spark is used; for example, in the Standalone mode, all available cores are used by a single executor.

    • Set Web UI port: if you need to change the default port of the Spark Web UI, select this check box and enter the port number you want to use.

    • Broadcast factory: select the broadcast implementation to be used to cache variables on each worker machine.

    • Customize Spark serializer: if you need to import an external Spark serializer, select this check box and in the field that is displayed, enter the fully qualified class name of the serializer to be used.

    • Job progress polling rate (in ms): when using Spark V2.3 and onwards, enter the time interval (in milliseconds) at which the Studio asks Spark for the execution progress of your Job. Before V2.3, Spark automatically sends this information to the Studio when updates occur; the default value of this parameter, 50 milliseconds, allows the Studio to reproduce roughly the same behavior with Spark V2.3 and onwards.

      If you set this interval too long, you may miss progress information; if you set it too short, you may send too many requests to Spark for insignificant progress updates.
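
    The fields above roughly correspond to standard Spark configuration properties. The following sketch names those properties with placeholder values; it is an illustrative mapping under that assumption, not the code the Studio generates for the Job.

      // Illustrative only: standard Spark properties that roughly correspond to
      // the Standalone-mode tuning fields above. All values are placeholders.
      import org.apache.spark.SparkConf

      val conf = new SparkConf()
        .setMaster("spark://master-host:7077")  // hypothetical Standalone master URL
        .setAppName("tuning_sketch")
        .set("spark.driver.memory", "2g")       // Driver memory
        .set("spark.driver.cores", "1")         // Driver core
        .set("spark.executor.memory", "4g")     // Executor memory
        .set("spark.executor.cores", "2")       // Core per executor (check box selected)
        .set("spark.ui.port", "4050")           // Set Web UI port (default is 4040)
        .set("spark.serializer",                // Customize Spark serializer
          "org.apache.spark.serializer.KryoSerializer")

    Note that the driver memory and core settings must be in place before the driver JVM starts, so outside the Studio they are typically passed through spark-submit or spark-defaults.conf rather than set in application code.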

    Spark Yarn client mode

    • Set application master tuning properties: select this check box and in the fields that are displayed, enter the amount of memory and the number of CPUs to be allocated to the ApplicationMaster service of your cluster.

      If you want to use the default allocation of your cluster, leave this check box clear.

    • Executor memory: enter the allocation size of memory to be used by each Spark executor.

    • Set executor memory overhead: select this check box and in the field that is displayed, enter the amount of off-heap memory (in MB) to be allocated per executor. This is actually the spark.yarn.executor.memoryOverhead property.

    • Core per executor: select this check box and, in the displayed field, enter the number of cores to be used by each executor. If you leave this check box clear, the default allocation defined by Spark is used; for example, in the Standalone mode, all available cores are used by a single executor.

    • Yarn resource allocation: select how you want Yarn to allocate resources among executors.
      • Auto: you let Yarn use its default number of executors. This number is 2.

      • Fixed: you need to enter the number of executors to be used in the Num executors field that is displayed.

      • Dynamic: Yarn adapts the number of executors to suit the workload. Define the scale of this dynamic allocation by entering the initial number of executors to run in the Initial executors field, the minimum number of executors in the Min executors field, and the maximum number of executors in the Max executors field.

    • Set Web UI port: if you need to change the default port of the Spark Web UI, select this check box and enter the port number you want to use.

    • Broadcast factory: select the broadcast implementation to be used to cache variables on each worker machine.

    • Customize Spark serializer: if you need to import an external Spark serializer, select this check box and in the field that is displayed, enter the fully qualified class name of the serializer to be used.

    • Job progress polling rate (in ms): when using Spark V2.3 and onwards, enter the time interval (in milliseconds) at which the Studio asks Spark for the execution progress of your Job. Before V2.3, Spark automatically sends this information to the Studio when updates occur; the default value of this parameter, 50 milliseconds, allows the Studio to reproduce roughly the same behavior with Spark V2.3 and onwards.

      If you set this interval too long, you may miss progress information; if you set it too short, you may send too many requests to Spark for insignificant progress updates.
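
    As with the Standalone mode, the sketch below lists the standard Spark properties that these Yarn client-mode fields roughly correspond to, with placeholder values; it is an assumption-based illustration, not the code the Studio generates.

      // Illustrative only: standard Spark properties that roughly correspond to
      // the YARN client-mode tuning fields above. All values are placeholders.
      import org.apache.spark.SparkConf

      val conf = new SparkConf()
        .setMaster("yarn")
        .setAppName("tuning_sketch")
        .set("spark.yarn.am.memory", "1g")                // Set application master tuning properties: memory
        .set("spark.yarn.am.cores", "1")                  // Set application master tuning properties: CPUs
        .set("spark.executor.memory", "4g")               // Executor memory
        .set("spark.yarn.executor.memoryOverhead", "512") // Set executor memory overhead (in MB)
        .set("spark.executor.cores", "2")                 // Core per executor
        .set("spark.executor.instances", "6")             // Yarn resource allocation, Fixed: Num executors
        // Yarn resource allocation, Dynamic: instead of spark.executor.instances,
        // you would enable dynamic allocation and bound it, for example:
        //   .set("spark.dynamicAllocation.enabled", "true")
        //   .set("spark.dynamicAllocation.initialExecutors", "2") // Initial executors
        //   .set("spark.dynamicAllocation.minExecutors", "1")     // Min executors
        //   .set("spark.dynamicAllocation.maxExecutors", "10")    // Max executors
        .set("spark.ui.port", "4050")                     // Set Web UI port
        .set("spark.serializer",                          // Customize Spark serializer
          "org.apache.spark.serializer.KryoSerializer")

    Dynamic allocation also relies on executor shuffle data remaining available when executors are removed, for example through the external shuffle service on the cluster.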

    Spark Yarn cluster mode

    • Driver memory and Driver core: enter the allocation size of memory and the number of cores to be used by the driver of the current Job.

    • Executor memory: enter the allocation size of memory to be used by each Spark executor.

    • Set executor memory overhead: select this check box and in the field that is displayed, enter the amount of off-heap memory (in MB) to be allocated per executor. This is actually the spark.yarn.executor.memoryOverhead property.

    • Core per executor: select this check box and, in the displayed field, enter the number of cores to be used by each executor. If you leave this check box clear, the default allocation defined by Spark is used; for example, in the Standalone mode, all available cores are used by a single executor.

    • Yarn resource allocation: select how you want Yarn to allocate resources among executors.
      • Auto: you let Yarn use its default number of executors. This number is 2.

      • Fixed: you need to enter the number of executors to be used in the Num executors field that is displayed.

      • Dynamic: Yarn adapts the number of executors to suit the workload. Define the scale of this dynamic allocation by entering the initial number of executors to run in the Initial executors field, the minimum number of executors in the Min executors field, and the maximum number of executors in the Max executors field.

    • Set Web UI port: if you need to change the default port of the Spark Web UI, select this check box and enter the port number you want to use.

    • Broadcast factory: select the broadcast implementation to be used to cache variables on each worker machine.

    • Customize Spark serializer: if you need to import an external Spark serializer, select this check box and in the field that is displayed, enter the fully qualified class name of the serializer to be used.

    • Job progress polling rate (in ms): when using Spark V2.3 and onwards, enter the time interval (in milliseconds) at which the Studio asks Spark for the execution progress of your Job. Before V2.3, Spark automatically sends this information to the Studio when updates occur; the default value of this parameter, 50 milliseconds, allows the Studio to reproduce roughly the same behavior with Spark V2.3 and onwards.

      If you set this interval too long, you may miss progress information; if you set it too short, you may send too many requests to Spark for insignificant progress updates.
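
    The Yarn cluster-mode fields map to Spark properties in the same way as the client-mode ones, except that in cluster mode the driver runs inside the ApplicationMaster, so the driver properties take the place of the spark.yarn.am.* ones. The sketch below is again an illustration with placeholder values, not the code the Studio generates.

      // Illustrative only: in YARN cluster mode the driver runs in the
      // ApplicationMaster, so spark.driver.* replaces the spark.yarn.am.*
      // properties from the client-mode sketch above. All values are placeholders;
      // the remaining properties (Web UI port, serializer, dynamic allocation)
      // are the same as in the client-mode sketch.
      import org.apache.spark.SparkConf

      val conf = new SparkConf()
        .setMaster("yarn")
        .setAppName("tuning_sketch")
        .set("spark.submit.deployMode", "cluster")        // run the driver inside YARN
        .set("spark.driver.memory", "2g")                 // Driver memory
        .set("spark.driver.cores", "1")                   // Driver core
        .set("spark.executor.memory", "4g")               // Executor memory
        .set("spark.yarn.executor.memoryOverhead", "512") // Set executor memory overhead (in MB)
        .set("spark.executor.cores", "2")                 // Core per executor
        .set("spark.executor.instances", "6")             // Yarn resource allocation, Fixed: Num executors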