Tuning Spark for Apache Spark Streaming Jobs - Cloud - 8.0

Spark Streaming

Version
Cloud
8.0
Language
English
Product
Talend Big Data
Talend Big Data Platform
Talend Data Fabric
Talend Real-Time Big Data Platform
Module
Talend Studio
Content
Design and Development > Designing Jobs > Job Frameworks > Spark Streaming
Last publication date
2024-02-20

You can define the tuning parameters in the Spark configuration tab of the Run view of your Spark Job to obtain better performance of the Job, if the default values of these parameters does not produce sufficient performance.

Generally speaking, Spark performs better with lower amount of big tasks than with high amount of small tasks.

The information in this section applies only to Talend Real-Time Big Data Platform and Talend Data Fabric.

Procedure

  1. Select the Set Tuning properties check box to optimize the allocation of the resources to be used to run this Job. These properties are not mandatory for the Job to run successfully, but they are useful when Spark is bottlenecked by any resource issue in the cluster such as CPU, bandwidth or memory.
  2. Calculate the initial resource allocation as the point to start the tuning.
    A generic formula for this calculation is
    • Number of executors = (Total cores of the cluster) / 2

    • Number of cores per executor = 2

    • Memory per executor = (Up to total memory of the cluster) / (Number of executors)

  3. Define each parameter and if needed, revise them until you obtain the satisfactory performance.

    Spark Standalone mode

    • Driver memory and Driver core: enter the allocation size of memory and the number of cores to be used by the driver of the current Job.

    • Executor memory: enter the allocation size of memory to be used by each Spark executor.

    • Core per executor: select this check box and in the displayed field, enter the number of cores to be used by each executor. If you leave this check box clear, the default allocation defined by Spark is used, for example, all available cores are used by one single executor in the Standalone mode.

    • Set Web UI port: if you need to change the default port of the Spark Web UI, select this check box and enter the port number you want to use.

    • Broadcast factory: select the broadcast implementation to be used to cache variables on each worker machine.

    • Customize Spark serializer: if you need to import an external Spark serializer, select this check box and in the field that is displayed, enter the fully qualified class name of the serializer to be used.

    • Activate backpressure: select this check box to enable the backpressure feature of Spark. The backpressure feature is available in the Spark version 1.5 and onwards. With backpress enabled, Spark automatically finds the optimal receiving rate and dynamically adapts the rate based on current batch scheduling delays and processing time, in order to receive data only as fast as it can process.

    Spark Yarn client mode

    • Set application master tuning properties: select this check box and in the fields that are displayed, enter the amount of memory and the number of CPUs to be allocated to the ApplicationMaster service of your cluster.

      If you want to use the default allocation of your cluster, leave this check box clear.

    • Executor memory: enter the allocation size of memory to be used by each Spark executor.

    • Set executor memory overhead: select this check box and in the field that is displayed, enter the amount of off-heap memory (in MB) to be allocated per executor. This is actually the spark.yarn.executor.memoryOverhead property.

    • Core per executor: select this check box and in the displayed field, enter the number of cores to be used by each executor. If you leave this check box clear, the default allocation defined by Spark is used, for example, all available cores are used by one single executor in the Standalone mode.

    • Yarn resource allocation: select how you want Yarn to allocate resources among executors.
      • Auto: you let Yarn use its default number of executors. This number is 2.

      • Fixed: you need to enter the number of executors to be used in the Num executors that is displayed.

      • Dynamic: Yarn adapts the number of executors to suit the workload. You need to define the scale of this dynamic allocation by defining the initial number of executors to run in the Initial executors field, the lowest number of executors in the Min executors field and the largest number of executors in the Max executors field.

    • Set Web UI port: if you need to change the default port of the Spark Web UI, select this check box and enter the port number you want to use.

    • Broadcast factory: select the broadcast implementation to be used to cache variables on each worker machine.

    • Customize Spark serializer: if you need to import an external Spark serializer, select this check box and in the field that is displayed, enter the fully qualified class name of the serializer to be used.

    • Activate backpressure: select this check box to enable the backpressure feature of Spark. The backpressure feature is available in the Spark version 1.5 and onwards. With backpress enabled, Spark automatically finds the optimal receiving rate and dynamically adapts the rate based on current batch scheduling delays and processing time, in order to receive data only as fast as it can process.

    Spark Yarn cluster mode

    • Driver memory and Driver core: enter the allocation size of memory and the number of cores to be used by the driver of the current Job.

    • Executor memory: enter the allocation size of memory to be used by each Spark executor.

    • Set executor memory overhead: select this check box and in the field that is displayed, enter the amount of off-heap memory (in MB) to be allocated per executor. This is actually the spark.yarn.executor.memoryOverhead property.

    • Core per executor: select this check box and in the displayed field, enter the number of cores to be used by each executor. If you leave this check box clear, the default allocation defined by Spark is used, for example, all available cores are used by one single executor in the Standalone mode.

    • Yarn resource allocation: select how you want Yarn to allocate resources among executors.
      • Auto: you let Yarn use its default number of executors. This number is 2.

      • Fixed: you need to enter the number of executors to be used in the Num executors that is displayed.

      • Dynamic: Yarn adapts the number of executors to suit the workload. You need to define the scale of this dynamic allocation by defining the initial number of executors to run in the Initial executors field, the lowest number of executors in the Min executors field and the largest number of executors in the Max executors field.

    • Set Web UI port: if you need to change the default port of the Spark Web UI, select this check box and enter the port number you want to use.

    • Broadcast factory: select the broadcast implementation to be used to cache variables on each worker machine.

    • Customize Spark serializer: if you need to import an external Spark serializer, select this check box and in the field that is displayed, enter the fully qualified class name of the serializer to be used.

    • Activate backpressure: select this check box to enable the backpressure feature of Spark. The backpressure feature is available in the Spark version 1.5 and onwards. With backpress enabled, Spark automatically finds the optimal receiving rate and dynamically adapts the rate based on current batch scheduling delays and processing time, in order to receive data only as fast as it can process.