Logging and checkpointing the activities of your Apache Spark Job - 7.3

It is recommended to activate the Spark logging and checkpointing system in the Spark configuration tab of the Run view of your Spark Job, so that you can debug the Job and resume it when issues arise.

The information in this section is only for users who have subscribed to Talend Data Fabric or to any Talend product with Big Data.

Procedure

  1. If you need the Job to be resilient to failure, select the Activate checkpointing check box to enable the Spark checkpointing operation. In the field that is displayed, enter the directory in the file system of the cluster in which Spark stores the context data of the computations, such as the metadata and the generated RDDs of this computation.

    For more information about the Spark checkpointing operation, see the official Spark documentation.
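
    Conceptually, this check box corresponds to defining a checkpoint directory on the Spark context. The Scala sketch below only illustrates that mechanism and is not the code generated by the Job; the HDFS path is a hypothetical example.

      import org.apache.spark.{SparkConf, SparkContext}

      // Minimal sketch: defining a checkpoint directory on the Spark context.
      // The HDFS path is a placeholder; use a directory that exists in the
      // file system of your cluster.
      val sc = new SparkContext(new SparkConf().setAppName("CheckpointingSketch"))
      sc.setCheckpointDir("hdfs:///user/example/spark/checkpoints")

      val rdd = sc.parallelize(1 to 100).map(_ * 2)
      rdd.checkpoint()  // mark this RDD to be checkpointed
      rdd.count()       // an action triggers the actual checkpoint write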

  2. In Yarn client mode or Yarn cluster mode, you can make the Spark application logs of this Job persistent in the file system. To do this, select the Enable Spark event logging check box.
    The parameters relevant to Spark logs are displayed:
    • Spark event logs directory: enter the directory in which Spark events are logged. This is actually the spark.eventLog.dir property.

    • Spark history server address: enter the location of the history server. This is actually the spark.yarn.historyServer.address property.

    • Compress Spark event logs: if need be, select this check box to compress the logs. This is actually the spark.eventLog.compress property.

    Since the administrator of your cluster may have defined these properties in the cluster configuration files, it is recommended to contact the administrator for the exact values.
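
    For reference, the check box and its fields map to standard Spark properties. The Scala sketch below only illustrates that mapping; the directory and history server address are hypothetical examples to be replaced with the values defined on your cluster.

      import org.apache.spark.SparkConf

      // Sketch of the Spark properties behind the event-logging options.
      // The directory and history server address are placeholders.
      val conf = new SparkConf()
        .setAppName("EventLoggingSketch")
        .set("spark.eventLog.enabled", "true")               // Enable Spark event logging
        .set("spark.eventLog.dir", "hdfs:///spark-history")  // Spark event logs directory
        .set("spark.yarn.historyServer.address", "history.example.com:18080") // Spark history server address
        .set("spark.eventLog.compress", "true")              // Compress Spark event logs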

  3. If you want the Spark context that your Job starts to be printed in the log, add the spark.logConf property to the Advanced properties table and enter, within double quotation marks, true in the Value column of this table.

    Since the administrator of your cluster may have defined this property in the cluster configuration files, it is recommended to contact the administrator for the exact value.
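
    For illustration, outside of Talend Studio the same behavior can be obtained by setting the property directly on a SparkConf; this is only a sketch, not the code the Job generates.

      import org.apache.spark.SparkConf

      // spark.logConf = "true" makes Spark print the effective configuration
      // to the driver log at INFO level when the SparkContext starts.
      val conf = new SparkConf()
        .setAppName("LogConfSketch")
        .set("spark.logConf", "true")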