Logging and checkpointing the activities of your Apache Spark Job - 6.5


Activate the Spark logging and checkpointing system in the Spark configuration tab of the Run view of your Spark Job. Logging helps you debug the Job, and checkpointing helps you resume it when issues arise.

The information in this section applies only to users who have subscribed to Talend Data Fabric or to any Talend product with Big Data. It does not apply to Talend Open Studio for Big Data users.


  1. If you need the Job to be resilient to failure, select the Activate checkpointing check box to enable the Spark checkpointing operation. In the field that is displayed, enter the directory, in the file system of the cluster, in which Spark stores the context data of the computations, such as the metadata and the generated RDDs.

    For further information about the Spark checkpointing operation, see the Spark documentation: http://spark.apache.org/docs/latest/streaming-programming-guide.html#checkpointing

  2. In YARN client mode, you can make the Spark application logs of this Job persistent in the file system. To do this, select the Enable Spark event logging check box.
    The parameters relevant to Spark logs are displayed:
    • Spark event logs directory: enter the directory in which Spark events are logged. This field corresponds to the spark.eventLog.dir property.

    • Spark history server address: enter the location of the history server. This field corresponds to the spark.yarn.historyServer.address property.

    • Compress Spark event logs: if needed, select this check box to compress the logs. This check box corresponds to the spark.eventLog.compress property.

    The administrator of your cluster may already have defined these properties in the cluster configuration files, so contact the administrator for the exact values.
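
To illustrate what the checkpointing option in step 1 corresponds to in plain Spark, the following PySpark sketch sets a checkpoint directory and checkpoints an RDD. It is an illustration under stated assumptions, not the code that Talend Studio generates. The names checkpoint_and_count and main, the /tmp/spark-checkpoints path, and the sample data are hypothetical. The pyspark import is kept inside main() and assumes that PySpark is installed.

```python
def checkpoint_and_count(sc, checkpoint_dir, values):
    """Checkpoint an RDD built from `values` and return its element count.

    `sc` is expected to be a pyspark.SparkContext.
    """
    # Tell Spark where to write checkpoint data. This plays the role of the
    # directory entered in the Studio checkpointing field.
    sc.setCheckpointDir(checkpoint_dir)

    # Build a small RDD with one transformation (hypothetical sample workload).
    rdd = sc.parallelize(values).map(lambda x: x * 2)

    # Mark the RDD for checkpointing. The data is written to the checkpoint
    # directory when the first action runs on this RDD.
    rdd.checkpoint()

    # count() is an action, so it triggers the computation and the checkpoint.
    return rdd.count()


def main():
    # Imported here so that the helper above can be read and reused
    # without PySpark being importable at module load time.
    from pyspark import SparkContext

    # Local master and application name are hypothetical values for the sketch.
    sc = SparkContext("local[*]", "checkpoint-sketch")
    try:
        print(checkpoint_and_count(sc, "/tmp/spark-checkpoints", range(10)))
    finally:
        sc.stop()
```

Calling main() on a machine with PySpark installed runs the sketch locally. On a cluster, the checkpoint directory would normally be a path in the cluster file system, as described in step 1.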
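
The event logging fields in step 2 can be pictured as a small set of Spark properties. The sketch below maps the three fields to the properties named above and renders them as --conf arguments for spark-submit. The function names, the HDFS path, and the history server host and port are hypothetical. The spark.eventLog.enabled property is not mentioned in this section; the sketch assumes that the Enable Spark event logging check box corresponds to it.

```python
def event_log_conf(log_dir, history_server, compress=False):
    """Return the Spark properties that mirror the event logging fields."""
    return {
        # Assumption: the Enable Spark event logging check box maps to this.
        "spark.eventLog.enabled": "true",
        # Spark event logs directory field.
        "spark.eventLog.dir": log_dir,
        # Spark history server address field.
        "spark.yarn.historyServer.address": history_server,
        # Compress Spark event logs check box.
        "spark.eventLog.compress": "true" if compress else "false",
    }


def to_spark_submit_args(conf):
    """Render a property dict as a flat list of spark-submit --conf arguments."""
    args = []
    for key, value in conf.items():
        args += ["--conf", f"{key}={value}"]
    return args


# Hypothetical values; ask the cluster administrator for the real ones.
conf = event_log_conf(
    "hdfs:///user/spark/applicationHistory",
    "historyserver.example.com:18080",
    compress=True,
)
print(" ".join(to_spark_submit_args(conf)))
```

Keeping the mapping in one function makes it easy to compare the values entered in the Studio with those already defined in the cluster configuration files.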