Adding advanced Spark properties to solve issues - 7.0

Spark Batch

author
Talend Documentation Team
EnrichVersion
7.0
EnrichProdName
Talend Big Data
Talend Big Data Platform
Talend Data Fabric
Talend Real-Time Big Data Platform
task
Design and Development > Designing Jobs > Job Frameworks > Spark Batch
EnrichPlatform
Talend Studio

Depending on the distribution you are using or the issues you encounter, you may need to add specific Spark properties to the Advanced properties table in the Spark configuration tab of the Run view of your Job.

The information in this section is only for users who have subscribed to Talend Data Fabric or to any Talend product with Big Data. It is not applicable to Talend Open Studio for Big Data users.

The advanced properties required by different Hadoop distributions, or needed to work around some common issues, are listed below along with their values:

For further information about the valid Spark properties, see the Spark documentation at https://spark.apache.org/docs/latest/configuration.

Specific Spark timeout

When encountering network issues, Spark by default waits for up to 45 minutes before stopping its attempts to submit Jobs, after which it triggers the automatic stop of your Job.

Add the following properties to the Hadoop properties table of tHDFSConfiguration to reduce this duration; a programmatic sketch of the same settings follows the list.

  • ipc.client.ping: false. This stops the client from pinging the server when the server does not answer.

  • ipc.client.connect.max.retries: 0. This sets the number of retries when a connection demand is answered but refused; 0 disables retries.

  • yarn.resourcemanager.connect.retry-interval.ms: any number. This sets, in milliseconds, how often Spark retries the connection to the ResourceManager service before giving up.
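
For reference, the following minimal Java sketch shows what these table entries amount to when the Hadoop client API is used directly, outside the Studio. The namenode URI and the 30000 ms retry interval are placeholders chosen for illustration, not values prescribed by this documentation.

  import java.net.URI;

  import org.apache.hadoop.conf.Configuration;
  import org.apache.hadoop.fs.FileSystem;
  import org.apache.hadoop.fs.Path;

  public class HdfsTimeoutSettings {
      public static void main(String[] args) throws Exception {
          Configuration conf = new Configuration();

          // Do not ping a server that stops answering.
          conf.set("ipc.client.ping", "false");
          // Do not retry when a connection demand is answered but refused.
          conf.set("ipc.client.connect.max.retries", "0");
          // Retry the ResourceManager every 30 seconds (placeholder value).
          conf.set("yarn.resourcemanager.connect.retry-interval.ms", "30000");

          // Placeholder namenode URI; replace it with your cluster's.
          FileSystem fs = FileSystem.get(URI.create("hdfs://namenode:8020"), conf);
          System.out.println(fs.exists(new Path("/")));
      }
  }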

Hortonworks Data Platform V2.4

  • spark.yarn.am.extraJavaOptions: -Dhdp.version=2.4.0.0-169

  • spark.driver.extraJavaOptions: -Dhdp.version=2.4.0.0-169

In addition, you need to add -Dhdp.version=2.4.0.0-169 to the JVM settings area, either in the Advanced settings tab of the Run view or in the Talend > Run/Debug view of the [Preferences] window. Setting this argument in the [Preferences] window applies it to all the Jobs designed in the same Studio.
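
As an illustration only, a hand-written Spark driver launched through spark-submit --master yarn could set the same two properties as follows. This is a sketch under that assumption, not the code the Studio generates.

  import org.apache.spark.SparkConf;
  import org.apache.spark.api.java.JavaSparkContext;

  public class HdpVersionJob {
      public static void main(String[] args) {
          SparkConf conf = new SparkConf()
                  .setAppName("HdpVersionJob")
                  // Pass the HDP stack version to the YARN application master.
                  .set("spark.yarn.am.extraJavaOptions", "-Dhdp.version=2.4.0.0-169")
                  // Effective in cluster mode only; in client mode the driver
                  // JVM is already running, so the option must be passed to the
                  // launching JVM instead, which is the role of the JVM
                  // settings area described above.
                  .set("spark.driver.extraJavaOptions", "-Dhdp.version=2.4.0.0-169");

          JavaSparkContext sc = new JavaSparkContext(conf);
          // ... the rest of the Job logic goes here ...
          sc.stop();
      }
  }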

MapR V5.1 and V5.2

When the cluster is used with the HBase or the MapRDB components:

spark.hadoop.yarn.application.classpath: enter the value of this parameter specific to your cluster and, if it is missing, add the HBase classpath so that the Job can find the required classes and packages in the cluster.

For example, if the HBase version installed in the cluster is 1.1.1, copy and paste all the paths defined with the spark.hadoop.yarn.application.classpath parameter from your cluster, then add /opt/mapr/hbase/hbase-1.1.1/lib/* and /opt/mapr/lib/* to these paths, separating each path with a comma (,). The added paths are where HBase is usually installed in a MapR cluster. If your HBase is installed elsewhere, contact the administrator of your cluster for details and adapt these paths accordingly.
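
To illustrate how the resulting value is assembled, the Java sketch below sets the property in a hand-written driver. The first classpath entry is a placeholder standing in for the entries copied from your cluster; only the two trailing HBase paths come from this procedure.

  import org.apache.spark.SparkConf;
  import org.apache.spark.api.java.JavaSparkContext;

  public class MapRHBaseClasspath {
      public static void main(String[] args) {
          // Placeholder for the entries copied from your cluster's own
          // yarn.application.classpath, followed by the two HBase paths,
          // all separated by commas.
          String classpath =
                  "/your/cluster/classpath/entries,"
                  + "/opt/mapr/hbase/hbase-1.1.1/lib/*,"
                  + "/opt/mapr/lib/*";

          SparkConf conf = new SparkConf()
                  .setAppName("MapRHBaseClasspath")
                  .set("spark.hadoop.yarn.application.classpath", classpath);

          JavaSparkContext sc = new JavaSparkContext(conf);
          sc.stop();
      }
  }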

For a step-by-step explanation about how to add this parameter, see HBase/MapR-DB Jobs cannot run successfully with MapR.