Adding advanced Spark properties to solve issues - 7.0

Spark Batch

author
Talend Documentation Team
EnrichVersion
7.0
EnrichProdName
Talend Big Data
Talend Big Data Platform
Talend Data Fabric
Talend Real-Time Big Data Platform
task
Design and Development > Designing Jobs > Job Frameworks > Spark Batch
EnrichPlatform
Talend Studio

Depending on the distribution you are using or the issues you encounter, you may need to add specific Spark properties to the Advanced properties table in the Spark configuration tab of the Run view of your Job.

The information in this section is only for users who have subscribed to Talend Data Fabric or to any Talend product with Big Data. It is not applicable to Talend Open Studio for Big Data users.

The advanced properties required by different Hadoop distributions, or needed to work around some common issues, are listed below along with their values:

For further information about the valid Spark properties, see the Spark documentation at https://spark.apache.org/docs/latest/configuration.

Specific Spark timeout

When encountering network issues, Spark by default waits for up to 45 minutes before stopping its attempts to submit Jobs, after which it triggers the automatic stop of your Job.

Add the following properties to the Hadoop properties table of tHDFSConfiguration to reduce this duration; a programmatic sketch of the same settings follows the list.

  • ipc.client.ping: false. This stops the client from pinging the server when the server does not answer.

  • ipc.client.connect.max.retries: 0. This sets the number of retries when a connection demand is answered but refused; 0 disables retries.

  • yarn.resourcemanager.connect.retry-interval.ms: any number. This sets, in milliseconds, how often Spark retries the connection to the ResourceManager service before giving up.
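
For reference, the following minimal Java sketch shows what these table entries amount to when the Hadoop client API is used directly, outside the Studio. The namenode URI and the 30000 ms retry interval are placeholders chosen for illustration, not values prescribed by this documentation.

  import java.net.URI;

  import org.apache.hadoop.conf.Configuration;
  import org.apache.hadoop.fs.FileSystem;
  import org.apache.hadoop.fs.Path;

  public class HdfsTimeoutSettings {
      public static void main(String[] args) throws Exception {
          Configuration conf = new Configuration();

          // Do not ping a server that stops answering.
          conf.set("ipc.client.ping", "false");
          // Do not retry when a connection demand is answered but refused.
          conf.set("ipc.client.connect.max.retries", "0");
          // Retry the ResourceManager every 30 seconds (placeholder value).
          conf.set("yarn.resourcemanager.connect.retry-interval.ms", "30000");

          // Placeholder namenode URI; replace it with your cluster's.
          FileSystem fs = FileSystem.get(URI.create("hdfs://namenode:8020"), conf);
          System.out.println(fs.exists(new Path("/")));
      }
  }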

Hortonworks Data Platform V2.4

  • spark.yarn.am.extraJavaOptions: -Dhdp.version=2.4.0.0-169

  • spark.driver.extraJavaOptions: -Dhdp.version=2.4.0.0-169

In addition, you need to add -Dhdp.version=2.4.0.0-169 to the JVM settings area, either in the Advanced settings tab of the Run view or in the Talend > Run/Debug view of the [Preferences] window. Setting this argument in the [Preferences] window applies it to all the Jobs designed in the same Studio.
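
As an illustration only, a hand-written Spark driver launched through spark-submit --master yarn could set the same two properties as follows. This is a sketch under that assumption, not the code the Studio generates.

  import org.apache.spark.SparkConf;
  import org.apache.spark.api.java.JavaSparkContext;

  public class HdpVersionJob {
      public static void main(String[] args) {
          SparkConf conf = new SparkConf()
                  .setAppName("HdpVersionJob")
                  // Pass the HDP stack version to the YARN application master.
                  .set("spark.yarn.am.extraJavaOptions", "-Dhdp.version=2.4.0.0-169")
                  // Effective in cluster mode only; in client mode the driver
                  // JVM is already running, so the option must be passed to the
                  // launching JVM instead, which is the role of the JVM
                  // settings area described above.
                  .set("spark.driver.extraJavaOptions", "-Dhdp.version=2.4.0.0-169");

          JavaSparkContext sc = new JavaSparkContext(conf);
          // ... the rest of the Job logic goes here ...
          sc.stop();
      }
  }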

MapR V5.1 and V5.2

When the cluster is used with the HBase or the MapRDB components:

spark.hadoop.yarn.application.classpath: enter the value of this parameter specific to your cluster and, if it is missing, add the HBase classpath so that the Job can find the required classes and packages in the cluster.

For example, if the HBase version installed in the cluster is 1.1.1, copy and paste all the paths defined with the spark.hadoop.yarn.application.classpath parameter from your cluster, then add /opt/mapr/hbase/hbase-1.1.1/lib/* and /opt/mapr/lib/* to these paths, separating each path with a comma (,). The added paths are where HBase is usually installed in a MapR cluster. If your HBase is installed elsewhere, contact the administrator of your cluster for details and adapt these paths accordingly.
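
To illustrate how the resulting value is assembled, the Java sketch below sets the property in a hand-written driver. The first classpath entry is a placeholder standing in for the entries copied from your cluster; only the two trailing HBase paths come from this procedure.

  import org.apache.spark.SparkConf;
  import org.apache.spark.api.java.JavaSparkContext;

  public class MapRHBaseClasspath {
      public static void main(String[] args) {
          // Placeholder for the entries copied from your cluster's own
          // yarn.application.classpath, followed by the two HBase paths,
          // all separated by commas.
          String classpath =
                  "/your/cluster/classpath/entries,"
                  + "/opt/mapr/hbase/hbase-1.1.1/lib/*,"
                  + "/opt/mapr/lib/*";

          SparkConf conf = new SparkConf()
                  .setAppName("MapRHBaseClasspath")
                  .set("spark.hadoop.yarn.application.classpath", classpath);

          JavaSparkContext sc = new JavaSparkContext(conf);
          sc.stop();
      }
  }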

For a step-by-step explanation about how to add this parameter, see HBase/MapR-DB Jobs cannot run successfully with MapR.