Configuring Spark connection using the Job script API

Author: Talend Documentation Team
Version: 6.5
Products: Talend Big Data, Talend Big Data Platform, Talend Data Fabric, Talend Real-Time Big Data Platform
Tasks: Design and Development > Designing Jobs > Job Frameworks > Spark Batch; Design and Development > Designing Jobs > Job Frameworks > Spark Streaming
Platform: Talend Studio

Use the addElementParameters{} function in the addParameters{} function to define the Spark connection in a Job script.

addElementParameters {} properties

Properties relevant to selecting the Spark cluster to be used:

Function/parameter Description Mandatory?

SPARK_LOCAL_MODE

Enter "true" to run your Spark Job in the local mode. By default, the value is "false", which means to use a remote cluster.

In the local mode, the Studio builds the Spark environment within itself on the fly in order to run the Job. Each processor of the local machine is used as a Spark worker to perform the computations.

In this mode, your local file system is used; therefore, if you have placed configuration components such as tS3Configuration or tHDFSConfiguration in your Job to provide connection information to a remote file system, deactivate them.

You can launch your Job without any further configuration.

Yes

SPARK_LOCAL_VERSION

Enter the Spark version to be used in the local mode. This property is relevant only when you have entered "true" for SPARK_LOCAL_MODE.

The Studio does not support all the Spark versions in the local mode. Enter one of the following values:
  • "SPARK_1_3_0"

  • "SPARK_1_4_0"

  • "SPARK_1_5_0"

  • "SPARK_1_6_0"

  • "SPARK_2_0_0"

  • "SPARK_2_1_0"

Yes when Spark local mode is used.

DISTRIBUTION

Enter the name of the provider of your distribution.

Depending on your distribution, enter one of the following values:
  • "CLOUDERA"

  • "CLOUDERA_ALTUS"

  • "GOOGLE_CLOUD_DATAPROC"

  • "HORTONWORKS"

  • "MAPR"

  • "MICROSOFT_HD_INSIGHT"

Yes when you are using neither the Spark local mode nor the Amazon EMR distribution.

SPARK_VERSION

Enter the version of your distribution.

The following list provides example formats for each available distribution:
  • "Cloudera_CDH5_12"

  • "Cloudera_Altus_CDH5_11"

  • "DATAPROC_1_1"

  • "HDP_2_6"

  • "MAPR600"

  • "MICROSOFT_HD_INSIGHT_3_6"

  • "EMR_5_5_0"

For more information about the distribution versions supported by Talend, see Supported Hadoop distribution versions for Talend Jobs.

Yes when you are not using Spark local mode.

SUPPORTED_SPARK_VERSION

Enter the Spark version used by your distribution. For example, "SPARK_2_1_0".

Yes when you are not using Spark local mode.

SPARK_API_VERSION

Enter "SPARK_200", the Spark API version used by Talend.

Yes.

SET_HDP_VERSION

Enter "true" if your Hortonworks cluster is using the hdp.version variable to store its version; otherwise, enter "false". Contact the administrator of your cluster if you are not sure about this information.

Yes when you are using Hortonworks.

HDP_VERSION

Enter the Hortonworks version to be used, for example, "\"2.6.0.3-8\"". Contact the administrator of your cluster if you are not sure about this information.

You must add the version number to the yarn-site.xml file of your cluster, too. In this example, add hdp.version=2.6.0.3-8.

Yes when you have entered "true" for SET_HDP_VERSION.

SPARK_MODE

Enter the mode in which your Spark cluster has been implemented.

Depending on your situation, enter one of the following values:
  • "CLUSTER": means to run in the Spark Standalone mode.

  • "YARN_CLIENT"

Yes when you are not using the Spark local mode.
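
The following fragment is a minimal sketch of how the properties in this table might be combined for a Hortonworks cluster in the Yarn client mode. It assumes that each property is written as a NAME : "value" pair separated by commas, the syntax used by the SPARK_ADVANCED_PROPERTIES example on this page, and that the pairs sit directly in addElementParameters {} inside addParameters {}. The values are the sample values from the table, not recommendations; in particular, check which Spark version your distribution actually uses before setting SUPPORTED_SPARK_VERSION. These assumptions are listed here rather than as inline comments because this page does not show the Job script comment syntax.

```
addParameters {
    addElementParameters {
        SPARK_LOCAL_MODE : "false",
        DISTRIBUTION : "HORTONWORKS",
        SPARK_VERSION : "HDP_2_6",
        SUPPORTED_SPARK_VERSION : "SPARK_2_1_0",
        SPARK_API_VERSION : "SPARK_200",
        SET_HDP_VERSION : "true",
        HDP_VERSION : "\"2.6.0.3-8\"",
        SPARK_MODE : "YARN_CLIENT"
    }
}
```

For the local mode, the same block would instead set SPARK_LOCAL_MODE to "true" and add SPARK_LOCAL_VERSION, for example "SPARK_2_1_0"; the distribution-related properties of this table are then not required.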

Properties relevant to configuring the connection to Spark:

Function/parameter Description Mandatory?

RESOURCE_MANAGER

Enter the address of the ResourceManager service of the Hadoop cluster to be used.

Yes when you are using the Yarn client mode.

SET_SCHEDULER_ADDRESS

Enter "true" if your cluster possesses a ResourceManager scheduler; otherwise, enter "false".

Yes when you are using the Yarn client mode.

RESOURCEMANAGER_SCHEDULER_ADDRESS

Enter the Scheduler address.

Yes when you have entered "true" for SET_SCHEDULER_ADDRESS.

SET_JOBHISTORY_ADDRESS

Enter "true" if your cluster possesses a JobHistory service; otherwise, enter "false".

Yes when you are using the Yarn client mode.

JOBHISTORY_ADDRESS

Enter the location of the JobHistory server of the Hadoop cluster to be used. This allows the metrics information of the current Job to be stored in that JobHistory server.

Yes when you have entered "true" for SET_JOBHISTORY_ADDRESS.

SET_STAGING_DIRECTORY

Enter "true" if your cluster possesses a staging directory to store the temporary files created by running programs; otherwise, enter "false".

Yes when you are using the Yarn client mode.

STAGING_DIRECTORY

Enter this directory, for example, "\"/user\"". Typically, this directory can be found under the yarn.app.mapreduce.am.staging-dir property in the configuration files such as yarn-site.xml or mapred-site.xml of your distribution.

Yes when you have entered "true" for SET_STAGING_DIRECTORY.

HDINSIGHT_ENDPOINT

Enter the endpoint of your HDInsight cluster. For example, "\"https://mycluster.azurehdinsight.net\"".

Yes when you are using the related distribution.

HDINSIGHT_USERNAME and HDINSIGHT_PASSWORD

Enter the authentication information of the HDInsight cluster to be used.

For example, "\"talendstorage\"" as username and "my_password" as password.

Yes when you are using the related distribution.

LIVY_HOST

Enter the hostname of your Livy service, which uses the following syntax: your_spark_cluster_name.azurehdinsight.net. For further information about the Livy service used by HDInsight, see Submit Spark jobs using Livy.

Yes when you are using the related distribution, HDInsight.

LIVY_PORT

Enter the port number of your Livy service. By default, the port number is "\"443\"".

Yes when you are using the related distribution, HDInsight.

LIVY_USERNAME

Enter your HDInsight username, for example, "\"my_hdinsight_account\"".

Yes when you are using the related distribution, HDInsight.

WASB_HOST

Enter the address of your Windows Azure Storage blob, for example, "\"https://my_storage_account_name.blob.core.windows.net\"".

Yes when you are using the related distribution, HDInsight.

WASB_CONTAINER

Enter the name of the container to be used, for example, "\"talend_container\"".

Yes when you are using the related distribution, HDInsight.

REMOTE_FOLDER

Enter the location in which you want to store the current Job and its dependent libraries in this Azure Storage account, for example, "\"/user/ychen/deployment_blob\"".

Yes when you are using the related distribution, HDInsight.

SPARK_HOST

Enter the URI of the Spark Master of the Hadoop cluster to be used, for example, "\"spark://localhost:7077\"".

Yes when you are using the Spark Standalone mode.

SPARK_HOME

Enter the location of the Spark executable installed in the Hadoop cluster to be used, for example, "\"/usr/lib/spark\"".

Yes when you are using the Spark Standalone mode.

DEFINE_HADOOP_HOME_DIR

If you need to launch your Job from Windows, it is recommended to specify where the winutils.exe program to be used is stored.

If you know where to find your winutils.exe file and you want to use it, enter "true"; otherwise, enter "false".

Yes when you are using a distribution that is not running on cloud.

HADOOP_HOME_DIR

Enter the directory where your winutils.exe is stored, for example, "\"C:/Talend/winutils\"".

Yes when you have entered "true" for DEFINE_HADOOP_HOME_DIR.

DEFINE_SPARK_DRIVER_HOST

In the Yarn client mode of Spark, if the Spark cluster cannot recognize by itself the machine in which the Job is launched, enter "true"; otherwise, enter "false".

Yes when you are using a distribution that is not running on cloud and the Spark mode is Yarn client.

SPARK_DRIVER_HOST

Enter the host name or the IP address of this machine, for example, "\"127.0.0.1\"". This allows the Spark master and its workers to recognize this machine to find the Job and thus its driver.

Note that in this situation, you also need to add the name and the IP address of this machine to its host file.

Yes when you have entered "true" for DEFINE_SPARK_DRIVER_HOST.

GOOGLE_PROJECT_ID

Enter the ID of your Google Cloud Platform project.

For example, "\"my-google-project\"".

Yes when you are using the related distribution.

GOOGLE_CLUSTER_ID

Enter the ID of your Dataproc cluster to be used.

For example, "\"my-cluster-id\"".

Yes when you are using the related distribution.

GOOGLE_REGION

Enter the geographic zones in which the computing resources are used and your data is stored and processed. If you do not need to specify a particular region, enter "\"global\"".

Yes when you are using the related distribution.

GOOGLE_JARS_BUCKET

As a Talend Job expects its dependent jar files for execution, specify the Google Storage directory to which these jar files are transferred so that your Job can access these files at execution.

The directory to be entered must end with a slash (/). If the directory does not exist, it is created on the fly, but the bucket to be used must already exist.

For example, "\"gs://my-bucket/talend/jars/\"".

Yes when you are using the related distribution.

DEFINE_PATH_TO_GOOGLE_CREDENTIALS

When you launch your Job from a given machine in which Google Cloud SDK has been installed and authorized to use your user account credentials to access Google Cloud Platform, enter "false". In this situation, this machine is often your local machine.

When you launch your Job from a remote machine, such as a Jobserver, enter "true".

Yes when you are using the related distribution.

PATH_TO_GOOGLE_CREDENTIALS

Enter the path to the JSON file that stores your Google credentials on the remote machine. Very often, this machine is the Jobserver.

For example, "\"/user/ychen/my_credentials.json\"".

Yes when you have entered "true" for DEFINE_PATH_TO_GOOGLE_CREDENTIALS.

ALTUS_SET_CREDENTIALS

If you want to provide the Altus credentials with your Job, enter "true".

If you want to provide the Altus credentials separately, for example manually using the command altus configure in your terminal, enter "false".

Yes when you are using the related distribution.

ALTUS_ACCESS_KEY and ALTUS_SECRET_KEY

Enter your Altus access key and the directory pointing to your Altus secret key file.

For example, "\"my_access_key\"" and "\"/user/ychen/my_secret_key_file\"".

Yes when you have entered "true" for ALTUS_SET_CREDENTIALS.

ALTUS_CLI_PATH

Enter the path to the Cloudera Altus client, which must have been installed and activated in the machine in which your Job is executed. In production environments, this machine is typically a Talend Jobserver.

For example, "\"/opt/altuscli/altusclienv/bin/altus\"".

Yes when you are using the related distribution.

ALTUS_REUSE_CLUSTER

Enter "true" to use a Cloudera Altus cluster already existing in your Cloud service. Otherwise, enter "false" to allow the Job to create a cluster on the fly.

Yes when you are using the related distribution.

ALTUS_CLUSTER_NAME

Enter the name of the cluster to be used.

For example, "\"talend-altus-cluster\"".

Yes when you are using the related distribution.

ALTUS_ENVIRONMENT_NAME

Enter the name of the Cloudera Altus environment to be used to describe the resources allocated to the given cluster.

For example, "\"talend-altus-cluster\"".

Yes when you are using the related distribution.

ALTUS_CLOUD_PROVIDER

Enter the Cloud service that runs your Cloudera Altus cluster. Currently, only AWS is supported, so enter "\"AWS\"".

Yes when you are using the related distribution.

ALTUS_DELETE_AFTER_EXECUTION

Enter "true" if you want to remove the given cluster after the execution of your Job. Otherwise, enter "false".

Yes when you are using the related distribution.

ALTUS_S3_ACCESS_KEY and ALTUS_S3_SECRET_KEY

Enter the authentication information required to connect to the Amazon S3 bucket to be used.

Yes when you have entered "\"AWS\"" for ALTUS_CLOUD_PROVIDER.

ALTUS_S3_REGION

Enter the AWS region to be used, for example, "\"us-east-1\"".

Yes when you have entered "\"AWS\"" for ALTUS_CLOUD_PROVIDER.

ALTUS_BUCKET_NAME

Enter the name of the bucket to be used to store the dependencies of your Job, for example, "\"my-bucket\"". This bucket must already exist.

Yes when you have entered "\"AWS\"" for ALTUS_CLOUD_PROVIDER.

ALTUS_JARS_BUCKET

Enter the directory in which you want to store the dependencies of your Job in this given bucket, for example, "\"altus/jobjar\"". This directory is created if it does not exist at runtime.

Yes when you have entered "\"AWS\"" for ALTUS_CLOUD_PROVIDER.

ALTUS_USE_CUSTOM_JSON

Enter "true" if you need to manually edit JSON code to configure your Altus cluster. Otherwise, enter "false".

Yes when you are using the related distribution.

ALTUS_CUSTOM_JSON

Enter your custom JSON code, for example, "{my_json_code}".

Yes when you have entered "true" for ALTUS_USE_CUSTOM_JSON.

ALTUS_INSTANCE_TYPE

Enter the instance type for the instances in the cluster. All nodes that are deployed in this cluster use the same instance type. For example, "\"c4.2xlarge\"".

Yes when you are using the related distribution.

ALTUS_WORKER_NODE

Enter the number of worker nodes to be created for the cluster.

For example, "\"10\"".

Yes when you are using the related distribution.

ALTUS_CLOUDERA_MANAGER_USERNAME

Enter the user name used to authenticate to your Cloudera Manager service.

For example, "\"altus\"".

Yes when you are using the related distribution.

SPARK_SCRATCH_DIR

Enter the directory in the local system in which to store temporary files, such as the Job dependencies to be transferred, for example, "\"/tmp\"".

Yes.

STREAMING_BATCH_SIZE

Enter the time interval (ms) at the end of which the Job reviews the source data to identify changes and processes the new micro batches, for example, "1000".

Yes when you are developing a Spark Streaming Job.

DEFINE_DURATION

If you need to define a streaming timeout (ms), enter "true". Otherwise, enter "false".

Yes when you are developing a Spark Streaming Job.

STREAMING_DURATION

Enter the time frame (ms) at the end of which the streaming Job automatically stops running, for example, "10000".

Yes when you have entered "true" for DEFINE_DURATION.

SPARK_ADVANCED_PROPERTIES

Enter the code to use other Hadoop or Spark related properties.

For example:
{
PROPERTY : "\"spark.yarn.am.extraJavaOptions\"",
VALUE : "\"-Dhdp.version=2.4.0.0-169\"",
BUILDIN : "TRUE"
}

No.
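
As an illustration of the connection properties above, the following is a minimal sketch for a cluster running in the Spark Standalone mode that is not hosted on cloud. It assumes the NAME : "value" syntax of the SPARK_ADVANCED_PROPERTIES example and the nesting of addElementParameters {} in addParameters {} described at the top of this page. All values are the sample values from the table and must be replaced with those of your own cluster.

```
addParameters {
    addElementParameters {
        SPARK_HOST : "\"spark://localhost:7077\"",
        SPARK_HOME : "\"/usr/lib/spark\"",
        DEFINE_HADOOP_HOME_DIR : "false",
        SPARK_SCRATCH_DIR : "\"/tmp\""
    }
}
```

In the Yarn client mode, SPARK_HOST and SPARK_HOME would give way to RESOURCE_MANAGER and to the SET_SCHEDULER_ADDRESS, SET_JOBHISTORY_ADDRESS and SET_STAGING_DIRECTORY switches, each followed by its address or directory property when set to "true".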

Properties relevant to defining the security configuration:

Function/parameter Description Mandatory?

USE_KRB

Enter "true" if the cluster to be used is secured with Kerberos. Otherwise, enter "false".

For more information about the distribution versions for which Talend provides the support for Kerberos, see Supported Hadoop distribution versions for Talend Jobs.

Yes

RESOURCEMANAGER_PRINCIPAL

Enter the Kerberos principal names for the ResourceManager service, for example, "\"yarn/_HOST@EXAMPLE.COM\"".

Yes when you are using Kerberos and the Yarn client mode.

JOBHISTORY_PRINCIPAL

Enter the Kerberos principal names for the JobHistory service, for example, "\"mapred/_HOST@EXAMPLE.COM\"".

Yes when you are using Kerberos and the Yarn client mode.

USE_KEYTAB

If you need to use a Kerberos keytab file to log in, enter "true". Otherwise, enter "false".

Yes when you are using Kerberos.

PRINCIPAL

Enter the principal to be used, for example, "\"hdfs\"".

Yes when you are using a Kerberos keytab file.

KEYTAB_PATH

Enter the access path to the keytab file itself. This keytab file must be stored in the machine in which your Job actually runs, for example, on a Talend Jobserver.

For example, "\"/tmp/hdfs.headless.keytab\"".

Yes when you are using a Kerberos keytab file.

USERNAME

Enter the login user name for your distribution. If you leave it empty, that is to say "\"\"", the user name of the machine in which your Job actually runs will be used.

Yes when you are not using Kerberos.

USE_MAPRTICKET

If the MapR cluster to be used is secured with the MapR ticket authentication mechanism, enter "true". Otherwise, enter "false".

Yes when you are using a MapR cluster.

MAPRTICKET_PASSWORD

Enter the password to be used to log into MapR, for example, "my_password".

Yes when you are not using Kerberos but are using the MapR ticket authentication mechanism.

MAPRTICKET_CLUSTER

Enter the name of the MapR cluster you want to connect to, for example, "\"demo.mapr.com\"".

Yes when you are using the MapR ticket authentication mechanism.

MAPRTICKET_DURATION

Enter the length of time (in seconds) during which the ticket is valid, for example, "86400L".

Yes when you are using the MapR ticket authentication mechanism.

SET_MAPR_HOME_DIR

If the location of the MapR configuration files has been changed to somewhere else in the cluster, that is to say, the MapR Home directory has been changed, enter "true". Otherwise, enter "false".

Yes when you are using the MapR ticket authentication mechanism.

MAPR_HOME_DIR

Enter the new Home directory, for example, "\"/opt/mapr/custom/\"".

Yes when you have entered "true" for SET_MAPR_HOME_DIR.

SET_HADOOP_LOGIN

If the login module to be used has been changed in the MapR security configuration file, mapr.login.conf, enter "true". Otherwise, enter "false".

Yes when you are using the MapR ticket authentication mechanism.

HADOOP_LOGIN

Enter the module to be called from the mapr.login.conf file, for example, "\"kerberos\"", which calls the hadoop_kerberos module.

Yes when you have entered "true" for SET_HADOOP_LOGIN.
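
The security properties above can be sketched as follows for a Kerberos-secured cluster in the Yarn client mode, with login through a keytab file. The fragment assumes the NAME : "value" syntax of the SPARK_ADVANCED_PROPERTIES example and the nesting of addElementParameters {} in addParameters {}. The principals and the keytab path are the sample values from the table, not values that exist on your cluster.

```
addParameters {
    addElementParameters {
        USE_KRB : "true",
        RESOURCEMANAGER_PRINCIPAL : "\"yarn/_HOST@EXAMPLE.COM\"",
        JOBHISTORY_PRINCIPAL : "\"mapred/_HOST@EXAMPLE.COM\"",
        USE_KEYTAB : "true",
        PRINCIPAL : "\"hdfs\"",
        KEYTAB_PATH : "\"/tmp/hdfs.headless.keytab\""
    }
}
```

When Kerberos is not used, USE_KRB would be "false" and USERNAME would be set instead; the MapR ticket properties apply only to MapR clusters.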

Properties relevant to tuning Spark:

Function/parameter Description Mandatory?

ADVANCED_SETTINGS_CHECK

Enter "true" if you need to optimize the allocation of the resources to be used to run your Jobs. Otherwise, enter "false".

Yes.

SPARK_DRIVER_MEM and SPARK_DRIVER_CORES

Enter the allocation size of memory and the number of cores to be used by the driver of the current Job, for example, "\"512m\"" for memory and "\"1\"" for the number of cores.

Yes when you are tuning Spark in the Standalone mode.

SPARK_YARN_AM_SETTINGS_CHECK

Enter "true" to define the ApplicationMaster tuning properties of your Yarn cluster. Otherwise, enter "false".

Yes when you are tuning Spark in the Yarn client mode.

SPARK_YARN_AM_MEM and SPARK_YARN_AM_CORES

Enter the allocation size of memory and the number of cores to be used by the ApplicationMaster, for example, "\"512m\"" for memory and "\"1\"" for the number of cores.

Yes when you have entered "true" for SPARK_YARN_AM_SETTINGS_CHECK.

SPARK_EXECUTOR_MEM

Enter the allocation size of memory to be used by each Spark executor, for example, "\"512m\"".

Yes when you are tuning Spark.

SET_SPARK_EXECUTOR_MEM_OVERHEAD

Enter "true" if you need to allocate the amount of off-heap memory (in MB) per executor. Otherwise, enter "false".

Yes when you are tuning Spark in the Yarn client mode.

SPARK_EXECUTOR_MEM_OVERHEAD

Enter the amount of off-heap memory (in MB) to be allocated per executor.

Yes when you have entered "true" for SET_SPARK_EXECUTOR_MEM_OVERHEAD.

SPARK_EXECUTOR_CORES_CHECK

If you need to define the number of cores to be used by each executor, enter "true". Otherwise, enter "false".

Yes when you are tuning Spark.

SPARK_EXECUTOR_CORES

Enter the number of cores to be used by each executor, for example, "\"1\"".

Yes when you have entered "true" for SPARK_EXECUTOR_CORES_CHECK.

SPARK_YARN_ALLOC_TYPE

Enter how you want Yarn to allocate resources among executors.

Enter one of the following values:
  • "AUTO": means to let Yarn use its default number of executors. This number is 2.

  • "FIXED": means to define the number of executors to be used with SPARK_EXECUTOR_INSTANCES.

  • "DYNAMIC": means to let Yarn adapt the number of executors to suit the workload. Then you need to define SPARK_YARN_DYN_INIT, SPARK_YARN_DYN_MIN and SPARK_YARN_DYN_MAX.

Yes when you are tuning Spark in the Yarn client mode.

SPARK_EXECUTOR_INSTANCES

Enter the number of executors to be used by Yarn, for example, "\"2\"".

Yes when you have entered "FIXED" for SPARK_YARN_ALLOC_TYPE.

SPARK_YARN_DYN_INIT, SPARK_YARN_DYN_MIN and SPARK_YARN_DYN_MAX

Define the scale of the dynamic allocation by defining these three properties. For example, "\"1\"" as the initial number of executors, "\"0\"" as the minimum number and "\"MAX\"" as the maximum number.

Yes when you have entered "DYNAMIC" for SPARK_YARN_ALLOC_TYPE.

WEB_UI_PORT_CHECK

If you need to change the default port of the Spark Web UI, enter "true". Otherwise, enter "false".

Yes when you are tuning Spark.

WEB_UI_PORT

Enter the port number you want to use for the Spark Web UI, for example, "\"4040\"".

Yes when you have entered "true" for WEB_UI_PORT_CHECK.

SPARK_BROADCAST_FACTORY

Enter the broadcast implementation to be used to cache variables on each worker machine.

Enter one of the following values:
  • "AUTO"

  • "TORRENT"

  • "HTTP"

Yes when you are tuning Spark.

CUSTOMIZE_SPARK_SERIALIZER

If you need to import an external Spark serializer, enter "true". Otherwise, enter "false".

Yes when you are tuning Spark.

SPARK_SERIALIZER

Enter the fully qualified class name of the serializer to be used, for example, "\"org.apache.spark.serializer.KryoSerializer\"".

Yes when you have entered "true" for CUSTOMIZE_SPARK_SERIALIZER.

ENABLE_BACKPRESSURE

If you need to enable the backpressure feature of Spark, enter "true". Otherwise, enter "false".

The backpressure feature is available in Spark version 1.5 and onwards. With backpressure enabled, Spark automatically finds the optimal receiving rate and dynamically adapts the rate based on current batch scheduling delays and processing times, in order to receive data only as fast as it can process it.

Yes when you are tuning Spark for a Spark Streaming Job.
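
A tuning block for a Spark Batch Job in the Yarn client mode might look like the following sketch, which uses dynamic allocation and a custom serializer. It assumes the NAME : "value" syntax of the SPARK_ADVANCED_PROPERTIES example and the nesting of addElementParameters {} in addParameters {}. The sizes and counts are the sample values from the table and are not tuning advice.

```
addParameters {
    addElementParameters {
        ADVANCED_SETTINGS_CHECK : "true",
        SPARK_YARN_AM_SETTINGS_CHECK : "true",
        SPARK_YARN_AM_MEM : "\"512m\"",
        SPARK_YARN_AM_CORES : "\"1\"",
        SPARK_EXECUTOR_MEM : "\"512m\"",
        SET_SPARK_EXECUTOR_MEM_OVERHEAD : "false",
        SPARK_EXECUTOR_CORES_CHECK : "true",
        SPARK_EXECUTOR_CORES : "\"1\"",
        SPARK_YARN_ALLOC_TYPE : "DYNAMIC",
        SPARK_YARN_DYN_INIT : "\"1\"",
        SPARK_YARN_DYN_MIN : "\"0\"",
        SPARK_YARN_DYN_MAX : "\"MAX\"",
        WEB_UI_PORT_CHECK : "false",
        SPARK_BROADCAST_FACTORY : "AUTO",
        CUSTOMIZE_SPARK_SERIALIZER : "true",
        SPARK_SERIALIZER : "\"org.apache.spark.serializer.KryoSerializer\""
    }
}
```

With "FIXED" as SPARK_YARN_ALLOC_TYPE, the three SPARK_YARN_DYN_* properties would be replaced by SPARK_EXECUTOR_INSTANCES. SPARK_DRIVER_MEM and SPARK_DRIVER_CORES are left out because they apply to the Standalone mode, and ENABLE_BACKPRESSURE because it applies to Spark Streaming Jobs.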

Properties relevant to logging the execution of your Jobs:

Function/parameter Description Mandatory?

ENABLE_SPARK_EVENT_LOGGING

Enter "true" if you need the Spark application logs of this Job to be persisted in the file system of your Yarn cluster. Otherwise, enter "false".

Yes when you are using Spark in the Yarn client mode.

COMPRESS_SPARK_EVENT_LOGS

If you need to compress the logs, enter "true". Otherwise, enter "false".

Yes when you have entered "true" for ENABLE_SPARK_EVENT_LOGGING.

SPARK_EVENT_LOG_DIR

Enter the directory in which Spark events are logged, for example, "\"hdfs://namenode:8020/user/spark/applicationHistory\"".

Yes when you have entered "true" for ENABLE_SPARK_EVENT_LOGGING.

SPARKHISTORY_ADDRESS

Enter the location of the history server, for example, "\"sparkHistoryServer:18080\"".

Yes when you have entered "true" for ENABLE_SPARK_EVENT_LOGGING.

USE_CHECKPOINT

If you need the Job to be resilient to failure, enter "true" to enable the Spark checkpointing operation. Otherwise, enter "false".

Yes.

CHECKPOINT_DIR

Enter the directory in which Spark stores, in the file system of the cluster, the context data of the computations such as the metadata and the generated RDDs of this computation. For example, "\"file:///tmp/mycheckpoint\"".

Yes when you have entered "true" for USE_CHECKPOINT.
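
The logging and checkpointing properties above can be sketched together as follows for a Job in the Yarn client mode. The fragment assumes the NAME : "value" syntax of the SPARK_ADVANCED_PROPERTIES example and the nesting of addElementParameters {} in addParameters {}. The directories and the history server address are the sample values from the table.

```
addParameters {
    addElementParameters {
        ENABLE_SPARK_EVENT_LOGGING : "true",
        COMPRESS_SPARK_EVENT_LOGS : "false",
        SPARK_EVENT_LOG_DIR : "\"hdfs://namenode:8020/user/spark/applicationHistory\"",
        SPARKHISTORY_ADDRESS : "\"sparkHistoryServer:18080\"",
        USE_CHECKPOINT : "true",
        CHECKPOINT_DIR : "\"file:///tmp/mycheckpoint\""
    }
}
```

Setting ENABLE_SPARK_EVENT_LOGGING or USE_CHECKPOINT to "false" would make the properties that follow each switch unnecessary.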

Properties relevant to configuring Cloudera Navigator:

If you are using Cloudera V5.5+ to run your MapReduce or Apache Spark Batch Jobs, you can use Cloudera Navigator to trace the lineage of a given data flow and discover how this data flow was generated by a Job.

For more information about the Cloudera Navigator versions supported by Talend, see Supported Hadoop distribution versions for Talend Jobs.

Function/parameter Description Mandatory?

USE_CLOUDERA_NAVIGATOR

Enter "true" if you want to use Cloudera Navigator. Otherwise, enter "false".

Yes when you are using Spark on Cloudera.

CLOUDERA_NAVIGATOR_USERNAME and CLOUDERA_NAVIGATOR_PASSWORD

Enter the credentials you use to connect to your Cloudera Navigator. For example, "\"username\"" as username and "password" as password.

Yes when you have entered "true" for USE_CLOUDERA_NAVIGATOR.

CLOUDERA_NAVIGATOR_URL

Enter the location of the Cloudera Navigator to connect to, for example, "\"http://localhost:7187/api/v8/\"".

Yes when you have entered "true" for USE_CLOUDERA_NAVIGATOR.

CLOUDERA_NAVIGATOR_METADATA_URL

Enter the location of the Navigator Metadata, for example, "\"http://localhost:7187/api/v8/metadata/plugin\"".

Yes when you have entered "true" for USE_CLOUDERA_NAVIGATOR.

CLOUDERA_NAVIGATOR_CLIENT_URL

Enter the location of the Navigator client, for example, "\"http://localhost\"".

Yes when you have entered "true" for USE_CLOUDERA_NAVIGATOR.

CLOUDERA_NAVIGATOR_AUTOCOMMIT

If you want to make Cloudera Navigator generate the lineage of the current Job at the end of the execution of your Job, enter "true". Otherwise, enter "false".

Yes when you have entered "true" for USE_CLOUDERA_NAVIGATOR.

CLOUDERA_NAVIGATOR_DISABLE_SSL_VALIDATION

If you do not want to use the SSL validation process when your Job connects to Cloudera Navigator, enter "true". Otherwise, enter "false".

Yes when you have entered "true" for USE_CLOUDERA_NAVIGATOR.

CLOUDERA_NAVIGATOR_DIE_ON_ERROR

If you want to stop the execution of the Job when the connection to your Cloudera Navigator fails, enter "true". Otherwise, enter "false".

Yes when you have entered "true" for USE_CLOUDERA_NAVIGATOR.
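
The Cloudera Navigator properties above might be combined as in the following sketch. It assumes the NAME : "value" syntax of the SPARK_ADVANCED_PROPERTIES example and the nesting of addElementParameters {} in addParameters {}. The credentials and URLs are the placeholder values from the table; the three trailing switches are set to "false" only for illustration.

```
addParameters {
    addElementParameters {
        USE_CLOUDERA_NAVIGATOR : "true",
        CLOUDERA_NAVIGATOR_USERNAME : "\"username\"",
        CLOUDERA_NAVIGATOR_PASSWORD : "password",
        CLOUDERA_NAVIGATOR_URL : "\"http://localhost:7187/api/v8/\"",
        CLOUDERA_NAVIGATOR_METADATA_URL : "\"http://localhost:7187/api/v8/metadata/plugin\"",
        CLOUDERA_NAVIGATOR_CLIENT_URL : "\"http://localhost\"",
        CLOUDERA_NAVIGATOR_AUTOCOMMIT : "false",
        CLOUDERA_NAVIGATOR_DISABLE_SSL_VALIDATION : "false",
        CLOUDERA_NAVIGATOR_DIE_ON_ERROR : "false"
    }
}
```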

Properties relevant to configuring Hortonworks Atlas:

If you are using Hortonworks Data Platform V2.4 onwards to run your Spark Batch Jobs and Apache Atlas has been installed in your Hortonworks cluster, you can use Atlas to trace the lineage of a given data flow and discover how this data flow was generated by a Job.

Function/parameter Description Mandatory?

USE_ATLAS

Enter "true" if you want to use Atlas. Otherwise, enter "false".

Yes when you are using Spark on Hortonworks.

ATLAS_USERNAME and ATLAS_PASSWORD

Enter the credentials you use to connect to your Atlas. For example, "\"username\"" as username and "password" as password.

Yes when you have entered "true" for USE_ATLAS.

ATLAS_URL

Enter the location of the Atlas to connect to, for example, "\"http://localhost:21000\"".

Yes when you have entered "true" for USE_ATLAS.

SET_ATLAS_APPLICATION_PROPERTIES

If your Atlas cluster contains custom properties such as SSL or read timeout, enter "true". Otherwise, enter "false".

Yes when you have entered "true" for USE_ATLAS.

ATLAS_APPLICATION_PROPERTIES

Enter a directory in your local machine, then place the atlas-application.properties file of your Atlas in this directory, for example, "\"/user/atlas/atlas-application.properties\"".

This way, your Job is enabled to use these custom properties.

Yes when you have entered "true" for SET_ATLAS_APPLICATION_PROPERTIES.

ATLAS_DIE_ON_ERROR

If you want to stop the Job execution when Atlas-related issues occur, enter "true". Otherwise, enter "false".

Yes when you have entered "true" for USE_ATLAS.
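
The Atlas properties above can be sketched as follows for a Job that connects to Atlas without custom application properties. The fragment assumes the NAME : "value" syntax of the SPARK_ADVANCED_PROPERTIES example and the nesting of addElementParameters {} in addParameters {}. The credentials and the URL are the placeholder values from the table.

```
addParameters {
    addElementParameters {
        USE_ATLAS : "true",
        ATLAS_USERNAME : "\"username\"",
        ATLAS_PASSWORD : "password",
        ATLAS_URL : "\"http://localhost:21000\"",
        SET_ATLAS_APPLICATION_PROPERTIES : "false",
        ATLAS_DIE_ON_ERROR : "false"
    }
}
```

If SET_ATLAS_APPLICATION_PROPERTIES were "true", ATLAS_APPLICATION_PROPERTIES would also have to be set, for example to "\"/user/atlas/atlas-application.properties\"".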