Yarn client
|
Talend Studio runs the Spark driver to orchestrate how the Job should be performed and
then sends the orchestration to the Yarn service of a given Hadoop cluster so
that the Resource Manager of this Yarn service requests execution resources
accordingly.
If you are using the Yarn client
mode, you need to set the following parameters in their corresponding
fields (if you leave the check box of a service clear, then at runtime,
the configuration of this parameter in the Hadoop cluster to be used
will be ignored):
-
In the Resource manager
field, enter the address of the ResourceManager service of the Hadoop cluster to
be used.
-
Select the Set resourcemanager
scheduler address check box and enter the Scheduler address in
the field that appears.
-
Select the Set jobhistory
address check box and enter the location of the JobHistory
server of the Hadoop cluster to be used. This allows the metrics information of
the current Job to be stored in that JobHistory server.
-
Select the Set staging
directory check box and enter the directory defined in your
Hadoop cluster for temporary files created by running programs. Typically, this
directory can be found under the yarn.app.mapreduce.am.staging-dir property in the configuration files
such as yarn-site.xml or mapred-site.xml of your distribution.
-
If you are accessing the Hadoop cluster running with Kerberos security,
select this check box, then enter the Kerberos principal names for the Resource
Manager service and the Job History service in the displayed fields. This enables
you to use your username to authenticate against the credentials stored in Kerberos.
These principals can be found in the configuration files of your distribution, such
as in yarn-site.xml and in mapred-site.xml.
If you need to use a Kerberos keytab file to log in, select Use a keytab to authenticate. A keytab file contains
pairs of Kerberos principals and encrypted keys. You need to enter the principal to
be used in the Principal field and the access
path to the keytab file itself in the Keytab
field. This keytab file must be stored on the machine on which your Job actually
runs, for example, on a Talend JobServer.
Note that the user that executes a keytab-enabled Job is not necessarily
the one the principal designates but must have the right to read the keytab file being
used. For example, if the username you are using to execute a Job is user1 and the principal to be used is guest, ensure that user1 has the right to read the keytab
file to be used.
-
The User name field is available when you
are not using Kerberos to authenticate. In the User
name field, enter the login username for your distribution. If you leave
it empty, the username of the machine hosting Talend Studio will
be used.
-
If the Spark cluster cannot recognize the machine in which the Job is
launched, select the Define the driver hostname or IP
address check box and enter the host name or the IP address of
this machine. This allows the Spark master and its workers to recognize this
machine to find the Job and thus its driver.
Note that in this situation, you also need to add the name and the IP
address of this machine to its hosts file.
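To illustrate the keytab readability requirement described above, here is a minimal shell sketch. The keytab path is hypothetical, and a real keytab comes from your Kerberos administrator (for example, via ktutil); an empty placeholder file stands in for it here.

```shell
# Hypothetical example: check that the user executing the Job can read the keytab.
# A real keytab is produced by your Kerberos admin; a placeholder is used here.
KEYTAB="$(mktemp -d)/guest.keytab"
touch "$KEYTAB"
chmod 600 "$KEYTAB"   # typical restrictive permissions for a keytab

# Before launching a keytab-enabled Job, verify read access as the executing user:
if [ -r "$KEYTAB" ]; then
    echo "keytab readable: the Job can authenticate with it"
else
    echo "keytab NOT readable: grant read access to the executing user" >&2
fi
```

Run this as the same user that will execute the Job (for example, on the Talend JobServer machine), since read permission is evaluated per user.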
|
Yarn cluster
|
The Spark driver runs in your Yarn cluster to orchestrate how the Job
should be performed.
If you are using the Yarn cluster mode, you need
to define the following parameters in their corresponding fields (if you
leave the check box of a service clear, then at runtime, the
configuration of this parameter in the Hadoop cluster to be used will
be ignored):
-
In the Resource manager
field, enter the address of the ResourceManager service of the Hadoop cluster to
be used.
-
Select the Set resourcemanager
scheduler address check box and enter the Scheduler address in
the field that appears.
-
Select the Set jobhistory
address check box and enter the location of the JobHistory
server of the Hadoop cluster to be used. This allows the metrics information of
the current Job to be stored in that JobHistory server.
-
Select the Set staging
directory check box and enter the directory defined in your
Hadoop cluster for temporary files created by running programs. Typically, this
directory can be found under the yarn.app.mapreduce.am.staging-dir property in the configuration files
such as yarn-site.xml or mapred-site.xml of your distribution.
-
Set path to custom Hadoop
configuration JAR: if you are using
connections defined in the Repository to
connect to your Cloudera or Hortonworks cluster, you can
select this check box in the
Repository wizard and, in the
field that is displayed, specify the path to the JAR file
that provides the connection parameters of your Hadoop
environment. Note that this file must be accessible from the
machine where your Job is launched.
This kind of Hadoop configuration JAR file is
automatically generated when you build a big data Job from Talend Studio. By default, this JAR file is named with this
pattern: hadoop-conf-[name_of_the_metadata_in_the_repository]_[name_of_the_context].jar
You
can also download this JAR file from the web console of your cluster or
simply create a JAR file yourself by putting the configuration files at
the root of your JAR file, for example: hdfs-site.xml,
core-site.xml.
The parameters from your custom JAR file override the parameters
you put in the Spark configuration field.
They also override the configuration you set in the
configuration components such as
tHDFSConfiguration or
tHBaseConfiguration when the related
storage systems, such as HDFS, HBase, or Hive, are native to Hadoop.
However, they do not override the configuration set in the
configuration components for third-party storage systems such
as tAzureFSConfiguration.
-
If you are accessing the Hadoop cluster running with Kerberos security,
select this check box, then enter the Kerberos principal names for the Resource
Manager service and the Job History service in the displayed fields. This enables
you to use your username to authenticate against the credentials stored in Kerberos.
These principals can be found in the configuration files of your distribution, such
as in yarn-site.xml and in mapred-site.xml.
If you need to use a Kerberos keytab file to log in, select Use a keytab to authenticate. A keytab file contains
pairs of Kerberos principals and encrypted keys. You need to enter the principal to
be used in the Principal field and the access
path to the keytab file itself in the Keytab
field. This keytab file must be stored on the machine on which your Job actually
runs, for example, on a Talend JobServer.
Note that the user that executes a keytab-enabled Job is not necessarily
the one the principal designates but must have the right to read the keytab file being
used. For example, if the username you are using to execute a Job is user1 and the principal to be used is guest, ensure that user1 has the right to read the keytab
file to be used.
-
The User name field is available when you
are not using Kerberos to authenticate. In the User
name field, enter the login username for your distribution. If you leave
it empty, the username of the machine hosting Talend Studio will
be used.
-
Select the Wait for the Job to
complete check box to make Talend Studio or,
if you use Talend JobServer,
your Job JVM keep monitoring the Job until its execution is over. By
selecting this check box, you actually set the
spark.yarn.submit.waitAppCompletion property to true.
While it is generally useful to select this check box when running a Spark Batch Job,
it makes more sense to keep it clear when running a Spark Streaming
Job.
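The custom Hadoop configuration JAR described above can also be assembled by hand. The following is a minimal sketch: the file contents, cluster name, and context name are placeholders (on a real cluster you would copy the actual *-site.xml files, typically from /etc/hadoop/conf/), and a JAR is simply a zip archive with the configuration files at its root.

```shell
# Sketch: build a custom Hadoop configuration JAR by hand.
# Placeholder configuration files stand in for the real *-site.xml files.
workdir="$(mktemp -d)"
cd "$workdir"
printf '<configuration/>\n' > hdfs-site.xml
printf '<configuration/>\n' > core-site.xml

# The configuration files must sit at the root of the archive.
# 'jar cf' (from the JDK) is the usual tool; any zip tool produces an
# equivalent archive, so fall back to Python's zipfile module if needed.
if command -v jar >/dev/null 2>&1; then
    jar cf hadoop-conf-mycluster_Default.jar hdfs-site.xml core-site.xml
else
    python3 -m zipfile -c hadoop-conf-mycluster_Default.jar hdfs-site.xml core-site.xml
fi
ls -l hadoop-conf-mycluster_Default.jar
```

The archive name here mimics the hadoop-conf-[name_of_the_metadata_in_the_repository]_[name_of_the_context].jar pattern mentioned above, with "mycluster" and "Default" as hypothetical metadata and context names.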
|