You can start either from the Job Designs node of the
Repository tree view in the
Integration
perspective or from the Big Data Batch node under the
Job Designs node.
The two approaches are similar, so the following procedure shows how to create a
Spark Job from the Job Designs node.
Procedure
-
Right-click the Job Designs node and, in the contextual
menu, select Create Big Data Batch Job.
The New Big Data Batch Job wizard opens.
-
From the Framework drop-down list, select
Spark.
-
In the Name, Purpose and
Description fields, enter the descriptive information
for the Job. Of these, only the Job name is mandatory.
Once it is provided, the Finish button is activated.
-
If you need to change the Job version, click the M and
the m buttons next to the Version
field to make the changes.
If you need to change the Job status, select it from the drop-down list of the
Status field.
If you need to edit the information shown in the read-only fields, select
File > Edit Project
properties from the menu bar to open the Project
Settings dialog box and make the desired changes.
-
Click Finish to close the wizard and validate the
changes.
An empty Job opens in the workspace of the Studio and the components
available for Spark appear in the Palette.
Results
In the Repository tree view, the newly created Spark Job appears
automatically under the Big Data Batch node, itself under
Job Designs.
You then need to drop the components you want to use from the
Palette onto the workspace and link and configure them to
design a Spark Job, in the same way as for a standard Job. You also need to set up
the connection to the Spark cluster to be used, in the Spark
configuration tab of the Run view.
You can
repeat the same operations to create a Spark Streaming Job. The only differences
are that you select Create Big Data Streaming
Job from the contextual menu after right-clicking the Job
Designs node, and that you then select Spark Streaming from the
Framework drop-down list in the New Big Data
Streaming Job wizard that is displayed.
After creating your Spark Job, you can reduce the time the Job spends at
runtime by using the lightweight dependencies option. This option reduces the
libraries used to the Talend libraries only and thus affects how the Job runs.
All the dependencies remain, but they are not sent to the cluster at runtime. This can
prevent issues such as dependency conflicts, missing signatures, wrong JAR versions, or
missing JARs. In the
Run view, click the
Spark Configuration tab and select the
Use lightweight dependencies check box. You can also use a
classpath different from the Cloudera default one by selecting the
Use custom classpath check box and entering the JARs you want to use
as comma-separated regular expressions. This option is available for the Amazon EMR 6.2 and
Cloudera CDH 6.1 distributions.
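For example, assuming hypothetical JAR locations on a cluster, the Use custom classpath field could contain a value such as the following, where each comma-separated entry is a regular expression matching the JARs to include:
/usr/lib/spark/jars/.*,/usr/lib/hadoop/client/.*,/opt/mycompany/lib/custom-udfs-.*\.jar
These paths and JAR names are only illustrative; adapt them to the actual locations used by your own cluster.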
Note that if you need to run your Spark Job in a mode other than the
Local mode and with a distribution other than the
Universal distribution, a Storage component, typically a
tHDFSConfiguration component, is required in the same Job
so that Spark can use this component to connect to the file system to which the JAR
files the Job depends on are transferred.
You can also create these types of Jobs by writing their Job scripts in the Jobscript view
and then generating the Jobs accordingly. For more information on using Job scripts, see
the Talend Job
scripts reference guide at https://help.talend.com/.