How a Talend Spark Job works

Talend Studio Developer Guide

EnrichVersion
Cloud
EnrichProdName
Talend Integration Cloud
task
Design and Development

Depending on which framework you select for the Spark Job you are creating, this Talend Spark Job implements the Spark Streaming framework or the Spark framework when generating its code.

A Talend Spark Job can be run in any of the following modes:

  • Local: the Studio builds the Spark environment in itself at runtime to run the Job locally within the Studio. With this mode, each processor of the local machine is used as a Spark worker to perform the computations. This mode requires minimum parameters to be set in this configuration view.

    Note this local machine is the machine in which the Job is actually run.

  • Standalone: the Studio connects to a Spark-enabled cluster to run the Job from this cluster.

  • Yarn client: the Studio runs the Spark driver to orchestrate how the Job should be performed and then send the orchestration to the Yarn service of a given Hadoop cluster so that the Resource Manager of this Yarn service requests execution resources accordingly.

In Talend Studio, you design a Spark Job using the dedicated Spark components and configure the connection to the cluster to be used. At runtime, this configuration allows the Studio to directly communicate with the cluster to perform the following operations:

  • submit the Spark Job to the Master server in the Standalone mode or to the ApplicationMaster server in the Yarn client mode of the cluster being used

  • and copy the related Job resources to the distributed file system of the same cluster. The cluster then completes the rest of the execution such as initializing the Job, generating Job ID and sending the execution progress information and the result back to the Studio.

Note that a Talend Spark Job is not equivalent to a Spark job explained in the Spark documentation from Apache. A Talend Spark Job generates one or several Spark jobs in terms of Apache Spark, depending on the way you design the Talend Job in the workspace of the Studio. For further information about what is a Spark job, see http://spark.apache.org/docs/latest/cluster-overview.html#glossary.

Each component in the Talend Spark Job knows how to generate the specific tasks to accomplish its mission (classes in the generated code), and how to join all the tasks together to form the Job. Each connection between components generates a given structure (an Avro record) that can contain data, and is compatible with Spark serialization. Components are optimized for their specific tasks. At runtime, these generated classes are shipped to the nodes to execute on data, the structures contain the data and are also shipped between nodes during shuffle phases, and the Talend Job itself coordinates the generated Spark jobs.

When you run the Job, some statistics information is displayed in the design workspace of the Job to indicate the progress of the Spark computations coordinated by the Job.

The following image presents an example of a Talend Spark Batch Job, in which the statistics information is displayed in green:

In this example, two Spark jobs, job 0 and job 1, are created and as you can read, are both 100% completed.

If you are running a Spark Streaming Job, the progress of the processing of the micro batches is displayed as below:

The execution information of a Talend Spark Job is logged by the HistoryServer service of the cluster be used. You can consult the web console of the service for that information. The name of the Job in the console is automatically constructed to be ProjectName_JobName_JobVersion, for example, LOCALPROJECT_wordcount_0.1.