How a Talend Job for Apache Spark works - 7.1

Talend Real-time Big Data Platform Studio User Guide

Talend Documentation Team
Talend Real-Time Big Data Platform
Design and Development
Talend Studio
Using the Spark-specific components, a Talend Spark Job makes use of the Spark framework to process RDDs (Resilient Distributed Datasets) on top of a given Spark cluster.

Depending on which framework you select for the Spark Job you are creating, this Talend Spark Job implements the Spark Streaming framework or the Spark framework when generating its code.

A Talend Spark Job can be run in any of the following modes:

  • Local: the Studio builds the Spark environment in itself at runtime to run the Job locally within the Studio. With this mode, each processor of the local machine is used as a Spark worker to perform the computations. This mode requires minimum parameters to be set in this configuration view.

    Note this local machine is the machine in which the Job is actually run.

  • Standalone: the Studio connects to a Spark-enabled cluster to run the Job from this cluster.

  • Yarn client: the Studio runs the Spark driver to orchestrate how the Job should be performed and then send the orchestration to the Yarn service of a given Hadoop cluster so that the Resource Manager of this Yarn service requests execution resources accordingly.