Running a Job on Spark or YARN in Talend Studio - 8.0

Version
8.0
Language
English
Product
Talend Big Data
Talend Big Data Platform
Talend Data Fabric
Talend Open Studio for Big Data
Talend Real-Time Big Data Platform
Module
Talend Studio
Content
Design and Development > Designing Jobs > Job Frameworks > Spark Batch
Last publication date
2023-09-14

Running a Job on Spark or YARN

In this tutorial, you create a Big Data Batch Job that runs on Spark or YARN and reads data from HDFS.

Creating a Talend Studio project

Creating a project is the first step to using Talend Studio. Projects allow you to better organize your work.

Procedure

  1. Select Create a new project.
  2. Enter a name for your project.

    Example

    TalendDemo
  3. Click Create.
  4. Click Finish.

Results

Your project opens. You are ready to work in Talend Studio.

Creating a Big Data Batch Job to use Spark or YARN

For Big Data processing, Talend Studio allows you to create Batch Jobs and Streaming Jobs running on Spark or MapReduce.

Before you begin

Select the Integration perspective (Window > Perspective > Integration).

Procedure

  1. In the Repository, right-click Job Designs, then click Create Big Data Batch Job.
  2. In the Name field, enter a name.

    Example

    ReadHDFS_Spark_or_YARN
  3. Select a Framework.
    • Spark
    • MapReduce (deprecated)
  4. Optional: In the Purpose field, enter a purpose.

    Example

    Read and sort customer data
  5. Optional: In the Description field, enter a description.

    Example

    Read and sort customer data stored in HDFS from a Big Data Batch Job running on Spark or YARN
    Tip: Enter a Purpose and Description to stay organized.
  6. Click Finish.

Results

The Designer opens an empty Job.

Running a Job on Spark

In this tutorial, you learn how to run a Talend Studio Job on Spark.

Configuring an HDFS connection to run on Spark

Using the tHDFSConfiguration component, you can connect your HDFS filesystem to Spark.

Procedure

  1. In the Repository, expand Metadata > Hadoop Cluster, then expand the Hadoop cluster metadata of your choice.
    1. Expand the HDFS folder of your Hadoop cluster metadata.
    2. Drag-and-drop the HDFS metadata onto the Designer.
    3. Select a tHDFSConfiguration component.
      The Hadoop Configuration Update Confirmation window opens.
  2. Click OK.

Results

Talend Studio updates the Spark configuration so that it corresponds to your cluster metadata.

What to do next

In the Run view, click Spark Configuration. The execution is configured with the HDFS connection metadata.
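The connection metadata that tHDFSConfiguration supplies boils down to a few values: the NameNode host and port, the Hadoop user, and the resulting hdfs:// URI scheme for file paths. A minimal sketch of how those pieces fit together; the host, port, and user below are hypothetical placeholders, not values from this tutorial:

```python
# Sketch of the HDFS connection details a tHDFSConfiguration component
# typically carries. Host, port, and user are hypothetical examples.
hdfs_conn = {
    "namenode_host": "namenode.example.com",  # hypothetical NameNode host
    "namenode_port": 8020,                    # common NameNode RPC port
    "user": "talend",                         # hypothetical Hadoop user
}

def hdfs_uri(conn: dict, path: str) -> str:
    """Build the fully qualified hdfs:// URI for a file path."""
    return f"hdfs://{conn['namenode_host']}:{conn['namenode_port']}{path}"

print(hdfs_uri(hdfs_conn, "/user/talend/customers.csv"))
```

In the Job itself you never assemble this URI by hand; the point is only that the Spark Configuration tab is populated with exactly this kind of cluster information from your HDFS metadata.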

Reading data from an HDFS connection on Spark

Using predefined HDFS metadata, you can read data from an HDFS filesystem on Spark.

Procedure

  1. In the Designer, add an input component.

    Example

    Add a tFileInputDelimited component.
  2. Double-click the component.
    Your component is configured with the tHDFSConfiguration component information, under Storage.
  3. Click the […] button next to Edit schema.
  4. Click the plus button to add a data column.

    Example

    1. CustomerID
    2. FirstName
    3. LastName
  5. Select a Type for each column.

    Example

    For CustomerID, select the Integer Type.
  6. Click OK.
  7. In the File Name field, enter the file path and name of your choice.

Results

The tFileInputDelimited component is now configured to read data from HDFS on Spark.
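Conceptually, what tFileInputDelimited does is parse each delimited row against the three-column schema defined above, casting CustomerID to an integer. A plain-Python sketch of that parsing step, using invented sample rows in place of the HDFS file (a real Job reads the file through Spark):

```python
import csv
import io

# Invented sample data standing in for the delimited file on HDFS.
raw = "1;Alice;Smith\n2;Bob;Jones\n"

def read_customers(text: str, delimiter: str = ";") -> list[dict]:
    """Parse rows into the schema: CustomerID (Integer), FirstName, LastName."""
    rows = []
    for rec in csv.reader(io.StringIO(text), delimiter=delimiter):
        rows.append({
            "CustomerID": int(rec[0]),  # Integer Type, as set in the schema editor
            "FirstName": rec[1],
            "LastName": rec[2],
        })
    return rows

customers = read_customers(raw)
```

The delimiter is an assumption here; in the component, the field separator is a configurable setting alongside the File Name.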

Running a Job on YARN

In this tutorial, you learn how to run a Talend Studio Job on YARN.

Configuring an HDFS connection to run on YARN

Using the tHDFSConfiguration component, you can connect your HDFS filesystem to YARN.

Procedure

  1. In the Repository, expand Metadata > Hadoop Cluster, then expand the Hadoop cluster metadata of your choice.
    1. Expand the HDFS folder of your Hadoop cluster metadata.
    2. Drag-and-drop the HDFS metadata onto the Designer.
    3. Select an input component.

    Example

    Select a tFileInputDelimited component.
    The Hadoop Configuration Update Confirmation window opens.
  2. Click OK.

Results

Talend Studio updates the YARN configuration so that it corresponds to your cluster metadata.

What to do next

In the Run view, click Hadoop Configuration. The execution is configured with the HDFS connection metadata.

Reading data from an HDFS connection on YARN

Using predefined HDFS metadata, you can read data from an HDFS filesystem on YARN.

Procedure

  1. Double-click your input component.
    Your component is configured with the HDFS metadata information.
  2. Click the […] button next to Edit schema.
  3. Click the plus button to add a data column.

    Example

    1. CustomerID
    2. FirstName
    3. LastName
  4. Select a Type for each column.

    Example

    For CustomerID, select the Integer Type.
  5. Click OK.
  6. In the File Name field, enter the file path and name of your choice.

Results

The tFileInputDelimited component is now configured to read data from HDFS on YARN.
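The example purpose given for this Job was to read and sort customer data. Reading is covered above; the remaining sort step would typically be handled by a component such as tSortRow wired after the input component. A minimal sketch of that step, using invented rows and an assumed sort key of LastName:

```python
# Invented rows standing in for the output of the input component.
customers = [
    {"CustomerID": 2, "FirstName": "Bob", "LastName": "Jones"},
    {"CustomerID": 1, "FirstName": "Alice", "LastName": "Smith"},
]

# Sort by LastName; in a Job, the sort column and order are settings
# on the sorting component rather than code.
sorted_customers = sorted(customers, key=lambda row: row["LastName"])
```

On Spark or YARN the sort runs distributed across the cluster, but the schema-level result is the same ordering shown here.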