First steps with Big Data in Talend Studio

In this tutorial, learn how to take your first steps with Big Data in Talend Studio.

This tutorial requires access to a Hadoop cluster.

Creating a Talend Studio project

Creating a project is the first step in using Talend Studio. Projects help you organize your work.

Procedure

  1. Select Create a new project.
  2. Enter a name for your project.

    Example

    TalendDemo
  3. Click Create.
  4. Click Finish.

Results

Your project opens. You are ready to work in Talend Studio.

Creating a Job to use a Hadoop cluster connection

Talend Studio projects contain Jobs. In a Job, you build a workflow from components, each of which performs a specific action.

Before you begin

Select the Integration perspective (Window > Perspective > Integration).

Procedure

  1. In the Repository, right-click Job Designs.
    1. Click Create Standard Job.
  2. In the Name field, enter a name.

    Example

    ReadWriteHDFS
  3. Optional: In the Purpose field, enter a purpose.

    Example

    Read/Write data in HDFS
  4. Optional: In the Description field, enter a description.

    Example

    Standard Job to write and read customer data to and from HDFS
    Tip: Enter a Purpose and Description to stay organized.
  5. Click Finish.

Results

The Designer opens an empty Job.

Creating a Hadoop cluster metadata definition

You can create a Hadoop cluster metadata definition to quickly configure components with your Hadoop cluster information. Talend Studio also allows you to import a cluster metadata definition.

Before you begin

  • This tutorial requires access to a Hadoop cluster.
  • Select the Integration perspective (Window > Perspective > Integration).

Procedure

  1. In the Repository, expand Metadata, right-click Hadoop Cluster and click Create Hadoop Cluster.
  2. In the Name field, enter a name.

    Example

    MyHadoopCluster
  3. Optional: In the Purpose field, enter a purpose.

    Example

    Cluster connection metadata
  4. Optional: In the Description field, enter a description.

    Example

    Metadata to connect to an Amazon EMR cluster
    Tip: Enter a Purpose and Description to stay organized.
  5. Click Next.
  6. Select a Distribution.

    Example

    Select Amazon EMR.
  7. Select a Version.

    Example

    Select EMR 5.15.0 (Hadoop 2.8.3).
  8. Select Enter manually Hadoop services.
  9. Click Finish.
    You are brought to the Hadoop Cluster Connection window.
  10. Enter your Connection details.

    Example

    • Namenode URI: hdfs://hadoopcluster:8020
    • Resource Manager: hadoopcluster:8032
    • Resource Manager Scheduler: hadoopcluster:8030
    • Job History: hadoopcluster:10020
    • Staging directory: /user
  11. Enter your Authentication details.

    Example

    • User name: student
  12. Optional: Click Check Services.
  13. Click Finish.

Results

The Hadoop cluster metadata definition appears in the Repository.
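
For reference, the Connection details above map onto standard Hadoop client properties. The following minimal Java sketch, which is not part of the Studio procedure, shows how the same example values (the hypothetical host hadoopcluster and user student) could be used with the Hadoop client API to verify the connection:

    import java.net.URI;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;

    public class CheckClusterConnection {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            // Namenode URI from the Connection details
            conf.set("fs.defaultS", "hdfs://hadoopcluster:8020".replace("defaultS", "defaultFS")); // see note below
            conf.set("fs.defaultFS", "hdfs://hadoopcluster:8020");
            // Resource Manager, Scheduler, and Job History endpoints
            conf.set("yarn.resourcemanager.address", "hadoopcluster:8032");
            conf.set("yarn.resourcemanager.scheduler.address", "hadoopcluster:8030");
            conf.set("mapreduce.jobhistory.address", "hadoopcluster:10020");
            // Staging directory
            conf.set("yarn.app.mapreduce.am.staging-dir", "/user");

            // Connect with the user name from the Authentication details
            try (FileSystem fs = FileSystem.get(
                    new URI("hdfs://hadoopcluster:8020"), conf, "student")) {
                System.out.println("Connected. Home directory: " + fs.getHomeDirectory());
            }
        }
    }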

Importing a Hadoop cluster metadata definition

You can import your Hadoop cluster configuration to create a Hadoop cluster metadata definition, which lets you quickly configure components with the cluster information. Talend Studio also allows you to create a cluster metadata definition from scratch.

Before you begin

  • This tutorial requires access to a Hadoop cluster.
  • Select the Integration perspective (Window > Perspective > Integration).

Procedure

  1. In the Repository, expand Metadata, right-click Hadoop Cluster and click Create Hadoop Cluster.
  2. In the Name field, enter a name.

    Example

    MyHadoopCluster_files
  3. Optional: In the Purpose field, enter a purpose.

    Example

    Cluster connection metadata
  4. Optional: In the Description field, enter a description.

    Example

    Metadata to connect to a Cloudera CDH cluster
    Tip: Enter a Purpose and Description to stay organized.
  5. Click Next.
  6. Select a Distribution.

    Example

    Select Cloudera.
  7. Select a Version.

    Example

    Select Cloudera CDH6.1.1 [Built in].
  8. Select Import configuration from local files.
  9. Click Next.
  10. Under Location, select the file of your choice in the File Explorer.
  11. Select your modules.

    Example

    Select HDFS or YARN.
  12. Click Finish.
    You are brought to the Hadoop Cluster Connection window, and your Connection details have already been entered.
  13. Optional: Click Check Services.
  14. Click Finish.

Results

The Hadoop cluster metadata definition appears in the Repository.
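
The Import configuration from local files option reads Hadoop client configuration files exported from your cluster, typically core-site.xml, hdfs-site.xml, and yarn-site.xml. As a hedged illustration of where the pre-filled Connection details come from, this minimal Java sketch loads the same kind of files with the Hadoop client API and prints the relevant properties; the conf/ paths are hypothetical:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;

    public class InspectImportedConfig {
        public static void main(String[] args) {
            // Load only the exported cluster files, not the client defaults
            Configuration conf = new Configuration(false);
            conf.addResource(new Path("conf/core-site.xml"));
            conf.addResource(new Path("conf/hdfs-site.xml"));
            conf.addResource(new Path("conf/yarn-site.xml"));

            // Values that pre-fill the Hadoop Cluster Connection window
            System.out.println("fs.defaultFS = " + conf.get("fs.defaultFS"));
            System.out.println("yarn.resourcemanager.address = "
                    + conf.get("yarn.resourcemanager.address"));
            System.out.println("mapreduce.jobhistory.address = "
                    + conf.get("mapreduce.jobhistory.address"));
        }
    }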

Writing and reading data in HDFS

In this tutorial, learn how to write automatically generated random data to HDFS, then how to read it back, sort it, and display the results in the console.

Generating random data

Using the tRowGenerator component, Talend Studio can generate random data that you can use to test other features.

About this task

Follow the examples to create a fictional set of customer records.

Procedure

  1. Add a tRowGenerator component.
    This component helps you generate random data for testing purposes.
  2. Double-click the tRowGenerator component.
    You are brought to the tRowGenerator configuration window.
  3. Click the plus button to add a Column.
    1. In the Column field, enter a name.

      Example

      1. CustomerID
      2. FirstName
      3. LastName
    2. Select the column Types.

      Example

      1. For CustomerID, select the Integer Type.
      2. For FirstName and LastName, select the String Type.
    3. Select the column Functions.

      Example

      1. For CustomerID, select the Numeric.random(int,int) function.

        This function generates random numbers.

      2. For FirstName, select the TalendDataGenerator.getFirstName() function.

        This function generates random first names.

      3. For LastName, select the TalendDataGenerator.getLastName() function.

        This function generates random last names.

  4. Optional: Configure your Columns.

    Example

    For CustomerID, in the Function parameters tab, enter a max value of 1000.
  5. Optional: Enter the number of your choice in the Number of Rows for RowGenerator field.

    Example

    Enter 1000 to create 1000 customers.
  6. Click OK.

Results

You have configured a tRowGenerator component to generate random data. You can now use it to test other features of Talend Studio.

What to do next

Click the Preview button in the Preview tab to see a sample of the generated data.
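
To make the configuration concrete, this minimal Java sketch approximates what the tRowGenerator component produces with the settings above; the name pools are hypothetical stand-ins for the values returned by the TalendDataGenerator routines:

    import java.util.Random;

    public class GenerateCustomers {
        private static final String[] FIRST_NAMES = {"Megan", "Pierre", "Aiko", "Omar"};
        private static final String[] LAST_NAMES = {"Smith", "Dupont", "Chen", "Silva"};

        public static void main(String[] args) {
            Random random = new Random();
            // Number of Rows for RowGenerator: 1000
            for (int i = 0; i < 1000; i++) {
                // CustomerID: Numeric.random(int,int) with a max value of 1000
                int customerId = random.nextInt(1000) + 1;
                // FirstName: stand-in for TalendDataGenerator.getFirstName()
                String firstName = FIRST_NAMES[random.nextInt(FIRST_NAMES.length)];
                // LastName: stand-in for TalendDataGenerator.getLastName()
                String lastName = LAST_NAMES[random.nextInt(LAST_NAMES.length)];
                System.out.println(customerId + "|" + firstName + "|" + lastName);
            }
        }
    }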

Writing data to HDFS using metadata

Using the tHDFSOutput component, you can write data to HDFS.

Before you begin

  • Create a Hadoop cluster metadata definition that includes an HDFS connection (see Creating a Hadoop cluster metadata definition).
  • Select the Integration perspective (Window > Perspective > Integration).

Procedure

  1. In the Repository, expand Metadata > Hadoop Cluster, then expand the Hadoop cluster metadata of your choice.
    1. Drag-and-drop the HDFS metadata onto the Designer.
      You are brought to the Components window.
    2. Select a tHDFSOutput component.
  2. Add an input component.

    Example

    Add a tRowGenerator component to generate fictional data for testing purposes (see Generating random data).
  3. Right-click the input component.
    1. Select Row > Main.
    2. Click on the tHDFSOutput component to link the two.
  4. Double-click the tHDFSOutput component.

    The component is already configured with the predefined HDFS metadata connection information.

  5. In the File Name field, enter the file path and name of your choice.
  6. Optional: In Action, select Overwrite.

Results

Your input component (such as the tRowGenerator component) reads data and the tHDFSOutput component writes it to your HDFS system through the connection defined in your metadata.
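
Under the hood, this part of the Job performs the equivalent of an HDFS client write. As a rough sketch, assuming the example cluster connection from earlier (host hadoopcluster, user student) and a hypothetical file path, the same operation with the plain Hadoop client API in Java could look like this:

    import java.io.BufferedWriter;
    import java.io.OutputStreamWriter;
    import java.net.URI;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class WriteCustomersToHdfs {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            conf.set("fs.defaultFS", "hdfs://hadoopcluster:8020"); // example Namenode URI

            try (FileSystem fs = FileSystem.get(
                    new URI("hdfs://hadoopcluster:8020"), conf, "student")) {
                Path file = new Path("/user/student/customers.csv"); // hypothetical File Name
                // The boolean argument overwrites an existing file,
                // like the Action > Overwrite option of tHDFSOutput
                try (BufferedWriter out = new BufferedWriter(
                        new OutputStreamWriter(fs.create(file, true)))) {
                    out.write("1|Megan|Smith"); // one generated row per line
                    out.newLine();
                }
            }
        }
    }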

Reading data from HDFS using metadata

Using the tHDFSInput component, you can read data from HDFS.

Before you begin

  • Create a Hadoop cluster metadata definition that includes an HDFS connection (see Creating a Hadoop cluster metadata definition).
  • Write data to HDFS (see Writing data to HDFS using metadata).
  • Select the Integration perspective (Window > Perspective > Integration).

Procedure

  1. In the Repository, expand Metadata > Hadoop Cluster, then expand the Hadoop cluster metadata of your choice.
    1. Drag-and-drop the HDFS metadata onto the Designer.
    2. Select a tHDFSInput component.
  2. Double-click the tHDFSInput component.

    The component is already configured with the predefined HDFS metadata connection information.

  3. In the File Name field, enter the file path and name of your choice.
  4. Click the […] button next to Edit schema.
  5. Click the plus button to add a Column.
    1. In the Column field, enter a name.

      Example

      1. CustomerID
      2. FirstName
      3. LastName
    2. Select the column Types.

      Example

      1. For CustomerID, select the Integer Type.
      2. For FirstName and LastName, select the String Type.
    3. Click OK.
  6. Right-click the tRowGenerator component.
    1. Select Trigger > On Subjob OK.
    2. Click on the tHDFSInput component to link the two.
  7. Add a tSortRow component.
  8. Right-click the tHDFSInput component.
    1. Select Row > Main.
    2. Click on the tSortRow component to link the two.
  9. Double-click the tSortRow component.
    1. Click Sync columns.
      The tSortRow component inherits the schema from the tHDFSInput component.
  10. Click the plus button.
    The first column of your tHDFSInput component schema appears.
  11. Add a tLogRow component.
  12. Right-click the tSortRow component.
    1. Select Row > Main.
    2. Click on the tLogRow component to link the two.
  13. Double-click the tLogRow component.
    1. Select Table (print values in cells of a table).
  14. In the Run view, click Run.

Results

Your input component (such as the tRowGenerator component) provides data to the tHDFSOutput component, which writes it to your HDFS system. When this operation is complete, the tHDFSInput component reads the data and passes it to the tSortRow component, which sorts it; the tLogRow component then displays the sorted data in the console.
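
For comparison, this hedged Java sketch reproduces the read, sort, and display stages of the Job with the plain Hadoop client API, reusing the same example host, user, and hypothetical file path as above:

    import java.io.BufferedReader;
    import java.io.InputStreamReader;
    import java.net.URI;
    import java.util.ArrayList;
    import java.util.Comparator;
    import java.util.List;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class ReadSortAndDisplay {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            conf.set("fs.defaultFS", "hdfs://hadoopcluster:8020"); // example Namenode URI

            List<String[]> rows = new ArrayList<>();
            try (FileSystem fs = FileSystem.get(
                         new URI("hdfs://hadoopcluster:8020"), conf, "student");
                 BufferedReader in = new BufferedReader(new InputStreamReader(
                         fs.open(new Path("/user/student/customers.csv"))))) {
                // tHDFSInput: read CustomerID|FirstName|LastName rows
                String line;
                while ((line = in.readLine()) != null) {
                    rows.add(line.split("\\|"));
                }
            }

            // tSortRow: sort on the first schema column, CustomerID
            rows.sort(Comparator.comparingInt((String[] r) -> Integer.parseInt(r[0])));

            // tLogRow: print values in cells of a table
            for (String[] r : rows) {
                System.out.printf("%-10s | %-12s | %-12s%n", r[0], r[1], r[2]);
            }
        }
    }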