Using Spark with Talend Data Mapper - 7.2

Version
7.2
Language
English (United States)
Product
Talend Big Data Platform
Talend Data Fabric
Talend Data Management Platform
Talend Data Services Platform
Talend MDM Platform
Talend Real-Time Big Data Platform

Using Spark with Talend Data Mapper

Apache Spark (Spark) is useful when you want to process large input files with Talend Data Mapper. You can take advantage of the speed and streaming capabilities of Spark to stream the file and perform the mapping without having to load the full file into memory before applying any transformation.

If you want to test how Spark and Talend Data Mapper handle large input files together, read through this scenario to learn how to do it.

For more information on Apache Spark, see their official documentation at http://spark.apache.org/. For more information on Talend Data Mapper, see the Talend Data Mapper User Guide.
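
To give an idea of what happens under the hood, the following hand-written sketch shows how Spark itself reads a large CSV file as a distributed dataset instead of loading it into memory in one piece. It is only an illustration of the Spark behavior that the generated Job relies on, not the code that Talend Studio generates; the HDFS path, application name, and class name are placeholder values.

  import org.apache.spark.sql.Dataset;
  import org.apache.spark.sql.Row;
  import org.apache.spark.sql.SparkSession;

  public class CsvMappingSketch {
      public static void main(String[] args) {
          // When a Big Data Batch Job runs, the Spark session is provided by the cluster.
          SparkSession spark = SparkSession.builder()
                  .appName("tdm-csv-sketch")
                  .getOrCreate();

          // Spark reads the file partition by partition as a distributed dataset,
          // so the full file never has to fit into the driver's memory.
          Dataset<Row> persons = spark.read()
                  .option("header", "true")   // comparable to skipping the CSV header
                  .csv("hdfs://namenode:8020/user/talend/persons.csv"); // placeholder path

          // Comparable to what tLogRow prints at the end of the Job.
          persons.show();

          spark.stop();
      }
  }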

Prerequisites

Talend Studio contains a local Spark environment that can run Jobs. To perform the following scenario successfully, you can configure an environment such as the following example:

  • Three CentOS server instances on Google Cloud Platform, with Cloudera installed as a cluster and the Hadoop Distributed File System (HDFS) and Spark services enabled
  • A Windows 10 client

Connecting to a Hadoop Cluster

Create a Hadoop Cluster connection.

Procedure

  1. In Talend Studio, navigate to Repository > Metadata.
  2. Right-click Hadoop Cluster and select Create Hadoop Cluster.
  3. Enter a name for your cluster and click Next.
  4. Select your distribution, Cloudera in this example, and select the version.
  5. Select Retrieve configuration from Ambari or Cloudera and click Next.
  6. Enter your Cloudera Manager credentials.
  7. Click Connect to populate the Discovered clusters section, from which you can fetch the services.
  8. Click Next.
  9. Click Check Services to verify that all the services are activated, then check the service status.
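
When the wizard retrieves the configuration, it fills in the cluster connection details for you. On a Cloudera cluster using default ports, the key values typically look like the following; the host name is an illustrative placeholder, and your actual values come from the wizard:

  NameNode URI:                 hdfs://node1.c.my-project.internal:8020
  Resource Manager:             node1.c.my-project.internal:8032
  Resource Manager scheduler:   node1.c.my-project.internal:8030
  Job history:                  node1.c.my-project.internal:10020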

Creating the Talend Data Mapper structure

Create a structure for your map.

Before you begin

You have a CSV file to use as input. You can create one from the following sample:
firstName,lastName,age
John,Doe,20
Jane,Doe,35
Kid,Doe,02

Procedure

  1. Open the Mapping perspective and navigate to Data Mapper > Hierarchical Mapper.
  2. Right-click Structures and select New > Structure.
  3. Select Import a structure definition and click Next.
  4. Select CSV and click Next.
  5. Select the sample file to use and click Next.
  6. Enter the name of the structure and click Next.
  7. In the CSV properties, select the Skip header reading check box, then click Next and Finish.
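
For the sample file above, the imported structure is a flat record whose fields come from the CSV header row, conceptually similar to the following outline; the actual root and loop element names depend on the name you entered and on the wizard defaults:

  Persons
    Row (occurs 0 to unbounded)
      firstName
      lastName
      age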

Creating the Big Data Batch Job

After you have created a Hadoop Cluster and a Structure, design the Big Data Batch Job including the tHDFSConfiguration, tHMapInput, and tLogRow components.

Procedure

  1. Open the Integration perspective and navigate to Repository > Job Designs.
  2. Right-click Big Data Batch and select Create Big Data Batch Job.
  3. Enter the necessary details to create the Job.
  4. Drag the Hadoop Cluster metadata you created into the Job Design and select the tHDFSConfiguration component.
  5. Add a tHMapInput and a tLogRow component and connect them using a Row > Main connection.
    1. When prompted for the output name, enter Output.
  6. Double-click the tLogRow and define its schema:
    1. Click the […] button next to Edit schema.
    2. In the Output (Input) section, click the + to add three new columns and name them firstName, lastName and age.
    3. Click the button to copy the columns to tLogRow_1 (Output).
  7. Click the tHMapInput and open the Basic Settings tab.
    1. Select the Define a storage configuration component check box and select the tHDFSConfiguration component as the chosen storage.
    2. Specify the input file in the Input field (see the example path after this procedure).
    3. Click the […] button next to Configure Component and select the structure you created earlier.
    4. Select CSV in the Input Representation drop-down list.
    5. Click Next and add the input file in the Sample File field, then click Run to check the number of records found.
    6. Click Finish.
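
The Input field expects the path of the file to process on HDFS, typically entered as a double-quoted string, for example (illustrative path only):

  "/user/talend/persons.csv"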

Configuring the map and running the Job

Map the elements from the input to the output structure and run the Job.

Procedure

  1. Drag the input row element onto the output OutputRecord element.
    All the elements are automatically mapped.
  2. Click Test Run to see a preview of the output.
    In this example, it looks like this when displayed as JSON:
    {
      "Output" : {
        "OutputRecord" : [ {
          "firstName" : "firstName",
          "lastName" : "lastName",
          "age" : "age"
        }, {
          "firstName" : "John",
          "lastName" : "Doe",
          "age" : "20"
        }, {
          "firstName" : "Jane",
          "lastName" : "Doe",
          "age" : "35"
        }, {
          "firstName" : "Kid",
          "lastName" : "Doe",
          "age" : "02"
        } ]
      }
    }
  3. Save the map and go back to the Big Data Batch Job.
  4. Open the Run tab and click Run to execute the Job.

Troubleshooting

If you encounter errors while performing the sample scenario, take a look at some solutions to help you successfully run the Job.

  • Incorrect Cloudera setup: Cloudera may have set up your cluster with its internal Fully Qualified Domain Names (FQDNs). If this is the case, you may need to add an entry to your hosts file to prevent connection issues.

    To do this, navigate to C:\Windows\System32\drivers\etc and open the hosts file as an administrator. Add a line with your cluster's external IP address followed by its internal FQDN (see the example entry after this list), then save the file.

    This allows your client machine to resolve the cluster's internal FQDNs when connecting to it.

  • Common error in any Big Data Batch Job: If you are connecting to a Hadoop cluster that is located on a different server from your Talend Studio, you may see an error stating that the winutils binary cannot be located. You can safely ignore it.
    The error only occurs because Spark looks for winutils to run the Spark workers locally. To get rid of this error, download and extract winutils, then set your Hadoop home directory to the location where you extracted it (see the example after this list).
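
For example, a hosts file entry that maps the cluster's external IP address to its internal FQDN could look like the following; both values are placeholders and must be replaced with the addresses of your own cluster:

  203.0.113.10    node1.c.my-project.internal

For the winutils error, if you extract winutils.exe to a folder such as C:\hadoop\bin, set the Hadoop home directory to C:\hadoop, for example by defining the HADOOP_HOME environment variable or by passing -Dhadoop.home.dir=C:/hadoop as a JVM argument in the Advanced settings view of the Run tab. The folder names here are examples only.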