Using Spark with Talend Data Mapper - 7.1

author
Talend Documentation Team
EnrichVersion
7.1
EnrichProdName
Talend Big Data Platform
Talend Data Fabric
Talend Data Management Platform
Talend Data Services Platform
Talend MDM Platform
Talend Real-Time Big Data Platform
task
Data Governance > Third-party systems > Processing components (Integration) > Data mapping
Data Quality and Preparation > Third-party systems > Processing components (Integration) > Data mapping
Design and Development > Third-party systems > Processing components (Integration) > Data mapping

Using Spark with Talend Data Mapper

Apache Spark (Spark) is useful when you want to process large input files with Talend Data Mapper. You can take advantage of the speed and streaming capabilities of Spark to stream the file and process the mapping without having to load the full file into memory before performing any transformation.

If you wish to test the capabilities of Spark and Talend Data Mapper together on large input files, read through this scenario to learn how you can do it.

For more information about Apache Spark, see the official documentation at http://spark.apache.org/. For more information about Talend Data Mapper, see the Talend Data Mapper User Guide.

Prerequisites

Talend Studio contains a local Spark environment (version 1.3.0 or later) that can run Jobs. If you wish to test your Job locally, ensure that you have installed version 6.5 or higher of Talend Studio.

To successfully perform the following scenario, here is an example of an environment that you can configure:

  • Three CentOS 7.x server instances on the Google Cloud Platform: ensure that Cloudera 5.13 is installed as a cluster with the Hadoop Distributed File System (HDFS) and Spark services enabled
  • A Windows 10 client

Connecting to a Hadoop Cluster

Create a Hadoop Cluster connection.

Procedure

  1. In Talend Studio, navigate to Repository > Metadata.
  2. Right-click Hadoop Cluster and select Create Hadoop Cluster.
  3. Provide information in the fields provided.
  4. When prompted to select an import option, specify Retrieve configuration from Ambari or Cloudera.
  5. Provide your Cloudera manager credentials.
  6. Click Connect to populate the Discovered clusters section, from which you can fetch the services.
  7. Click Next.
  8. Click Check Services to verify that all the services are activated, then check the service status.

Creating the Talend Data Mapper Structure

Create a Structure for your map.

Procedure

  1. Switch to the Mapping perspective and navigate to Data Mapper > Hierarchical Mapper.
  2. Right-click Structures and select New > Structure.
  3. Select Import a structure definition.
  4. Select CSV.
  5. Specify the file that contains the input records in the Local file field. In this example, use raw.txt (a sample is shown after this procedure).
  6. Enter the name of the structure and click Finish to create the schema based on the input file.
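
For reference, a CSV input file for this scenario could look like the following. These records are purely hypothetical; use your own raw.txt with the delimiter and columns that match your data:

    id,firstName,lastName,city
    1,John,Smith,Boston
    2,Jane,Doe,Chicago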

Creating the Big Data Batch Job

After you have created a Hadoop Cluster and a Structure, design the Big Data Batch Job including the tHDFSConfiguration, tHMapInput, and tLogRow components.

Procedure

  1. Switch to the Integration perspective and navigate to Repository > Job Designs.
  2. Right-click Big Data Batch and select Create Big Data Batch Job.
  3. Provide the necessary details to create the Job.
  4. Drag the Hadoop Cluster metadata you created into the Job Design and select the tHDFSConfiguration component.
  5. Add tHMapInput and tLogRow and connect them using a Row > Main connection.
    1. When prompted for the output name, enter OUTPUT.
  6. Select tHMapInput to open the Basic Settings tab.
    1. Select the Define a storage configuration component check box and select the tHDFSConfiguration component as the chosen storage.
    2. Specify the input file in the Input field (see the example after this procedure).
    3. Select Configure Component and choose the structure you created earlier.
    4. Select Flat from the Input Representation dropdown menu.
    5. Click Next and add the input file in the Sample from File System field.
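
The Input field expects the path to the input file as seen from the configured storage. For example, assuming that raw.txt was uploaded to HDFS under /user/talend (a hypothetical location), the value could be:

    "/user/talend/raw.txt"

Adjust the path to wherever you placed the file on your cluster.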

Testing the Map and Running the Job

Test the Map that is automatically generated with the Talend Data Mapper Structure.

Procedure

  1. Open the map and drag in the elements that you want to include in the output.
  2. Click Test Run.
  3. Go back to the Big Data Batch Job.
  4. Click Run to execute it.

Troubleshooting

If you encounter errors while performing the sample scenario, take a look at some solutions to help you successfully run the Job.

  • Incorrect Cloudera setup: Cloudera may have set up your cluster with its internal Fully Qualified Domain Names (FQDNs). If this is the case, you may need to add entries to your hosts file to prevent connection issues.

    To do this, navigate to C:\Windows\System32\drivers\etc and open the hosts file as an Administrator. Then add your cluster's external IP address and your internal FQDN, and save the file.

    This should prompt Cloudera to use the internal FQDN.
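
    For example, with a three-node cluster, the entries added to the hosts file could look like the following (the addresses and names are hypothetical; use your own cluster's external IP addresses and internal FQDNs):

        35.202.0.10   node1.c.my-project.internal
        35.202.0.11   node2.c.my-project.internal
        35.202.0.12   node3.c.my-project.internal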

  • Common error in any Big Data Batch Job: If you are connecting to a Hadoop Cluster that is located on a different server from your Talend Studio, then you can ignore the error stating that the winutils binary could not be located.

    This error simply means that the Job cannot find winutils, which is only needed to run the Spark workers locally. To get rid of it, download and extract winutils, then set your Hadoop home directory to the location where you extracted it, as shown in the example below.
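
    For example, on Windows, assuming that you extracted winutils.exe to C:\hadoop\bin (a hypothetical location), you could set the Hadoop home directory from a command prompt:

        rem Hypothetical path: adjust to where you extracted winutils.exe
        setx HADOOP_HOME "C:\hadoop"

    Restart Talend Studio afterwards so that the new environment variable is taken into account. Alternatively, you can pass -Dhadoop.home.dir=C:/hadoop as a JVM argument in the Advanced settings of the Run view.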