Using Spark with Talend Data Mapper
Apache Spark (Spark) is useful when you want to process large input files with Talend Data Mapper. You can take advantage of Spark's speed and streaming capabilities to stream the file and perform the mapping without having to load the full file into memory before any transformation takes place.
Read through this scenario to learn how you can easily test the capabilities of Spark and Talend Data Mapper together when importing large input files.
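The streaming idea behind this approach can be sketched in plain Python: records are read and transformed one at a time, so memory use stays constant regardless of file size. This is only an illustration of the principle; Talend generates its own Spark code for the actual Job.

```python
import csv

def stream_records(path, transform):
    """Yield one transformed record at a time from a CSV file.

    Rows are read lazily, so the full file is never loaded into
    memory -- the same principle Spark applies when streaming a
    large input through a mapping.
    """
    with open(path, newline="") as f:
        for row in csv.DictReader(f):
            yield transform(row)
```

Because `stream_records` is a generator, downstream processing can consume records as they arrive, just as Spark workers consume partitions of a large file.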
For more information on Apache Spark, see the official documentation at http://spark.apache.org/. For more information on Talend Data Mapper, see the Talend Data Mapper User Guide.
Talend Studio includes a local Spark environment (Spark 1.3.0 or later) that can run Jobs. If you wish to test your Job locally, ensure that you have installed version 6.5 or higher of the Studio.
The following is an example of an environment you can configure to successfully perform this scenario:
- Three instances of CentOS 7.x servers on the Google Cloud Platform, with Cloudera 5.13 installed as a cluster and the Hadoop Distributed File System (HDFS) and Spark services enabled
- A Windows 10 client
Connecting to a Hadoop Cluster
- In Talend Studio, navigate to .
- Right-click Hadoop Cluster and select Create Hadoop Cluster.
- Fill in the required fields.
- When prompted to select an import option, select Retrieve configuration from Ambari or Cloudera, then provide your Cloudera Manager credentials.
- Click Connect to populate the Discovered clusters section, from which you can fetch the services.
- Click Next.
- Click Check Services to verify that all the services are activated, then check the service status.
Creating the Talend Data Mapper Structure
Create a structure for your map.
- Switch to the Mapping perspective and navigate to .
- Right-click Structures and select .
- Select Import a structure definition.
- Select CSV.
- Specify the file that contains the input records in the Local file field. In this example, use
- Enter the name of the structure and click Finish to create the schema based on the input file.
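For illustration, a CSV input file suitable for the structure import could look like the following. The columns shown here are made up, since the scenario does not specify the contents of the actual file:

```
item,qty,price
apple,3,0.50
pear,5,0.75
```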
Creating the Big Data Batch Job
After you have created a Hadoop cluster connection and a structure, design the Big Data Batch Job using the tHDFSConfiguration, tHMapInput, and tLogRow components.
- Switch to the Integration perspective and navigate to .
- Right-click Big Data Batch and select Create Big Data Batch Job.
- Provide the necessary details to create the Job.
- Drag the Hadoop Cluster metadata you created into the Job Design and select the tHDFSConfiguration component.
- Add tHMapInput and tLogRow, and connect the components.
- Enter OUTPUT when prompted for the output name.
- Select tHMapInput to open its Basic settings view.
- Select the Define a storage configuration component check box and select the tHDFSConfiguration component as the storage.
- Specify the input file in the Input field.
- Select Configure Component and choose the structure you created earlier.
- Select Flat from the Input Representation dropdown menu.
- Click Next and add the input file in the Sample from File System field.
Testing the Map and running the Job
Test the map that was automatically generated from the Talend Data Mapper structure.
- Open the map and drag in the elements that you wish to include in the output.
- Click Test Run.
- Go back to the Big Data Batch Job and click Run to execute it.
If you encounter errors while performing this scenario, take a look at the following solutions to help you successfully run the Job.
Incorrect Cloudera setup: Cloudera may have set up your cluster with its internal Fully Qualified Domain Names (FQDNs). If this is the case, you may need to add entries to your hosts file to prevent connection issues.
To do this, navigate to C:\Windows\System32\drivers\etc and open the hosts file as an Administrator. Add each node's external IP address followed by its internal FQDN, then save the file.
This allows your client machine to resolve the cluster's internal FQDNs.
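For example, the added hosts entries might look like the following. The IP addresses and FQDNs below are placeholders; use the values reported by your own cluster:

```
# C:\Windows\System32\drivers\etc\hosts
203.0.113.10   node1.c.my-project.internal
203.0.113.11   node2.c.my-project.internal
203.0.113.12   node3.c.my-project.internal
```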
Common error in any Big Data Batch Job: If you are connecting to a Hadoop cluster located on a different server from your Talend Studio, you can ignore the following error:
The error simply means that winutils could not be located when running the Spark workers locally. To get rid of this error, download and extract winutils, then set your Hadoop home directory to the location where you extracted it.
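A sketch of the winutils fix on a Windows client, assuming you extract winutils.exe to C:\hadoop\bin (the path is an example; any location works as long as the Hadoop home directory points to its parent folder):

```
:: Create a bin folder for winutils.exe (example path)
mkdir C:\hadoop\bin
:: ...copy the downloaded winutils.exe into C:\hadoop\bin...
:: Point HADOOP_HOME at the parent of the bin folder
setx HADOOP_HOME "C:\hadoop"
```

Restart Talend Studio after setting the variable so the new environment is picked up.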