Using Spark with Talend Data Mapper
Apache Spark (Spark) is useful when you want to process large input files with Talend Data Mapper. You can take advantage of Spark's speed and streaming capabilities to stream the file and perform the mapping without having to load the full file into memory before any transformation takes place.
If you want to test the capabilities of Spark and Talend Data Mapper together when importing large input files, read through this scenario to learn how to do it.
For more information on Apache Spark, see their official documentation at http://spark.apache.org/. For more information on Talend Data Mapper, see the Talend Data Mapper User Guide.
Prerequisites
Talend Studio includes a local Spark environment that can run Jobs. To perform the following scenario successfully, here is an example of an environment that you can configure:
- Three CentOS server instances on Google Cloud Platform, with Cloudera installed as a cluster and the Hadoop Distributed File System (HDFS) and Spark services enabled
- A Windows 10 client
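Once the cluster is up, you can quickly check from one of the cluster nodes that the HDFS service is running, for example:

hdfs dfsadmin -report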
Connecting to a Hadoop Cluster
Set up a connection to your Hadoop cluster and its HDFS service in Talend Studio so that your Job can access the files stored on the cluster.
Procedure
Creating the Talend Data Mapper structure
Create a structure for your map.
Before you begin
Make sure you have a sample input file available from which to create the structure, such as the following CSV file:
firstName,lastName,age
John,Doe,20
Jane,Doe,35
Kid,Doe,02
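Because the Job reads its input through the cluster's HDFS service, also make sure the sample file is available on HDFS. For example, assuming you saved it as persons.csv (a hypothetical name), you can upload it from a cluster node with:

hdfs dfs -put persons.csv /user/<your-user>/persons.csv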
Procedure
Creating a Big Data Batch Job with an HDFS connection
After you have created the Hadoop cluster connection and the structure, design the Big Data Batch Job by adding the tHDFSConfiguration, tHMapInput, and tLogRow components.
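You do not write Spark code yourself in a Big Data Batch Job; Talend generates it from the components. Purely as an illustration of what the Job does conceptually, and not the code that Talend generates, here is a minimal sketch in plain Spark Java that reads the sample CSV from HDFS and prints it, with a hypothetical application name, NameNode host, and file path:

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

public class PersonsJobSketch {
    public static void main(String[] args) {
        // Rough equivalent of tHDFSConfiguration: the Spark session carries the
        // cluster configuration. When submitted with spark-submit, the master
        // and Hadoop settings come from the cluster environment.
        SparkSession spark = SparkSession.builder()
                .appName("PersonsJobSketch")
                .getOrCreate();

        // Rough equivalent of tHMapInput: read the input records from HDFS.
        Dataset<Row> persons = spark.read()
                .option("header", "true")
                .csv("hdfs://your-namenode:8020/user/you/persons.csv");

        // Rough equivalent of tLogRow: print the rows to the console.
        persons.show();

        spark.stop();
    }
}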
Procedure
Configuring the map and running the Job
Map the elements from the input to the output structure and run the Job.
Procedure
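Once the map is configured and the Job runs successfully, tLogRow prints the mapped records to the console. Assuming a simple one-to-one mapping of the three input fields, the output should resemble the following (the exact layout depends on the tLogRow mode you select):

John|Doe|20
Jane|Doe|35
Kid|Doe|02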
Troubleshooting your Job
If you encounter errors while performing this sample scenario, take a look at the following solutions to help you run the Job successfully.
- Incorrect Cloudera setup: Cloudera may have set up your cluster with its internal Fully Qualified Domain Names (FQDNs). If this is the case, you may need to add an entry to your hosts file to prevent connection issues.
To do this, navigate to C:\Windows\System32\drivers\etc and open the hosts file as an administrator. Add your cluster's external IP address followed by the internal FQDN, then save the file.
This allows your client machine to resolve the internal FQDNs that Cloudera uses.
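For example, assuming a hypothetical external IP address and internal FQDN, the line to add would look like this:

203.0.113.10 node1.cluster.internal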
- Common error in any Big Data Batch Job: If you are connecting to a Hadoop cluster that is located on a different server from your Talend Studio, you can safely ignore the error about locating winutils. This error only means that Spark cannot find the winutils binary, which is needed to run the Spark workers locally. To get rid of this error, download and extract winutils, then set your Hadoop home directory to the location where you extracted it.
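For example, assuming you extracted winutils so that winutils.exe is located in C:\hadoop\bin (a hypothetical location), you can set the Hadoop home directory from a Command Prompt with:

setx HADOOP_HOME "C:\hadoop"

Restart Talend Studio afterwards so that it picks up the new environment variable.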