Using Spark with Talend Data Mapper
Apache Spark (Spark) is useful when you want to process large input files with Talend Data Mapper. You can take advantage of Spark's speed and streaming capabilities to stream the file and perform the mapping without having to load the full file into memory before any transformation takes place.
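To see why streaming matters here, consider the Spark API underneath a Talend Job: an RDD created from a file is evaluated lazily and processed partition by partition, so the whole file never has to fit into memory at once. The following minimal Java sketch illustrates that behavior; the file paths and the record transformation are placeholder assumptions, not the code that Talend Studio actually generates.

    import org.apache.spark.SparkConf;
    import org.apache.spark.api.java.JavaRDD;
    import org.apache.spark.api.java.JavaSparkContext;

    public class StreamingMapSketch {
        public static void main(String[] args) {
            SparkConf conf = new SparkConf().setAppName("tdm-streaming-sketch").setMaster("local[*]");
            JavaSparkContext sc = new JavaSparkContext(conf);

            // textFile() is lazy: no data is read until an action runs,
            // and records then flow through one partition at a time.
            JavaRDD<String> records = sc.textFile("hdfs:///user/talend/large-input.txt");

            // Placeholder transformation standing in for the mapping step.
            JavaRDD<String> mapped = records.map(record -> record.trim());

            // The action triggers the streamed computation; the full file
            // is never materialized in the driver's memory.
            mapped.saveAsTextFile("hdfs:///user/talend/mapped-output");
            sc.stop();
        }
    }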
If you wish to test the capabilities of Spark and Talend Data Mapper together on large input files, read through this scenario to learn how to do it.
For more information on Apache Spark, see their official documentation at http://spark.apache.org/. For more information on Talend Data Mapper, see the Talend Data Mapper User Guide.
Prerequisites
Talend Studio includes a local Spark environment (Spark 1.3.0 or later) that can run Jobs. If you wish to test your Job locally, ensure that you are using Talend Studio version 6.5 or later.
The following is an example of an environment that you can configure to successfully perform this scenario:
- Three CentOS 7.x server instances on the Google Cloud Platform: ensure that Cloudera 5.13 is installed as a cluster, with the Hadoop Distributed File System (HDFS) and Spark services enabled
- A Windows 10 client
Connecting to a Hadoop Cluster
Creating the Talend Data Mapper Structure
Creating the Big Data Batch Job
After you have created the Hadoop cluster connection and the Structure, design the Big Data Batch Job using the tHDFSConfiguration, tHMapInput, and tLogRow components.
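Conceptually, the three components correspond to the stages of a small Spark program: tHDFSConfiguration supplies the cluster connection, tHMapInput reads the input and applies the mapping defined by the Structure, and tLogRow prints the results. The Java sketch below illustrates this correspondence; the NameNode URI, paths, and transformation are placeholder assumptions, not the code that Studio generates.

    import org.apache.spark.SparkConf;
    import org.apache.spark.api.java.JavaRDD;
    import org.apache.spark.api.java.JavaSparkContext;

    public class BatchJobSketch {
        public static void main(String[] args) {
            // The Spark master is expected to be supplied at submit time.
            JavaSparkContext sc = new JavaSparkContext(new SparkConf().setAppName("tdm-batch-sketch"));

            // tHDFSConfiguration: point Spark at the cluster's NameNode (placeholder URI).
            sc.hadoopConfiguration().set("fs.defaultFS", "hdfs://node1.example.internal:8020");

            // tHMapInput: read the records and apply the Structure-driven mapping.
            JavaRDD<String> mapped = sc.textFile("/user/talend/input.dat")
                                       .map(record -> transform(record));

            // tLogRow: print each mapped record to the console.
            mapped.collect().forEach(System.out::println);
            sc.stop();
        }

        // Placeholder for the transformation the Talend Data Mapper Structure defines.
        private static String transform(String record) {
            return record;
        }
    }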
Procedure
Testing the Map and running the Job
Test the Map that was automatically generated with the Talend Data Mapper Structure, then run the Job.
Procedure
Troubleshooting
If you encounter errors while performing this sample scenario, review the following solutions to help you run the Job successfully.
- Incorrect Cloudera setup: Cloudera may have set up your cluster with its internal Fully Qualified Domain Names (FQDNs). If this is the case, you may need to add an entry to your hosts file to prevent connection issues.
To do this, navigate to C:\Windows\System32\drivers\etc and open the hosts file as an administrator. Add a line with your cluster's external IP address followed by its internal FQDN, then save the file.
This allows your machine to reach the cluster through the internal FQDNs that Cloudera uses.
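For example, a hosts file entry might look like the following, where both the IP address and the FQDN are placeholders for your cluster's actual values:

    203.0.113.10    node1.c.my-project.internal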
- Common error in any Big Data Batch Job: If you are connecting to a Hadoop cluster that is located on a different server from your Talend Studio, you can safely ignore the following error: it simply means that winutils.exe, which is used to run the Spark workers locally, could not be located. To get rid of this error, download and extract winutils, then set your Hadoop home directory to the location where you extracted it.
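For instance, assuming you extracted winutils.exe to C:\hadoop\bin (a hypothetical location), you could declare the Hadoop home directory through the HADOOP_HOME environment variable, or set the equivalent system property in Java before the Spark context is created:

    // Hypothetical path: hadoop.home.dir must point at the folder whose
    // bin subdirectory contains winutils.exe.
    System.setProperty("hadoop.home.dir", "C:\\hadoop");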