
Using Spark with Talend Data Mapper

Apache Spark (Spark) is useful when you want to process large input files with Talend Data Mapper. You can take advantage of Spark's speed and streaming capabilities to stream the file and perform the mapping as records arrive, without having to load the full file into memory before applying any transformation.
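The stream-then-transform idea described above can be sketched outside Talend in plain Python. This is an illustration of the pattern only, not the Spark or Talend Data Mapper API; the function names and the uppercase "mapping" are hypothetical placeholders for a real Talend Data Mapper transformation:

```python
def transform(record: str) -> str:
    # Hypothetical mapping step standing in for a Talend Data Mapper
    # transformation: here it just normalizes the record to uppercase.
    return record.strip().upper()

def stream_mapped_records(path: str):
    # Lazily yield transformed records one at a time, so the whole input
    # file never has to fit in memory -- the same principle Spark applies
    # at cluster scale when streaming a large file.
    with open(path, encoding="utf-8") as handle:
        for line in handle:
            yield transform(line)

# Usage: iterate over the generator to write mapped output without
# ever materializing the full input file.
# for record in stream_mapped_records("big_input.txt"):
#     print(record)
```

Spark generalizes this single-machine pattern by partitioning the input (for example, a file on HDFS) across executors, so each partition is streamed and mapped in parallel.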

If you want to test how Spark and Talend Data Mapper work together on large input files, read through this scenario to learn how to do it.

For more information on Apache Spark, see their official documentation at http://spark.apache.org/. For more information on Talend Data Mapper, see the Talend Data Mapper User Guide.

Prerequisites

Talend Studio includes a local Spark environment that can run Jobs. To perform the following scenario, here is an example of an environment that you can configure:

  • Three instances of CentOS servers on Google Cloud Platform with Cloudera installed as a cluster with the Hadoop Distributed File System (HDFS) and Spark services enabled
  • Windows 10 Client
