Introduction to Talend Data Mapper on Spark - 7.1

Talend Data Fabric
Talend Studio
Administration and Monitoring
Data Governance
Data Quality and Preparation
Design and Development
Installation and Upgrade

Introduction to Talend Data Mapper on Spark

This tutorial covers getting started using Talend Data Mapper (TDM) on Spark: what is Talend Data Mapper and how does Talend Data Mapper work on Spark? This tutorial uses Talend Data Fabric 6.2 and a Cloudera CDH 5.4 cluster.
What is Talend Data Mapper?

Talend Data Mapper is an advanced data-mapping tool designed to handle complex data structures. This kind of data mapping is important for managing Electronic Data Interchange (EDI) in the healthcare industry or conveying financial information between banks using Financial products Markup Language (FpML).

TDM allows you to define and execute transformations between data records or documents.

Why TDM on Spark?

Files to be processed using TDM are becoming larger and larger. TDM on Spark was developed to overcome the limitations of TDM. It was designed to transform very large files in a short processing time. Depending on the transformation used, a 100GB file is processed in less than 30 minutes. TDM on Spark was also designed to overcome the limitations of the streaming mode of TDM, which does not support multi-output processing and allows only sequential processing.

Instead of a single Swiss-Army-Knife mapping component, called tHMap, TDM on Spark offers multiple specialized components with simpler wizards than the tHMap component:

  • tHConvertFile: converts hierarchical data from file to file
  • tHMapFile: transforms hierarchical data from file to file(s)
  • tHMapInput: transforms hierarchical data from file to component(s)

For a brief introduction to the features of TDM on Spark, view the Talend Data Mapper on Spark video .

How can TDM work on Spark?

TDM streaming is designed to run on a single node and single thread, so the processing can’t be parallelized. With TDM on Spark, the approach is different. The files are processed one record at a time and records are processed in parallel, allowing multi-node and multi-thread processing.

But this approach is valid only if the input file is regular, meaning that it is a collection of records and all records have the same schema. The other condition is that the mapping can be performed on each record separately and then the mappings are merged to build the output file.

The challenge is to split the records so that they can be processed in parallel. In TDM, regular files are structured as follows:


|-- loop(0:*)

|-- record <-- inherits from record’s structure

The splitting process works sequentially on the entire file, so it must be very fast. The method varies based on the file type. For example, to split records in a CSV file, the newline character can be used. The splitting process for XML or JSON files is based on the start tag and end tag. Alternatively, the envelopes marker is used to split EDI and SAP files.

Files stored on HDFS are automatically split, typically into 128MB blocks. But this automatic splitting does not respect the record boundaries. Observe the diagram below:



If you start with a very large file, it is split into partitions by Hadoop and then each partition is processed in parallel by Hadoop. But to create the partitions, records are not taken into account. The file is split into partitions of a predefined size (typically 128MB), regardless of the content of the file. Some logic implementation is necessary to recover the records of interest.

In each node, the partitions are further split into smaller portions, called records, by the RecordReader. Records appear in red in the diagram above. An RDD is created from this record and the mapping function is applied to it. If a record is split between two partitions, HDFS offers the ability to remotely read the end of the record. This reading is not optimal, as it is remote, but it should happen only once per partition.

Traditionally, TDM is called once to process a complete file. But with TDM on Spark, TDM runs on each record. This is the big difference between traditional TDM processing and TDM on Spark processing.

To be efficient, TDM needs a mechanism that detects the start of a record reliably and fast. This is done using a signature . The signature is defined by the users and can be tested against sample files in the wizard of the various TDM on Spark components. It is used to recognize the start of records.

Records and signatures–limitations

At this point, there are some limitations regarding records and signatures:

  • Records are assumed to follow one another, with nothing in between
  • Footers can be an issue

To continue investigating TDM on Spark, view the following articles:

  • Converting Hierarchical Data from File to File
  • Transforming Hierarchical Data from File to File
  • Transforming a Hierarchical Data File to Multiple Output Files