TDM on Spark 6.2: Transforming Hierarchical Data from File to File

Frédérique Martin Sainte-Agathe


This tutorial covers getting started with Talend Data Mapper (TDM) on Spark.

You will:

  • Transform a hierarchical data file from JSON to XML
  • Create a structure from a sample file or existing structure
  • Test the signature against a sample file
  • Run the transformation Job on a cluster

This tutorial uses Talend Data Fabric 6.2 and a Cloudera CDH 5.4 cluster.

Single output architecture

The tHMapFile component is designed to transform hierarchical data from file to file.

At design time, you will provide the:

  • Structures of the input and output files (for more on TDM structures, see the documentation page Working with Structures)
  • Input and output representations (for example XML, JSON, COBOL, or Avro); available representations are listed on the documentation page Representations
  • Signature of the input file (the signature is defined by the user and depends on the file type and content)
  • Map describing the transformation between the input and output structures (for more on maps, see Creating a Map)

Transforming a JSON file

This example shows how to transform a hierarchical JSON file to an XML file. The file to be converted has the following structure:
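
The original tutorial shows the sample file as a screenshot. As a stand-in, a minimal hierarchical JSON file of the kind this tutorial works with might look like the following (all field names here are hypothetical, not taken from the original sample):

    {
      "person": {
        "firstName": "Jane",
        "lastName": "Doe",
        "addresses": [
          { "type": "home", "city": "Paris" },
          { "type": "work", "city": "Nantes" }
        ]
      }
    }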

Create a Spark Big Data batch Job

We recommend allocating more memory to your Studio before running TDM on Spark Jobs. In your Studio .ini file, raise the maximum amount of memory to at least -Xmx4g.
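
For example, assuming a Windows installation where the file is named Talend-Studio-win-x86_64.ini (the exact file name depends on your platform and Studio version), the JVM arguments at the end of the file would be edited to read:

    -vmargs
    -Xms512m
    -Xmx4g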

The first step is to create a Big Data batch Job that runs on Spark. Then you will add a tHMapFile component:

This component works on its own and cannot be connected to any other component. Select the tHMapFile component and open the Component view:

In the Component view, several pieces of information are required, including the storage configuration component, input file, output folder, and action on the output folder.

Add a storage component

The storage configuration component is a tHDFSConfiguration component configured to connect to your cluster. If you already have cluster and HDFS metadata available in the repository, click your HDFS metadata and drag it to the designer. In the Component list, select tHDFSConfiguration:

A tHDFSConfiguration component is automatically added to the designer.

Configure the execution on Spark

If the Spark configuration has not been completed in the Run view, you get the following message:

This means that the Spark configuration in the Run view differs from the metadata used to create the tHDFSConfiguration component. The Studio offers to update the Spark configuration to match the tHDFSConfiguration component. Click OK:

The Spark configuration is now up to date, and your Job will be ready to run once the tHMapFile component is configured.

Configure the tHMapFile Component view

Now you can go back to the Component view of the tHMapFile component:

The tHDFSConfiguration component appears on the storage configuration component list.

You also need to enter (or browse to) the path to the input file and the path to the output folder. Because you are using a tHDFSConfiguration component, both the input file and the output folder are assumed to be stored in HDFS.
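
For example, with hypothetical HDFS locations (adjust to your own cluster layout; in the Component view these paths are entered as quoted strings):

    Input file:    "/user/student/tdm/in/person.json"
    Output folder: "/user/student/tdm/out"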

On the Action list, you can select either Create or Overwrite.

Your configuration should be similar to the following:

The storage configuration component, input file, and output folder are now configured. However, the TDM transformation is not configured yet.

Configure the file transformation

Open the Component Configuration wizard

Click Configure Component:

On the first page of this wizard, you provide the record map. It must be available in the repository in Metadata>Hierarchical Mapper>Maps. Before creating the record map, you need the input and output structures. The map describes the transformation between the input and output structures.

Create the input structure

The structure can be created manually or from a sample file. At this point, the structure of the file is not available in the repository, so click Cancel to close the wizard.

In the repository, in Metadata>Hierarchical Mapper>Structures, create a new structure. For this tutorial, we are using the sample file shown at the beginning, so in the New Structure wizard, keep the default option:

The wizard is configured to create the structure from a local JSON sample document. When created, the structure opens and appears in the repository:
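
Continuing the hypothetical JSON sample shown earlier, the structure inferred from such a document would be a tree along these lines (a sketch of the idea, not the exact Studio rendering):

    person
        firstName   (string)
        lastName    (string)
        addresses   (occurs 0..*)
            type    (string)
            city    (string)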