TDM on Spark 6.2 Converting Hierarchical Data from File to File

author
Frédérique Martin Sainte-Agathe
EnrichVersion
6.2
EnrichProdName
Talend Real-Time Big Data Platform
Talend Data Fabric
Talend Big Data
Talend Big Data Platform
task
Design and Development
Installation and Upgrade
Data Quality and Preparation
Administration and Monitoring
Deployment
Data Governance
EnrichPlatform
Talend Studio

TDM on Spark 6.2 Converting Hierarchical Data from File to File

This tutorial covers getting started with Talend Data Mapper on Spark. You will:

  • Convert hierarchical data from XML to Avro, JSON, or flat representation
  • Create a structure from a sample file
  • Test the signature against sample files
  • Run the conversion Job on a cluster
  • Run the conversion Job locally

This tutorial uses Talend Data Fabric 6.2 and a Cloudera CDH 5.4 cluster.

Architecture

The tHConvertFile component is designed to convert hierarchical data from file to file.

At design time, you will provide the:

  • TDM structure (created or imported); for more on TDM structures, see Working with Structures
  • Input and output representations (XML, JSON, COBOL, Avro); available representations are listed on the documentation page, Representations
  • Signature of the input file; the signature is defined by the user and depends on the file type and content (for an XML file, for example, it would typically match on the root element name)

Converting an XML file

This example shows how to convert a hierarchical XML file to a flat, Avro, or JSON file. The file to be converted has the following structure:
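
For reference, a small hierarchical XML file of this kind might look like the following; the element names are purely illustrative:

    <!-- Illustrative sample only; your actual file will differ -->
    <customers>
      <customer id="1">
        <name>Alice Martin</name>
        <order ref="A-100">
          <date>2016-05-12</date>
          <amount>42.50</amount>
        </order>
      </customer>
    </customers>

Converted to JSON with the same structure, the same record would come out roughly as:

    {"customer": {"id": "1", "name": "Alice Martin",
                  "order": {"ref": "A-100", "date": "2016-05-12", "amount": "42.50"}}}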

Create a Spark Big Data batch Job

Talend recommends allocating more memory to your Studio before running TDM on Spark Jobs. In your Studio .ini file, set the maximum amount of memory to at least -Xmx4g.
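
For example, in the -vmargs section of the .ini file, raise the -Xmx value (the other value shown is a typical default and may differ in your installation):

    -vmargs
    -Xms512m
    -Xmx4g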

The first step is to create a Big Data batch Job that runs on Spark. Then you will add a tHConvertFile component:

This component works on its own and cannot be connected to any other component. Select the tHConvertFile component and open the Component view:

In the Component view, several pieces of information are required, including the storage configuration component, input file, output folder, and action on the output folder.

Add a storage component

The storage configuration component is a tHDFSConfiguration component configured to connect to your cluster. If you already have cluster and HDFS metadata available in the repository, click your HDFS metadata and drag it to the designer. On the Component list, select tHDFSConfiguration:

A tHDFSConfiguration component is automatically added to the designer.
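
If you configure the component manually instead, the essential settings are the distribution and the NameNode address. As a rough sketch (all values below are hypothetical):

    Distribution : Cloudera
    Version      : Cloudera CDH 5.4
    NameNode URI : "hdfs://quickstart.cloudera:8020"
    Username     : "talend"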

Configure the execution on Spark

If the Spark configuration is not complete in the Run view, you get the following message:

This means that the Spark configuration in the Run view differs from the metadata used to create the tHDFSConfiguration component. The Studio offers to update the Spark configuration according to the tHDFSConfiguration component. Click OK.

The Spark configuration is now up to date, and your Job is ready to run once the tHConvertFile component is configured.

Configure the tHConvertFile Component view

Now you can go back to the Component view of the tHConvertFile component:

The tHDFSConfiguration component appears on the storage configuration component list.

You also need to enter (or browse to) the path to the input file and the path to the output folder. Because you are using a tHDFSConfiguration component, both the input file and the output folder are assumed to be in HDFS.
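
For example, the two paths might look like this (both locations are hypothetical):

    Input file    : "/user/talend/in/customers.xml"
    Output folder : "/user/talend/out/customers"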

On the Action list, you can select either Create or Overwrite, depending on whether the output folder should be created from scratch or replaced if it already exists.

Your configuration should be similar to the following:

The storage configuration component, input file, and output folder are now configured, but the TDM conversion is not configured yet.

Configure the file conversion

Open the Component Configuration wizard

Click Configure Component:

On the first page of this wizard, you provide the record structure, as well as the input and output representation. The record structure must be available in the repository in Metadata>Hierarchical Mapper>Structures.

Creating the structure

The structure can be created manually or from a sample file. At this point, the structure of the file is not available in the repository, so click Cancel to close the wizard.

In the repository, in Metadata>Hierarchical Mapper>Structures, create a new structure. For this tutorial, we are using the sample file shown at the beginning, so in the New Structure wizard, keep the default option:

The wizard is configured to create the structure from a local XML sample document. When created, the structure opens and appears in the repository: