TDM on Spark 6.2: Converting Hierarchical Data from File to File
This tutorial covers getting started using Talend Data Mapper on Spark. You will:
- Convert hierarchical data from XML to Avro, JSON, or flat representation
- Create a structure from a sample file
- Test the signature against sample files
- Run the conversion Job on a cluster
- Run the conversion Job locally
This tutorial uses Talend Data Fabric 6.2 and a Cloudera CDH 5.4 cluster.
Architecture
The tHConvertFile component is designed to convert hierarchical data from file to file.
At design time, you will provide the:
- TDM structure (created or imported); for more on TDM structures, view Working with Structures
- Input and output representation (XML, JSON, COBOL, Avro); available representations are listed on the documentation page, Representations
- Signature of the input file (the signature is defined by the user and depends on the file type and content)
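Talend performs this conversion inside the tHConvertFile component at scale on Spark, so no hand-written code is needed. Purely as an illustration of what converting hierarchical XML data to a JSON representation means, here is a minimal Python sketch (the sample document and the `element_to_dict` helper are hypothetical, not part of Talend or this tutorial's sample file):

```python
import json
import xml.etree.ElementTree as ET

# Hypothetical hierarchical sample; the tutorial's actual file is shown in Studio.
XML = """
<customers>
  <customer id="1">
    <name>Alice</name>
    <orders>
      <order><sku>A-100</sku><qty>2</qty></order>
    </orders>
  </customer>
</customers>
"""

def element_to_dict(elem):
    """Recursively turn an XML element into nested dicts/lists."""
    node = dict(elem.attrib)          # attributes become keys
    children = list(elem)
    if not children:                  # leaf element: keep its text
        text = (elem.text or "").strip()
        return text if not node else {**node, "text": text}
    for child in children:            # repeated child tags become lists
        node.setdefault(child.tag, []).append(element_to_dict(child))
    return node

root = ET.fromstring(XML)
print(json.dumps({root.tag: element_to_dict(root)}, indent=2))
```

A flat representation would instead collapse each repeated leaf path (such as `customer/orders/order/sku`) into columns of a record, which is what the flat output option produces.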
This example shows how to convert a hierarchical XML file to a flat, Avro, or JSON file. The file to be converted has the following structure:
Talend recommends allocating more memory to your Studio before running TDM on Spark Jobs. In your Studio .ini file, increase the maximum heap size to at least 4 GB (-Xmx4g).
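For example, the VM arguments section of the Studio .ini file would end up looking something like the following (the other values shown here are illustrative defaults and may differ in your installation; only the -Xmx line needs to change):

```ini
-vmargs
-Xms512m
-Xmx4g
```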
The first step is to create a Big Data batch Job that runs on Spark. Then you will add a tHConvertFile component:
This component works on its own and cannot be connected to any other component. Select the tHConvertFile component and open the Component view:
In the Component view, several pieces of information are required, including the storage configuration component, input file, output folder, and action on the output folder.
Add a storage component
The storage configuration component is a tHDFSConfiguration component configured to connect to your cluster. If you already have cluster and HDFS metadata available in the repository, click your HDFS metadata and drag it to the designer. On the Component list, select tHDFSConfiguration:
A tHDFSConfiguration component is automatically added to the designer.
Configure the execution on Spark
If a warning appears, it means the Spark configuration in the Run view differs from the metadata used to create the tHDFSConfiguration component. The Studio offers to update the Spark configuration to match the tHDFSConfiguration component. Click OK.
The Spark configuration is now up to date and ready to run your Job, once the tHConvertFile component is configured.
Configure the tHConvertFile Component view
Now you can go back to the Component view of the tHConvertFile component:
The tHDFSConfiguration component appears on the storage configuration component list.
You also need to enter (or navigate to) the path to the input file and the path to the output folder. When you use a tHDFSConfiguration component, both the input file and the output folder are assumed to be in HDFS.
On the Action list, you can select either Create or Overwrite.
Your configuration should be similar to the following:
The storage configuration component, input file, and output folder are now configured, but the TDM conversion is not configured yet.
Configure the file conversion
Open the Component Configuration wizard
Click Configure Component:
On the first page of this wizard, you provide the record structure, as well as the input and output representations. The record structure must be available in the repository under Metadata > Hierarchical Mapper > Structures.
Creating the structure
The structure can be created manually or from a sample file. At this point, the structure of the file is not available in the repository, so click Cancel to close the wizard.
In the repository, under Metadata > Hierarchical Mapper > Structures, create a new structure. This tutorial uses the sample file shown at the beginning, so in the New Structure wizard, keep the default option:
The wizard is configured to create the structure from a local XML sample document. When created, the structure opens and appears in the repository: