TDM on Spark 6.2: Transforming a Hierarchical Data File to Multiple Output Files
- Transform an XML hierarchical data file to multiple output XML files
- Create a structure from a sample file or existing structure
- Create a wrapper for multiple output files
- Test the signature against a sample file
- Run the transformation Job on a cluster
This tutorial uses Talend Data Fabric 6.2 and a Cloudera CDH 5.4 cluster.
Multiple output architecture
The tHMapFile component is designed to transform hierarchical data from a file to a single output file or to multiple output files.
At design time, you provide the following:
- Input file representation (for example, XML, JSON, COBOL, Avro)
- Output file representation (for example, XML, JSON, COBOL, Avro); available representations are listed on the documentation page, Representations
- Structures of the input and output files (for more about TDM structures, view the documentation page, Working with Structures)
- Output wrapper structure, which includes the output structures
- Signature of the input file (the signature is defined by the user and depends on the file type and content)
- Map describing the transformation between the input structure and output wrapper structure (for more about maps, view the documentation page, Creating a Map)
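To make the idea of one hierarchical input feeding several outputs more concrete, here is a minimal sketch in plain Python. It is not how tHMapFile works internally and it does not use any Talend API; the sample XML, the element names, and the helper split_by_branch are all hypothetical and are not the tutorial's sample file. The sketch simply treats the root element as a "wrapper" and emits one output document per top-level branch.

```python
import xml.etree.ElementTree as ET

# Hypothetical input document; NOT the sample file used in this tutorial.
SAMPLE = """<company>
  <customers>
    <customer id="1"><name>Ada</name></customer>
    <customer id="2"><name>Grace</name></customer>
  </customers>
  <orders>
    <order id="10" customer="1"/>
  </orders>
</company>"""


def split_by_branch(xml_text):
    """Return {branch_tag: xml_string}, one output document per top-level branch.

    Hypothetical helper for illustration only: the root plays the role of the
    wrapper, and each direct child becomes its own output document.
    """
    root = ET.fromstring(xml_text)
    outputs = {}
    for branch in root:
        # tostring() also serializes trailing whitespace, so strip it.
        outputs[branch.tag] = ET.tostring(branch, encoding="unicode").strip()
    return outputs


outputs = split_by_branch(SAMPLE)
for name, doc in outputs.items():
    print(f"--- {name}.xml ---")
    print(doc)
```

In the Studio, the same separation is described declaratively: the output wrapper structure lists the output structures, and the map says which parts of the input go to which output.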
This example shows how to convert a hierarchical XML file to multiple XML files. The file you will convert has the following structure:
Create a Spark Big Data batch Job
Talend recommends allocating more memory to your Studio before running TDM on Spark Jobs. In your Studio .ini file, set the maximum heap size parameter to -Xmx4g to provide 4 GB of memory.
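If you prefer to script this change rather than edit the file by hand, the following sketch shows one way to do it. It assumes the Studio .ini file follows the usual Eclipse launcher layout, where JVM options appear one per line after a -vmargs line; the sample contents and the helper set_max_heap are hypothetical, so check your own file before applying anything like this. The function works on text only and does not touch any file.

```python
import re


def set_max_heap(ini_text, size="4g"):
    """Return ini_text with the -Xmx option set to the given size.

    Assumption: one option per line, JVM options placed after '-vmargs'
    (the usual Eclipse launcher convention).
    """
    lines = ini_text.splitlines()
    new_flag = f"-Xmx{size}"
    found = False
    for i, line in enumerate(lines):
        # Replace any existing maximum heap setting such as -Xmx1536m.
        if re.fullmatch(r"-Xmx\d+[kKmMgG]?", line.strip()):
            lines[i] = new_flag
            found = True
    if not found:
        # JVM options are only read after -vmargs, so add it if missing.
        if "-vmargs" not in [line.strip() for line in lines]:
            lines.append("-vmargs")
        lines.append(new_flag)
    return "\n".join(lines) + "\n"


# Hypothetical file contents, for illustration only.
example = "-vmargs\n-Xms512m\n-Xmx1536m\n"
updated = set_max_heap(example)
print(updated)
```

Restart the Studio after changing the file so that the new memory setting takes effect.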
The first step is to create a Big Data batch Job that runs on Spark. Then you will add a tHMapFile component:
This component works on its own and cannot be connected to any other component. Select the tHMapFile component and open the Component view:
You need to provide several pieces of information, including the storage configuration component, input file, output folder, and action on the output folder.
Add a storage component
The storage configuration component is a tHDFSConfiguration component configured to connect to your cluster. If you already have cluster and HDFS metadata available in the repository, click your HDFS metadata and drag it to the designer. On the Components list, select tHDFSConfiguration:
A tHDFSConfiguration component is automatically added to the designer.
Configure the execution on Spark
If the Spark configuration is not completed in the Run view, you get the following message:
The Studio gives you the option to update the Spark configuration according to the tHDFSConfiguration component. Click OK to update the Spark configuration:
The Spark configuration is now up to date and ready to run your Job, once the tHMapFile component is configured.
Configure the tHMapFile Component view
Now you can go back to the Component view of the tHMapFile component:
The tHDFSConfiguration component now appears on the storage configuration component list.
You also need to enter (or navigate to) the path to the input file and path to the output folder. If you are using a tHDFSConfiguration component, it is assumed that the input file is stored in HDFS and the output folder is there as well.
On the Action list, choose between Create and Overwrite.
Your configuration should be similar to the following:
Configure the file transformation
Open the Component Configuration wizard
Click Configure Component:
On the first page of this wizard, you provide the record map. It must be available in the repository in Metadata>Hierarchical Mapper>Maps. To create the record map, you first need the input and output structures. The map describes the transformation between the input and output structures.
Creating the input structure
The structure can be created manually or from a sample file. The structure of the file is not available in the repository yet, so click Cancel to close the wizard.
In the repository, in Metadata>Hierarchical Mapper>Structures, create a new structure. For this tutorial, we are using the sample file shown at the beginning, so in the New Structure wizard, we keep the default option: