Scenario 2: Extracting correct and erroneous data from an XML field in a delimited file - 6.3

Talend Open Studio for Big Data Components Reference Guide


This scenario describes a four-component Job that reads XML structures from a delimited file, writes the correct data to an output file and sends the erroneous data to a reject flow.

  1. Drop the following components from the Palette to the design workspace: tFileInputDelimited, tExtractXMLField, tFileOutputDelimited and tLogRow.

    Connect the first three components using Row Main links.

    Connect tExtractXMLField to tLogRow using a Row Reject link.

  2. Double-click tFileInputDelimited to open its Basic settings view and define the component properties.

  3. Select Built-in in the Schema list and fill in the file metadata manually in the corresponding fields.

    Click the three-dot button next to Edit schema to display a dialog box where you can define the structure of your data.

    Click the plus button to add as many columns as needed to your data structure. In this example, we have one column in the schema: xmlStr.

    Click OK to validate your changes and close the dialog box.

    Note

    If you have already stored the schema in the Metadata folder under File delimited, select Repository from the Schema list and click the three-dot button next to the field to display the [Repository Content] dialog box, where you can select the relevant schema from the list. Click OK to close the dialog box and have the fields automatically filled in with the schema metadata.

    For more information about storing schema metadata in the Repository tree view, see Talend Studio User Guide.

  4. In the File Name field, click the three-dot button and browse to the input delimited file you want to process, CustomerDetails_Error in this example.

    This delimited file holds a number of simple XML lines separated by a double carriage return.

    Set the row and field separators used in the input file in the corresponding fields. In this example, the row separator is a double carriage return and the field separator is left empty.

    If needed, set Header, Footer and Limit. None of them is used in this example.

  5. In the design workspace, double-click tExtractXMLField to display its Basic settings view and define the component properties.

  6. Click Sync columns to retrieve the schema from the preceding component. You can click the three-dot button next to Edit schema to view/modify the schema.

    The Column field in the Mapping table will be automatically populated with the defined schema.

  7. In the Xml field list, select the column from which you want to extract the XML data. In this example, the field holding the XML data is called xmlStr.

    In the Loop XPath query field, enter the node of the XML tree on which to loop to retrieve data.

  8. In the design workspace, double-click tFileOutputDelimited to open its Basic settings view and define the component properties.

  9. In the File Name field, define or browse to the output file to which you want to write the correct data, CustomerNames_right.csv in this example.

    Click Sync columns to retrieve the schema of the preceding component. You can click the three-dot button next to Edit schema to view/modify the schema.

  10. In the design workspace, double-click tLogRow to display its Basic settings view and define the component properties.

    Click Sync columns to retrieve the schema of the preceding component. For more information on this component, see tLogRow.

  11. Save your Job and press F6 to execute it.
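The input settings from step 4 can be pictured with a small Python sketch. It is not Talend code: the function name read_xml_rows, the "\n\n" separator (one possible encoding of a double carriage return) and the sample text are all assumptions made for illustration. With an empty field separator, each record is treated as a single xmlStr value.

```python
def read_xml_rows(text, row_separator="\n\n"):
    """Split the file content into rows holding one xmlStr column each.

    Assumption: the row separator is "\n\n"; a file with Windows line
    endings would need "\r\n\r\n" instead.
    """
    rows = []
    for chunk in text.split(row_separator):
        chunk = chunk.strip()
        if chunk:  # skip empty trailing records
            rows.append({"xmlStr": chunk})
    return rows


# Hypothetical file content: two records, the second one is malformed.
sample = (
    "<customer><name>Ann</name></customer>\n\n"
    "<customer><name>Bob</customer>\n"
)
rows = read_xml_rows(sample)
```

Because no field separator is applied, the whole XML string of a record, malformed or not, reaches the next component unchanged in the xmlStr column.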

tExtractXMLField reads the XML field and extracts the client information whose XML structure is correct into the output delimited file, CustomerNames_right.csv. The erroneous data is displayed on the console of the Run view.
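The routing described above can be approximated outside the Studio with a short Python sketch. It only illustrates the idea of a main flow and a reject flow and is not how the component is implemented. The function name, the loop path "customer", the mapping, the errorMessage key and the sample records are hypothetical.

```python
import xml.etree.ElementTree as ET


def extract_xml_field(xml_strings, loop_path, mapping):
    """Return (main_rows, reject_rows) for a list of XML strings.

    loop_path: path of the node to loop on, relative to the root element.
    mapping:   output column name -> path relative to the loop node.
    """
    main_rows, reject_rows = [], []
    for xml_str in xml_strings:
        try:
            root = ET.fromstring(xml_str)
        except ET.ParseError as err:
            # Malformed XML goes to the reject flow with an error message.
            reject_rows.append({"xmlStr": xml_str, "errorMessage": str(err)})
            continue
        for node in root.iterfind(loop_path):
            main_rows.append(
                {col: node.findtext(path) for col, path in mapping.items()}
            )
    return main_rows, reject_rows


# Hypothetical records: the second one has a mismatched closing tag.
records = [
    "<customers><customer><name>Ann</name></customer></customers>",
    "<customers><customer><name>Bob</customer></customers>",
]
main_rows, reject_rows = extract_xml_field(
    records, loop_path="customer", mapping={"name": "name"}
)

for reject in reject_rows:
    print("REJECT:", reject["xmlStr"])  # plays the role of tLogRow
```

In the Job, the rows collected in main_rows correspond to what tFileOutputDelimited writes, and the rows in reject_rows correspond to what tLogRow prints on the console.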