Scenario: Reading a multi-structure XML file - 6.1

Talend Open Studio for Big Data Components Reference Guide

The following scenario describes a Job which reads a multi-structure XML file, extracts the desired fields and displays them on the console.

Designing the Job

  1. Drop a tFileInputMSXML component from the Palette onto the design workspace and double-click the component to open its Basic settings view in the Component tab.

  2. Browse to the XML file you want to process. In this example, it is D:/Input/multischema_xml.xml, which contains the following data:

    <root>
        <toy>Cat</toy>
        <record>We Belong Together</record>
        <book>As You Like It</book>
        <book>All's Well That Ends Well</book>
        <record>When You Believe</record>
        <toy>Dog</toy>
    </root>

  3. In the Root XPath query field, enter the root of the XML tree on which the query is based. In this example, it is "/root".

  4. Select the Enable XPath in column "Schema XPath loop" but lose the order check box.

    In this example, to extract the desired fields, you need to define an XPath in the Schema XPath loop field of the Outputs table for each output flow. With this option selected, the order of the data in the source XML file is not preserved.

  5. Click the plus button to add one line per output schema to the Outputs table. In this example, the output schemas are record and book.

  6. In the Outputs table, click in the Schema cell and then click the three-dot button to open a dialog box where you can define the schema name.

    Enter a name for the output schema and click OK to close the dialog box.

  7. The tFileInputMSXML schema editor appears.

    Define the schema according to your needs.

  8. Repeat steps 6 and 7 to define the output schema record.

  9. In the Schema XPath loop cell, enter the node of the XML tree on which the loop is based. In this example, enter "/book" for the book schema and "/record" for the record schema.

  10. In the XPath Queries cell, enter the fields to be extracted from the structured XML input. In this example, enter the XPath query ".", which selects the loop node itself.

  11. In the design workspace, drop two tLogRow components from the Palette and connect tFileInputMSXML to tLogRow1 and tLogRow2 using the book and record links respectively.

    Rename the two tLogRow components as book and record respectively.
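
To make the role of each setting concrete, the following Python sketch mimics the configuration above outside of Talend Studio. It is not the code the Job generates. The function name read_multi_structure and the OUTPUTS dictionary are hypothetical stand-ins for the component and its Outputs table, and the sketch assumes one column per schema, filled by the XPath query ".".

```python
import xml.etree.ElementTree as ET

# Sample data from step 2 of this scenario.
XML_DATA = """<root>
    <toy>Cat</toy>
    <record>We Belong Together</record>
    <book>As You Like It</book>
    <book>All's Well That Ends Well</book>
    <record>When You Believe</record>
    <toy>Dog</toy>
</root>"""

# Hypothetical stand-in for the Outputs table:
# schema name -> (Schema XPath loop, XPath query)
OUTPUTS = {
    "book": ("/book", "."),
    "record": ("/record", "."),
}


def read_multi_structure(xml_text, root_xpath, outputs):
    """Return {schema name: [extracted values]} for each output schema."""
    root = ET.fromstring(xml_text)
    # ElementTree evaluates paths relative to the root element,
    # so the Root XPath query is only checked against the root tag here.
    if root_xpath != "/" + root.tag:
        raise ValueError("root XPath does not match the document root")

    rows = {}
    for schema, (loop_xpath, query) in outputs.items():
        # "/book" under the root "/root" becomes the relative path "./book".
        loop_nodes = root.findall("." + loop_xpath)
        # The query "." selects the loop node itself, so its text is returned.
        rows[schema] = [node.findtext(query) for node in loop_nodes]
    return rows


if __name__ == "__main__":
    for schema, values in read_multi_structure(XML_DATA, "/root", OUTPUTS).items():
        for value in values:
            print(schema, "|", value)
```

The toy elements are never returned because no output schema loops on them, which mirrors how only the book and record flows are defined in the Outputs table.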

Saving and executing the Job

  1. Press Ctrl+S to save your Job.

  2. Execute the Job by pressing F6 or clicking Run on the Run tab.

    The multi-structure XML file is read row by row and the extracted fields are displayed on the console. The first two fields are for the book schema, and the last two fields are for the record schema.
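
The grouping described above differs from the order of the elements in the source file. The Python sketch below is only an illustration of that difference, under the assumption that each schema is processed in turn. It does not reproduce the tLogRow console format, and the variable names are hypothetical.

```python
import xml.etree.ElementTree as ET

# Sample data from step 2 of this scenario.
XML_DATA = """<root>
    <toy>Cat</toy>
    <record>We Belong Together</record>
    <book>As You Like It</book>
    <book>All's Well That Ends Well</book>
    <record>When You Believe</record>
    <toy>Dog</toy>
</root>"""

root = ET.fromstring(XML_DATA)
schemas = ("book", "record")

# Order of the matching elements as they appear in the source file.
document_order = [child.tag for child in root if child.tag in schemas]

# Order obtained when each schema is looped over in turn,
# which is the grouping described for this Job.
grouped_order = [node.tag for tag in schemas for node in root.findall(tag)]

print(document_order)
print(grouped_order)
```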