Scenario: Validating data flows against an XSD file - 6.3

Talend Open Studio for Big Data Components Reference Guide

This scenario describes a Job that validates an XML column in the input file ShipOrder.csv against the XSD reference file ShipOrder.xsd. Valid rows are written to the delimited file ShipOrder_Valid.csv, and invalid rows, together with their error messages, are written to the delimited file ShipOrder_Invalid.csv. For a similar use case that validates an XML file, see Scenario: Validating XML files.

The content of the input file ShipOrder.csv that includes the XML column ShipOrder to be validated is as follows:

ID;ShipOrder
000001;<shiporder orderid="000001"><orderperson>George Bush</orderperson><shipto><name>John Adams</name><address>Oxford Street</address></shipto><item><title>Empire Burlesque</title><note>Special Edition</note><quantity>1</quantity><price>10.90</price></item></shiporder>
000002;<shiporder orderid="000002"><orderperson>Judy Liu</orderperson><shipto><name>Jack Liu</name><address>Wangfujing Street</address></shipto><item><title>Hide Your Heart</title><quantity>1</quantity><price>9.90</price></item></shiporder>
000003;<shiporder><orderperson>Peter Qian</orderperson><shipto><name>Thomas Wang</name><address>Wangfujing Street</address></shipto><item><title>The Power of Habit</title><quantity>1</quantity><price>8.99</price></item></shiporder>
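To make the layout of the input file concrete, the following Java sketch splits such lines into an ID value and the raw XML held in the ShipOrder column. It is only an illustration, not the way tFileInputDelimited works internally: the class name ShipOrderRows and the method parse are hypothetical, the sketch assumes that only the first semicolon separates the two columns, and the XML in main is shortened for readability.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Illustrative reader for the ID;ShipOrder layout (hypothetical helper, not Talend code). */
final class ShipOrderRows {

    /**
     * Returns one {id, shipOrderXml} pair per data line.
     * The first headerRows lines are skipped, like Header = 1 in the scenario.
     * Assumption: only the first ';' separates the columns, so a ';'
     * inside the XML text stays in the XML column.
     */
    static List<String[]> parse(List<String> lines, int headerRows) {
        List<String[]> rows = new ArrayList<>();
        for (int i = headerRows; i < lines.size(); i++) {
            String line = lines.get(i);
            if (line.trim().isEmpty()) {
                continue; // ignore blank lines
            }
            String[] columns = line.split(";", 2);
            if (columns.length != 2) {
                throw new IllegalArgumentException("No ';' separator on line " + (i + 1));
            }
            rows.add(columns);
        }
        return rows;
    }

    public static void main(String[] args) {
        // Shortened sample data; the real file holds the full shiporder documents.
        List<String> lines = Arrays.asList(
            "ID;ShipOrder",
            "000003;<shiporder><orderperson>Peter Qian</orderperson></shiporder>");
        for (String[] row : parse(lines, 1)) {
            System.out.println(row[0] + " -> " + row[1]);
        }
    }
}
```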

The content of the XSD reference file ShipOrder.xsd is as follows:

<?xml version="1.0" encoding="ISO-8859-1" ?>
<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema">
 <xs:element name="shiporder">
  <xs:complexType>
   <xs:sequence>
    <xs:element name="orderperson" type="xs:string"/>
    <xs:element name="shipto">
     <xs:complexType>
      <xs:sequence>
       <xs:element name="name" type="xs:string"/>
       <xs:element name="address" type="xs:string"/>
      </xs:sequence>
     </xs:complexType>
    </xs:element>
    <xs:element name="item" maxOccurs="unbounded">
     <xs:complexType>
      <xs:sequence>
       <xs:element name="title" type="xs:string"/>
       <xs:element name="note" type="xs:string" minOccurs="0"/>
       <xs:element name="quantity" type="xs:positiveInteger"/>
       <xs:element name="price" type="xs:decimal"/>
      </xs:sequence>
     </xs:complexType>
    </xs:element>
   </xs:sequence>
   <xs:attribute name="orderid" type="xs:string" use="required"/>
  </xs:complexType>
 </xs:element>
</xs:schema>
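Read against this schema, the first two sample rows should pass validation (the note element is optional), while the third row lacks the required orderid attribute and should be rejected. The Java sketch below checks a single XML string against the schema with the standard javax.xml.validation API. It is a minimal illustration under stated assumptions, not the code that tXSDValidator generates: the class name ShipOrderXsdCheck and its methods are hypothetical, and the schema is embedded as a condensed string so that the example is self-contained.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import javax.xml.XMLConstants;
import javax.xml.transform.stream.StreamSource;
import javax.xml.validation.Schema;
import javax.xml.validation.SchemaFactory;
import org.xml.sax.SAXException;

/** Illustrative XSD check for one XML string (hypothetical helper, not Talend code). */
final class ShipOrderXsdCheck {

    /** Same declarations as ShipOrder.xsd, condensed; the XML declaration is omitted. */
    static final String XSD =
        "<xs:schema xmlns:xs='http://www.w3.org/2001/XMLSchema'>"
      + "<xs:element name='shiporder'><xs:complexType><xs:sequence>"
      + "<xs:element name='orderperson' type='xs:string'/>"
      + "<xs:element name='shipto'><xs:complexType><xs:sequence>"
      + "<xs:element name='name' type='xs:string'/>"
      + "<xs:element name='address' type='xs:string'/>"
      + "</xs:sequence></xs:complexType></xs:element>"
      + "<xs:element name='item' maxOccurs='unbounded'><xs:complexType><xs:sequence>"
      + "<xs:element name='title' type='xs:string'/>"
      + "<xs:element name='note' type='xs:string' minOccurs='0'/>"
      + "<xs:element name='quantity' type='xs:positiveInteger'/>"
      + "<xs:element name='price' type='xs:decimal'/>"
      + "</xs:sequence></xs:complexType></xs:element>"
      + "</xs:sequence>"
      + "<xs:attribute name='orderid' type='xs:string' use='required'/>"
      + "</xs:complexType></xs:element></xs:schema>";

    /** Compiles the schema text once so that it can be reused for every row. */
    static Schema compile(String xsdText) {
        try {
            SchemaFactory factory =
                SchemaFactory.newInstance(XMLConstants.W3C_XML_SCHEMA_NS_URI);
            return factory.newSchema(new StreamSource(new StringReader(xsdText)));
        } catch (SAXException e) {
            throw new IllegalArgumentException("XSD could not be compiled", e);
        }
    }

    /** Returns null when xmlText is valid, otherwise the parser's error message. */
    static String validate(Schema schema, String xmlText) {
        try {
            schema.newValidator().validate(new StreamSource(new StringReader(xmlText)));
            return null;
        } catch (SAXException e) {
            return String.valueOf(e.getMessage());
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        Schema schema = compile(XSD);
        // Row 000003 from the sample input: it has no orderid attribute.
        String row3 = "<shiporder><orderperson>Peter Qian</orderperson>"
            + "<shipto><name>Thomas Wang</name><address>Wangfujing Street</address></shipto>"
            + "<item><title>The Power of Habit</title><quantity>1</quantity>"
            + "<price>8.99</price></item></shiporder>";
        String error = validate(schema, row3);
        System.out.println(error == null ? "valid" : "invalid: " + error);
    }
}
```

The exact wording of the error message depends on the XML parser in use, so the sketch only distinguishes between null (valid) and a non-null message (invalid).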

Adding and linking components

  1. Create a new Job and add a tFileInputDelimited component, a tXSDValidator component, and two tFileOutputDelimited components by typing their names in the design workspace or dropping them from the Palette.

  2. Double-click the tXSDValidator component to open its Basic settings view and select Flow Mode from the Mode drop-down list.

  3. Link the tFileInputDelimited component to the tXSDValidator component using a Row > Main connection.

  4. Link the tXSDValidator component to the first tFileOutputDelimited component using a Row > Main connection to output valid rows.

  5. Link the tXSDValidator component to the second tFileOutputDelimited component using a Row > Rejects connection to output invalid rows.

Configuring the components

  1. Double-click the tFileInputDelimited component to open its Basic settings view on the Component tab.

  2. In the File name/Stream field, specify the path to the input file. In this example, it is E:/ShipOrder.csv.

    In the Header field, enter 1 to skip the header row of the input file.

    Click the [...] button next to Edit schema and define the schema by adding two columns ID and ShipOrder of String type.

  3. Double-click the tXSDValidator component to open its Basic settings view on the Component tab.

  4. Click the Sync columns button to retrieve the schema from the preceding tFileInputDelimited component, and in the pop-up dialog box, click Yes to propagate the schema to the two tFileOutputDelimited components.

    Add a row to the Allocate table by clicking the [+] button. Click the Input Column cell and select the XML column to be validated, ShipOrder, from the drop-down list. Then, in the XSD File cell, enter the path to the XSD reference file, E:/ShipOrder.xsd in this example.

  5. Double-click the first tFileOutputDelimited component to open its Basic settings view on the Component tab.

  6. In the File Name field, specify the path to the output file that will store valid rows. In this example, it is E:/ShipOrder_Valid.csv.

    Select the Include Header check box to include column headers in the output file.

  7. Double-click the second tFileOutputDelimited component to open its Basic settings view on the Component tab.

  8. Click the [...] button next to Edit schema to view its schema.

    In addition to the two propagated columns, the schema contains an extra column, errorMessage, which is added automatically and holds the error information for invalid rows.

  9. In the File Name field, specify the path to the output file that will store invalid rows and error messages. In this example, it is E:/ShipOrder_Invalid.csv.

    Select the Include Header check box to include column headers in the output file.
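The configuration above sends each row either to the file of valid rows or to the file of invalid rows, the latter with the extra errorMessage column. The Java sketch below mimics that routing outside the Studio. It is only an illustration: the class RejectRouting, its Outputs holder, and the route method are hypothetical, the check function stands in for the real XSD validation (it returns null for a valid row and a message otherwise), and the position of errorMessage after the two propagated columns is an assumption.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.function.Function;

/** Illustrative split of rows into a main flow and a rejects flow (not Talend code). */
final class RejectRouting {

    /** Lines destined for the two delimited output files, header line included. */
    static final class Outputs {
        final List<String> valid = new ArrayList<>();
        final List<String> invalid = new ArrayList<>();
    }

    /**
     * Routes each {id, xml} row according to the result of check:
     * a null result means the row is valid, any other value is its error message.
     */
    static Outputs route(List<String[]> rows, Function<String, String> check) {
        Outputs out = new Outputs();
        out.valid.add("ID;ShipOrder");                // Include Header on the valid file
        out.invalid.add("ID;ShipOrder;errorMessage"); // assumed column order for rejects
        for (String[] row : rows) {
            String error = check.apply(row[1]);
            if (error == null) {
                out.valid.add(row[0] + ";" + row[1]);
            } else {
                out.invalid.add(row[0] + ";" + row[1] + ";" + error);
            }
        }
        return out;
    }

    public static void main(String[] args) {
        List<String[]> rows = Arrays.asList(
            new String[] {"000002", "<shiporder orderid=\"000002\"/>"},
            new String[] {"000003", "<shiporder/>"});
        // Placeholder check, NOT real XSD validation: it only looks for the attribute name.
        Function<String, String> check =
            xml -> xml.contains("orderid=") ? null : "orderid attribute is missing";
        Outputs out = route(rows, check);
        out.valid.forEach(System.out::println);
        out.invalid.forEach(System.out::println);
    }
}
```

Passing the check as a function keeps the routing independent of how a row is validated, so a real schema-based check can be plugged in without changing the routing code.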

Saving and executing the Job

  1. Press Ctrl+S to save the Job.

  2. Press F6 to run the Job.