tXSDValidator - 6.3

Talend Open Studio for Big Data Components Reference Guide

Function

tXSDValidator validates an input XML file or an input XML flow against an XSD file and sends the validation log to the defined output.

Purpose

Helps control the data and structure quality of the file or flow to be processed.

tXSDValidator Properties

Component family

XML

 

Basic settings

Mode

Select the validation mode from the drop-down list.

  • File Mode: to validate an input file.

  • Flow Mode: to validate an input flow.

 

Schema and Edit schema

A schema is a row description. It defines the number of fields to be processed and passed on to the next component.

Note that when File Mode is selected from the Mode list, the schema of this component is read-only and it contains standard information regarding the file validation.

 

XSD file

Specify the path to the XSD reference file. An HTTP URL is also supported, for example, http://localhost:8080/book.xsd.

This field is available only when File Mode is selected from the Mode drop-down list.

 

XML file

Specify the path to the XML file to be validated.

This field is available only when File Mode is selected from the Mode drop-down list.

 

If XML is valid, display

Type in the message to be displayed on the console if the XML file is valid.

This field is available only when File Mode is selected from the Mode drop-down list.

 

If XML is invalid, display

Type in the message to be displayed on the console if the XML file is invalid.

This field is available only when File Mode is selected from the Mode drop-down list.

 

Print to console

Select this check box to display the validation message on the console.

This check box is available only when File Mode is selected from the Mode drop-down list.

 

Allocate

Click the [+] button to add as many rows as needed, and in each row set the value of the following columns:

  • Input Column: click the cell and select a column to be validated.

  • XSD File: enter the path to the corresponding XSD reference file.

This table is available only when Flow Mode is selected from the Mode drop-down list.

Advanced settings

Enable Features

Click the [+] button to add as many rows as needed, and in each row enter the feature to be enabled on the underlying parser between double quotation marks, for example, "http://apache.org/xml/features/honour-all-schemaLocations".

For more information about the features, see https://xerces.apache.org/xerces2-j/features.html.
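As a rough illustration of what enabling a parser feature means, the following minimal sketch sets a feature on a standard JAXP SchemaFactory. It is not the code that the component generates: the class name ParserFeatureSketch and the helper tryEnable are hypothetical, and whether a given feature URI is recognized depends on the parser implementation available at run time.

```java
import javax.xml.XMLConstants;
import javax.xml.validation.SchemaFactory;
import org.xml.sax.SAXNotRecognizedException;
import org.xml.sax.SAXNotSupportedException;

public class ParserFeatureSketch {

    /** Tries to switch a feature on; returns false if this parser rejects it. */
    public static boolean tryEnable(SchemaFactory factory, String featureUri) {
        try {
            factory.setFeature(featureUri, true);
            return true;
        } catch (SAXNotRecognizedException | SAXNotSupportedException e) {
            // The underlying parser does not know or does not allow this feature.
            return false;
        }
    }

    public static void main(String[] args) {
        SchemaFactory factory =
                SchemaFactory.newInstance(XMLConstants.W3C_XML_SCHEMA_NS_URI);
        // Feature URI taken from the example in the Enable Features description.
        String feature = "http://apache.org/xml/features/honour-all-schemaLocations";
        System.out.println(feature + " enabled: " + tryEnable(factory, feature));
    }
}
```

The helper reports a rejected feature instead of failing, which makes it easy to check a feature URI before adding it to the Enable Features table.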

 

Encoding

Enter the encoding type between double quotation marks, for example, "UTF-8".

tStatCatcher Statistics

Select this check box to gather the Job processing metadata at the Job level as well as at each component level.

Global Variables

ERROR_MESSAGE: the error message generated by the component when an error occurs. This is an After variable and it returns a string. If the component has a Die on error check box, this variable functions only when that check box is cleared.

DIFFERENCE: the result of the validation. This is a Flow variable and it returns a string.

VALID: the validation result. This is a Flow variable and it returns a boolean.

XSD_ERROR_MESSAGE: the XSD error message generated by the component. This is a Flow variable and it returns a string.

A Flow variable functions during the execution of a component while an After variable functions after the execution of the component.

To fill a field or expression with a variable, press Ctrl + Space to open the variable list and select the variable to use.

For further information about variables, see the Talend Studio User Guide.
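In Job code, component variables are commonly read from a map named globalMap, under keys of the form component name, underscore, variable name. The sketch below imitates that lookup with an ordinary java.util.Map so that it runs outside the Studio. The component name tXSDValidator_1, the class GlobalVariableSketch, and the helper describe are assumptions made for illustration only.

```java
import java.util.HashMap;
import java.util.Map;

public class GlobalVariableSketch {

    /** Summarizes the VALID and XSD_ERROR_MESSAGE variables of one component. */
    public static String describe(Map<String, Object> globalMap, String componentName) {
        // VALID is documented as a boolean, XSD_ERROR_MESSAGE as a string.
        Boolean valid = (Boolean) globalMap.get(componentName + "_VALID");
        String xsdError = (String) globalMap.get(componentName + "_XSD_ERROR_MESSAGE");
        if (Boolean.TRUE.equals(valid)) {
            return "valid";
        }
        return "invalid: " + xsdError;
    }

    public static void main(String[] args) {
        // Stand-in for the map that a running Job would provide (assumption).
        Map<String, Object> globalMap = new HashMap<>();
        globalMap.put("tXSDValidator_1_VALID", Boolean.FALSE);
        globalMap.put("tXSDValidator_1_XSD_ERROR_MESSAGE", "sample error text");
        System.out.println(describe(globalMap, "tXSDValidator_1"));
    }
}
```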

Usage

When File Mode is selected, this component can be used as a standalone component but it is usually linked to an output component to gather the log data.

Limitation

n/a

Scenario: Validating data flows against an XSD file

This scenario describes a Job that validates an XML column in the input file ShipOrder.csv against the XSD reference file ShipOrder.xsd. The Job writes valid rows to the delimited file ShipOrder_Valid.csv, and invalid rows together with their error messages to the delimited file ShipOrder_Invalid.csv. For a similar use case that validates an XML file, see Scenario: Validating XML files.

The input file ShipOrder.csv, which includes the XML column ShipOrder to be validated, has the following content:

ID;ShipOrder
000001;<shiporder orderid="000001"><orderperson>George Bush</orderperson><shipto><name>John Adams</name><address>Oxford Street</address></shipto><item><title>Empire Burlesque</title><note>Special Edition</note><quantity>1</quantity><price>10.90</price></item></shiporder>
000002;<shiporder orderid="000002"><orderperson>Judy Liu</orderperson><shipto><name>Jack Liu</name><address>Wangfujing Street</address></shipto><item><title>Hide Your Heart</title><quantity>1</quantity><price>9.90</price></item></shiporder>
000003;<shiporder><orderperson>Peter Qian</orderperson><shipto><name>Thomas Wang</name><address>Wangfujing Street</address></shipto><item><title>The Power of Habit</title><quantity>1</quantity><price>8.99</price></item></shiporder>

The content of the XSD reference file ShipOrder.xsd is as follows:

<?xml version="1.0" encoding="ISO-8859-1" ?>
<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema">
 <xs:element name="shiporder">
  <xs:complexType>
   <xs:sequence>
    <xs:element name="orderperson" type="xs:string"/>
    <xs:element name="shipto">
     <xs:complexType>
      <xs:sequence>
       <xs:element name="name" type="xs:string"/>
       <xs:element name="address" type="xs:string"/>
      </xs:sequence>
     </xs:complexType>
    </xs:element>
    <xs:element name="item" maxOccurs="unbounded">
     <xs:complexType>
      <xs:sequence>
       <xs:element name="title" type="xs:string"/>
       <xs:element name="note" type="xs:string" minOccurs="0"/>
       <xs:element name="quantity" type="xs:positiveInteger"/>
       <xs:element name="price" type="xs:decimal"/>
      </xs:sequence>
     </xs:complexType>
    </xs:element>
   </xs:sequence>
   <xs:attribute name="orderid" type="xs:string" use="required"/>
  </xs:complexType>
 </xs:element>
</xs:schema>
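To see how the three sample rows relate to this schema, the following minimal sketch checks them with the standard javax.xml.validation API. It is an illustration outside the Studio, not the code that the Job runs. The class name ShipOrderXsdSketch and its members are hypothetical, and the schema is embedded as a string with single-quoted attributes and without the XML declaration. Under this schema, the first two rows are valid (the note element is optional), while the third row lacks the required orderid attribute.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import javax.xml.XMLConstants;
import javax.xml.transform.stream.StreamSource;
import javax.xml.validation.Schema;
import javax.xml.validation.SchemaFactory;
import org.xml.sax.SAXException;

public class ShipOrderXsdSketch {

    // Content of ShipOrder.xsd, compacted into a Java string.
    public static final String XSD =
          "<xs:schema xmlns:xs='http://www.w3.org/2001/XMLSchema'>"
        + "<xs:element name='shiporder'><xs:complexType><xs:sequence>"
        + "<xs:element name='orderperson' type='xs:string'/>"
        + "<xs:element name='shipto'><xs:complexType><xs:sequence>"
        + "<xs:element name='name' type='xs:string'/>"
        + "<xs:element name='address' type='xs:string'/>"
        + "</xs:sequence></xs:complexType></xs:element>"
        + "<xs:element name='item' maxOccurs='unbounded'><xs:complexType><xs:sequence>"
        + "<xs:element name='title' type='xs:string'/>"
        + "<xs:element name='note' type='xs:string' minOccurs='0'/>"
        + "<xs:element name='quantity' type='xs:positiveInteger'/>"
        + "<xs:element name='price' type='xs:decimal'/>"
        + "</xs:sequence></xs:complexType></xs:element>"
        + "</xs:sequence>"
        + "<xs:attribute name='orderid' type='xs:string' use='required'/>"
        + "</xs:complexType></xs:element></xs:schema>";

    // The ShipOrder column of the three rows in ShipOrder.csv.
    public static final String ROW_1 = "<shiporder orderid='000001'>"
        + "<orderperson>George Bush</orderperson>"
        + "<shipto><name>John Adams</name><address>Oxford Street</address></shipto>"
        + "<item><title>Empire Burlesque</title><note>Special Edition</note>"
        + "<quantity>1</quantity><price>10.90</price></item></shiporder>";
    public static final String ROW_2 = "<shiporder orderid='000002'>"
        + "<orderperson>Judy Liu</orderperson>"
        + "<shipto><name>Jack Liu</name><address>Wangfujing Street</address></shipto>"
        + "<item><title>Hide Your Heart</title>"
        + "<quantity>1</quantity><price>9.90</price></item></shiporder>";
    public static final String ROW_3 = "<shiporder>"
        + "<orderperson>Peter Qian</orderperson>"
        + "<shipto><name>Thomas Wang</name><address>Wangfujing Street</address></shipto>"
        + "<item><title>The Power of Habit</title>"
        + "<quantity>1</quantity><price>8.99</price></item></shiporder>";

    /** Compiles the embedded XSD into a reusable Schema object. */
    public static Schema loadSchema() {
        try {
            SchemaFactory factory =
                    SchemaFactory.newInstance(XMLConstants.W3C_XML_SCHEMA_NS_URI);
            return factory.newSchema(new StreamSource(new StringReader(XSD)));
        } catch (SAXException e) {
            throw new IllegalStateException("The embedded XSD could not be compiled", e);
        }
    }

    /** Returns null if xml satisfies the schema, otherwise the parser's error message. */
    public static String validate(Schema schema, String xml) {
        try {
            schema.newValidator().validate(new StreamSource(new StringReader(xml)));
            return null;
        } catch (SAXException e) {
            return e.getMessage();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        Schema schema = loadSchema();
        String[] rows = {ROW_1, ROW_2, ROW_3};
        for (int i = 0; i < rows.length; i++) {
            String error = validate(schema, rows[i]);
            System.out.println("row " + (i + 1) + ": "
                    + (error == null ? "valid" : "invalid, " + error));
        }
    }
}
```

Returning the error message rather than throwing mirrors the idea of routing each row either to a valid output or to a reject output that carries an error description.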

Adding and linking components

  1. Create a new Job and add a tFileInputDelimited component, a tXSDValidator component, and two tFileOutputDelimited components by typing their names in the design workspace or dropping them from the Palette.

  2. Double-click the tXSDValidator component to open its Basic settings view and select Flow Mode from the Mode drop-down list.

  3. Link the tFileInputDelimited component to the tXSDValidator component using a Row > Main connection.

  4. Link the tXSDValidator component to the first tFileOutputDelimited component using a Row > Main connection to output valid rows.

  5. Link the tXSDValidator component to the second tFileOutputDelimited component using a Row > Rejects connection to output invalid rows.

Configuring the components

  1. Double-click the tFileInputDelimited component to open its Basic settings view on the Component tab.

  2. In the File name/Stream field, specify the path to the input file. In this example, it is E:/ShipOrder.csv.

    In the Header field, enter 1 to skip the first header row of the input file.

    Click the [...] button next to Edit schema and define the schema by adding two columns, ID and ShipOrder, of String type.

  3. Double-click the tXSDValidator component to open its Basic settings view on the Component tab.

  4. Click the Sync columns button to retrieve the schema from the preceding tFileInputDelimited component, and in the pop-up dialog box, click Yes to propagate the schema to the two tFileOutputDelimited components.

    Add a row in the Allocate table by clicking the [+] button. Then click the Input Column cell and select the XML column to be validated, ShipOrder, from the drop-down list. In the XSD File cell, enter the path to the XSD reference file, E:/ShipOrder.xsd in this example.

  5. Double-click the first tFileOutputDelimited component to open its Basic settings view on the Component tab.

  6. In the File Name field, specify the path to the output file that will store valid rows. In this example, it is E:/ShipOrder_Valid.csv.

    Select the Include Header check box to include column headers in the output file.

  7. Double-click the second tFileOutputDelimited component to open its Basic settings view on the Component tab.

  8. Click the [...] button next to Edit schema to view its schema.

    You can see that, in addition to the two propagated columns, an extra column named errorMessage has been added automatically to the schema. This column holds the error information for invalid rows.

  9. In the File Name field, specify the path to the output file that will store invalid rows and error messages. In this example, it is E:/ShipOrder_Invalid.csv.

    Select the Include Header check box to include column headers in the output file.

Saving and executing the Job

  1. Press Ctrl+S to save the Job.

  2. Press F6 to run the Job.