Setting up XML metadata for an input file

Talend Data Management Platform Studio User Guide

EnrichVersion: 6.2
EnrichProdName: Talend Data Management Platform
task: Data Quality and Preparation, Design and Development
EnrichPlatform: Talend Studio

This section describes how to define a file connection and upload an XML schema for an input file. To define and upload an output file, see Setting up XML metadata for an output file.

Defining the general properties

In this step, the general metadata properties such as the Name, Purpose and Description are set.

  1. In the file metadata setup wizard, fill in the Name field, which is mandatory, and the Purpose and Description fields if you choose to do so. The information you provide in the Description field will appear as a tooltip when you move your mouse pointer over the file connection.

    Note

    When you enter the general properties of the metadata to be created, you need to define the type of connection as either input or output. It is therefore advisable to enter information that will help you distinguish between your input and output schemas.

  2. If needed, set the version and status in the Version and Status fields respectively. You can also manage the version and status of a Repository item in the [Project Settings] dialog box. For more information, see Version management and Status management respectively.

  3. If needed, click the Select button next to the Path field to select a folder under the File XML node to hold your newly created file connection. Note that you cannot select a folder if you are editing an existing connection, but you can drag and drop it to a new folder whenever you want.

  4. Click Next to select the type of metadata.

Setting the type of metadata (input)

In this step, the type of metadata is set as either input or output. For this procedure, the metadata of interest is input.

  1. In the dialog box, select Input XML.

  2. Click Next to upload the input file.

Uploading an XML file

This procedure describes how to upload an XML file to obtain the XML tree structure. To upload an XML Schema Definition (XSD) file, see Uploading an XSD file.

The example input XML file used to demonstrate this step contains contact information and is structured as follows:

<contactInfo>
  <contact>
    <id>1</id>
    <firstName>Michael</firstName>
    <lastName>Jackson</lastName>
    <company>Talend</company>
    <city>Paris</city>
    <phone>2323</phone>
  </contact>
  <contact>
    <id>2</id>
    <firstName>Elisa</firstName>
    <lastName>Black</lastName>
    <company>Talend</company>
    <city>Paris</city>
    <phone>4499</phone>
  </contact>
  ...
</contactInfo>

To upload an XML file, do the following:

  1. Click Browse... and browse your directory to the XML file to be uploaded. Alternatively, enter the access path to the file.

    The Schema Viewer area displays a preview of the XML structure. You can expand and visualize every level of the file's XML tree structure.

  2. Enter the Encoding type in the corresponding field if the system does not detect it automatically.

  3. In the Limit field, enter the number of columns on which the XPath query is to be executed, or 0 if you want to run it against all of the columns.

  4. Click Next to define the schema parameters.
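To inspect an XML file's structure outside the wizard before uploading it, a short script can list its distinct element paths. The following is a minimal sketch using Python's standard library, not a description of how Talend Studio builds its Schema Viewer; the helper name tree_paths and the two-contact sample are assumptions made for illustration.

```python
import xml.etree.ElementTree as ET

# Two-contact sample with the same structure as the example file above.
SAMPLE_XML = """<contactInfo>
  <contact>
    <id>1</id>
    <firstName>Michael</firstName>
    <lastName>Jackson</lastName>
    <company>Talend</company>
    <city>Paris</city>
    <phone>2323</phone>
  </contact>
  <contact>
    <id>2</id>
    <firstName>Elisa</firstName>
    <lastName>Black</lastName>
    <company>Talend</company>
    <city>Paris</city>
    <phone>4499</phone>
  </contact>
</contactInfo>"""


def tree_paths(element, prefix=""):
    """Return the distinct element paths in document order (hypothetical helper)."""
    path = f"{prefix}/{element.tag}"
    paths = [path]
    for child in element:
        for child_path in tree_paths(child, path):
            # Repeated siblings (the second <contact>) add no new paths.
            if child_path not in paths:
                paths.append(child_path)
    return paths


root = ET.fromstring(SAMPLE_XML)
paths = tree_paths(root)
for p in paths:
    # Indent each element name by its depth to mimic an expandable tree.
    depth = p.count("/") - 1
    print("  " * depth + p.rsplit("/", 1)[-1])
```

The resulting paths, such as /contactInfo/contact, are the same kind of absolute expression that is later entered as the XPath loop expression.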

Uploading an XSD file

This procedure describes how to upload an XSD file to obtain the XML tree structure. To upload an XML file, see Uploading an XML file.

An XSD file is used to define the schema of XML files. The structure and element data types of the example XML file above can be described using the following XSD, which is used as the example XSD input in this section.

<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema" elementFormDefault="qualified">
  <xs:element name="contactInfo">
    <xs:complexType>
      <xs:sequence>
        <xs:element maxOccurs="unbounded" ref="contact"/>
      </xs:sequence>
    </xs:complexType>
  </xs:element>
  <xs:element name="contact">
    <xs:complexType>
      <xs:sequence>
        <xs:element ref="id"/>
        <xs:element ref="firstName"/>
        <xs:element ref="lastName"/>
        <xs:element ref="company"/>
        <xs:element ref="city"/>
        <xs:element ref="phone"/>
      </xs:sequence>
    </xs:complexType>
  </xs:element>
  <xs:element name="id" type="xs:integer"/>
  <xs:element name="firstName" type="xs:NCName"/>
  <xs:element name="lastName" type="xs:NCName"/>
  <xs:element name="company" type="xs:NCName"/>
  <xs:element name="city" type="xs:NCName"/>
  <xs:element name="phone" type="xs:integer"/>
</xs:schema>

For more information on XML Schema, see http://www.w3.org/XML/Schema.

Note

When loading an XSD file,

  • the data will be saved in the Repository, and therefore the metadata will not be affected by the deletion or displacement of the file.

  • you can choose an element as the root of your XML tree.

To load an XSD file, do the following:

  1. Click Browse... and browse your directory to the XSD file to be uploaded. Alternatively, enter the access path to the file.

  2. In the dialog box that appears, select an element from the Root list as the root of your XML tree, and click OK.

    The Schema Viewer area displays a preview of the XML structure. You can expand and visualize every level of the file's XML tree structure.

  3. Enter the Encoding type in the corresponding field if the system does not detect it automatically.

  4. In the Limit field, enter the number of columns on which the XPath query is to be executed, or 0 if you want to run it against all of the columns.

  5. Click Next to define the schema parameters.
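To see which elements an XSD declares at the top level, and what tree a given root would produce, the schema can be read like any other XML document. The sketch below uses Python's standard library on the example XSD; it assumes that candidate roots correspond to the global xs:element declarations, which is an assumption about the Root list rather than documented Talend behaviour. The helper names children_of and print_tree are hypothetical.

```python
import xml.etree.ElementTree as ET

XS = "{http://www.w3.org/2001/XMLSchema}"

# The example XSD from this section.
SAMPLE_XSD = """<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema" elementFormDefault="qualified">
  <xs:element name="contactInfo">
    <xs:complexType>
      <xs:sequence>
        <xs:element maxOccurs="unbounded" ref="contact"/>
      </xs:sequence>
    </xs:complexType>
  </xs:element>
  <xs:element name="contact">
    <xs:complexType>
      <xs:sequence>
        <xs:element ref="id"/>
        <xs:element ref="firstName"/>
        <xs:element ref="lastName"/>
        <xs:element ref="company"/>
        <xs:element ref="city"/>
        <xs:element ref="phone"/>
      </xs:sequence>
    </xs:complexType>
  </xs:element>
  <xs:element name="id" type="xs:integer"/>
  <xs:element name="firstName" type="xs:NCName"/>
  <xs:element name="lastName" type="xs:NCName"/>
  <xs:element name="company" type="xs:NCName"/>
  <xs:element name="city" type="xs:NCName"/>
  <xs:element name="phone" type="xs:integer"/>
</xs:schema>"""

schema = ET.fromstring(SAMPLE_XSD)

# Global (top-level) element declarations, keyed by name.
global_elements = {el.get("name"): el for el in schema.findall(f"{XS}element")}


def children_of(name):
    """Names of the elements referenced in the xs:sequence of a global element."""
    decl = global_elements[name]
    refs = decl.findall(f"{XS}complexType/{XS}sequence/{XS}element")
    return [ref.get("ref") for ref in refs]


def print_tree(name, depth=0):
    """Print the element tree that results from choosing `name` as the root."""
    type_name = global_elements[name].get("type", "")
    print("  " * depth + name + (f" ({type_name})" if type_name else ""))
    for child in children_of(name):
        print_tree(child, depth + 1)


candidate_roots = list(global_elements)
print("Candidate roots:", candidate_roots)
print_tree("contactInfo")
```

Calling print_tree("contact") instead shows the shallower tree obtained when contact is chosen as the root. The sketch only handles the ref-based style used in the example XSD, not inline or named complex types.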

Defining the schema

In this step, the schema parameters are set.

The schema definition window is composed of four views:

View            Description

Source Schema   Tree view of the XML file.

Target Schema   Extraction and iteration information.

Preview         Preview of the target schema, together with the input data of the selected columns displayed in the defined order. Note: The preview functionality is not available if you loaded an XSD file.

File Viewer     Preview of the raw data.

First, define an XPath loop and the maximum number of times the loop can run. To do so:

  1. Populate the XPath loop expression field with the absolute XPath expression for the node to be iterated upon. There are two ways to do this:

    • enter the absolute XPath expression for the node to be iterated upon (type the full expression or press Ctrl+Space to use the autocompletion list), or

    • drop a node from the tree view under Source Schema onto the Absolute XPath expression field.

      An orange arrow links the node to the corresponding expression.

    Note

    The XPath loop expression field is mandatory.

  2. In the Loop limit field, specify the maximum number of times the selected node can be iterated, or -1 if you want to run it against all of the rows.

  3. Define the fields to be extracted by dragging the node(s) of interest from the Source Schema tree into the Relative or absolute XPath expression fields.

    Note

    You can select several nodes to drop on the table by pressing Ctrl or Shift and clicking the nodes of interest. The arrow linking the currently selected node in the Source Schema to the Fields to extract table is blue. The other arrows are gray.

  4. If needed, you can add as many columns to be extracted as necessary, delete columns or change the column order using the toolbar:

    • Add or delete a column using the corresponding toolbar buttons.

    • Change the order of the columns using the corresponding toolbar buttons.

  5. In the Column name fields, enter labels for the columns to be displayed in the schema Preview area.

  6. Click Refresh Preview to display a preview of the target schema. The fields are consequently displayed in the schema according to the defined order.

    Note

    The preview functionality is not available if you loaded an XSD file.

  7. Click Next to check and edit the end schema.
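The relationship between the XPath loop expression, the loop limit and the relative XPath expressions of the extracted fields can be illustrated outside the Studio. The following is a minimal sketch using Python's standard library, not Talend's implementation; the function extract_rows, its parameters and the column names are hypothetical, and the two-contact sample stands in for the example file.

```python
import xml.etree.ElementTree as ET

# Two-contact sample with the same structure as the example input file.
SAMPLE_XML = """<contactInfo>
  <contact>
    <id>1</id>
    <firstName>Michael</firstName>
    <lastName>Jackson</lastName>
    <company>Talend</company>
    <city>Paris</city>
    <phone>2323</phone>
  </contact>
  <contact>
    <id>2</id>
    <firstName>Elisa</firstName>
    <lastName>Black</lastName>
    <company>Talend</company>
    <city>Paris</city>
    <phone>4499</phone>
  </contact>
</contactInfo>"""


def extract_rows(xml_text, loop_xpath, fields, loop_limit=-1):
    """Iterate over the loop node and evaluate each relative XPath per node.

    loop_xpath: absolute path of the node to iterate on, e.g. "/contactInfo/contact".
    fields:     mapping of column name -> XPath relative to the loop node.
    loop_limit: maximum number of loop nodes to read, or -1 for all of them.
    """
    root = ET.fromstring(xml_text)
    # ElementTree evaluates paths relative to an element, so the leading
    # "/<root tag>" segment is checked here and then stripped.
    segments = loop_xpath.strip("/").split("/")
    if segments[0] != root.tag:
        return []
    relative = "/".join(segments[1:]) or "."
    nodes = root.findall(relative)
    if loop_limit >= 0:
        nodes = nodes[:loop_limit]
    return [
        {column: node.findtext(xpath) for column, xpath in fields.items()}
        for node in nodes
    ]


rows = extract_rows(
    SAMPLE_XML,
    loop_xpath="/contactInfo/contact",
    fields={"id": "id", "first_name": "firstName", "city": "city"},
    loop_limit=-1,
)
for row in rows:
    print(row)
```

Each loop node yields one row, and each field expression is resolved relative to that node, which is why firstName rather than /contactInfo/contact/firstName is sufficient for a column. Setting loop_limit to 1 would return only the first contact.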

Finalizing the end schema

The schema generated displays the columns selected from the XML file and allows you to further define the schema.

  1. If needed, rename the metadata in the Name field (metadata, by default), add a Comment, and make further modifications, for example:

    • Redefine the columns by editing the relevant fields.

    • Add or delete a column using the corresponding toolbar buttons.

    • Change the order of the columns using the corresponding toolbar buttons.

    Make sure the data type in the Type column is correctly defined.

    For more information regarding Java data types, including date pattern, see Java API Specification.

    Below are the commonly used Talend data types:

    • Object: a generic Talend data type that allows processing data without regard to its content. For example, a data file not otherwise supported can be processed with a tFileInputRaw component by specifying that it has a data type of Object.

    • List: a space-separated list of primitive type elements in an XML Schema definition, defined using the xsd:list element.

    • Dynamic: a data type that can be set for a single column at the end of a schema to allow processing fields as VARCHAR(100) columns named either as 'Column<X>' or, if the input includes a header, from the column names appearing in the header. For more information, see Dynamic schema.

    • Document: a data type that allows processing an entire XML document without regard to its content.

  2. If the XML file which the schema is based on has been changed, click the Guess button to generate the schema again. Note that if you have customized the schema, the Guess feature does not retain these changes.

  3. Click Finish. The new file connection, along with its schema, appears under the File XML node in the Repository tree view.

Now you can drag and drop the file connection or any of its schemas from the Repository tree view onto the design workspace as a new tFileInputXML or tExtractXMLField component, or onto an existing component to reuse the metadata. For further information about how to use the centralized metadata in a Job, see How to use centralized metadata in a Job and How to set a repository schema.

To modify an existing file connection, right-click it in the Repository tree view and select Edit file xml to open the file metadata setup wizard.

To add a new schema to an existing file connection, right-click the connection in the Repository tree view and select Retrieve Schema from the contextual menu.

To edit an existing file schema, right-click the schema in the Repository tree view and select Edit Schema from the contextual menu.