tExtractXMLField - 6.1

Talend Components Reference Guide


Function

tExtractXMLField reads an input XML field of a file or a database table and extracts the desired data.

Purpose

tExtractXMLField opens an input XML field, reads the XML structured data directly without having first to write it out to a temporary file, and finally sends data as defined in the schema to the following component via a Row link.

If you have subscribed to one of the Talend solutions with Big Data, this component is available in the following types of Jobs: Standard, Map/Reduce, Spark Batch and Spark Streaming.

tExtractXMLField properties

Component family

XML

 

Basic settings

Property type

Either Built-In or Repository.

Click Edit schema to make changes to the schema. If the current schema is of the Repository type, three options are available:

  • View schema: choose this option to view the schema only.

  • Change to built-in property: choose this option to change the schema to Built-in for local changes.

  • Update repository connection: choose this option to change the schema stored in the repository and decide whether to propagate the changes to all the Jobs upon completion. If you just want to propagate the changes to the current Job, you can select No upon completion and choose this schema metadata again in the [Repository Content] window.

 

 

Built-In: No property data stored centrally.

 

 

Repository: Select the repository file where the properties are stored.

When this file is selected, the fields that follow are pre-filled using the fetched data.

 

Schema type and Edit Schema

A schema is a row description. It defines the number of fields (columns) to be processed and passed on to the next component. The schema is either Built-In or stored remotely in the Repository.

 

 

Built-In: You create and store the schema locally for this component only. Related topic: see Talend Studio User Guide.

 

 

Repository: You have already created the schema and stored it in the Repository. You can reuse it in various projects and Job designs. Related topic: see Talend Studio User Guide.

 

XML field

Name of the XML field to be processed.

Related topic: see Talend Studio User Guide.

 

Loop XPath query

Node of the XML tree on which the loop is based.

 

Mapping

Column: reflects the schema as defined by the Schema type field.

XPath Query: Enter the fields to be extracted from the structured input.

Get nodes: Select this check box to retrieve the XML content of all the current nodes specified in the XPath query list, or select the check box next to specific XML nodes to retrieve only the content of the selected nodes.
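
As an illustration of how the Loop XPath query and the Mapping table work together, here is a minimal Python sketch. It uses the standard xml.etree.ElementTree module rather than Talend's own engine, so it only approximates the component. The sample XML, the column names and the helper extract_rows are hypothetical. The sketch assumes that one output row is produced per node matched by the loop query, that each column query is evaluated relative to that node, and that Get nodes returns the serialized XML of the matched node instead of its text. ElementTree supports only a subset of XPath, so the loop query is written relative to the root element.

```python
import xml.etree.ElementTree as ET

# Hypothetical content of one XML field value.
XML_FIELD = """
<customers>
  <customer id="1"><name>Ann</name><city>Paris</city></customer>
  <customer id="2"><name>Bob</name><city>Lyon</city></customer>
</customers>
"""


def extract_rows(xml_text, loop_query, mapping, get_nodes=()):
    """Return one dict per node matched by loop_query.

    mapping maps an output column name to a query evaluated relative to
    the current loop node. Columns listed in get_nodes receive the
    serialized XML of the matched node rather than its text.
    """
    root = ET.fromstring(xml_text)
    rows = []
    for node in root.findall(loop_query):
        row = {}
        for column, query in mapping.items():
            match = node.find(query)
            if match is None:
                row[column] = None
            elif column in get_nodes:
                row[column] = ET.tostring(match, encoding="unicode").strip()
            else:
                row[column] = match.text
        rows.append(row)
    return rows


rows = extract_rows(
    XML_FIELD,
    loop_query="customer",  # relative to the root element <customers>
    mapping={"name": "name", "city": "city", "raw": "."},
    get_nodes={"raw"},
)
for row in rows:
    print(row)
```

Each of the two customer nodes yields one row; the hypothetical raw column holds the whole customer element as XML text because it is listed in get_nodes.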

 

Limit

Maximum number of rows to be processed. If Limit is 0, no rows are read or processed.

 

Die on error

Select this check box to stop the execution of the Job when an error occurs.

Clear the check box to skip any rows on error and complete the process for error-free rows. When errors are skipped, you can collect the rows on error using a Row > Reject link.
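
The two behaviors of Die on error can be sketched in Python as follows. This is an approximation under stated assumptions, not the code Talend generates: the function extract_names, the sample values and the error key in the reject rows are all hypothetical.

```python
import xml.etree.ElementTree as ET


def extract_names(xml_values, die_on_error=False):
    """Split input values into extracted rows and rejected rows."""
    main, rejects = [], []
    for value in xml_values:
        try:
            root = ET.fromstring(value)
            main.append(root.findtext("name"))
        except ET.ParseError as err:
            if die_on_error:
                raise  # stop the whole run at the first bad row
            # Otherwise skip the row and keep it for the reject flow.
            rejects.append({"value": value, "error": str(err)})
    return main, rejects


values = [
    "<customer><name>Ann</name></customer>",
    "<customer><name>Bob</customer>",  # malformed: <name> is never closed
    "<customer><name>Cid</name></customer>",
]

main, rejects = extract_names(values)
print(main)          # ['Ann', 'Cid']
print(len(rejects))  # 1
```

Calling extract_names(values, die_on_error=True) raises the parse error on the second value instead of collecting it.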

Advanced settings

Ignore the namespaces

Select this check box to ignore namespaces when reading and extracting the XML data.
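
To show why this option matters, the following Python sketch queries a namespaced XML value with and without its namespaces. The namespace URI, the sample XML and the helper strip_namespaces are hypothetical, and the stripping approach is only one way to emulate the option; how the component implements it is not described here.

```python
import xml.etree.ElementTree as ET

# Hypothetical field value that declares a default namespace.
XML_FIELD = (
    '<customer xmlns="http://example.com/crm">'
    "<name>Ann</name>"
    "</customer>"
)


def strip_namespaces(root):
    """Remove the '{uri}' prefix ElementTree puts on namespaced tags."""
    for element in root.iter():
        if isinstance(element.tag, str) and element.tag.startswith("{"):
            element.tag = element.tag.split("}", 1)[1]
    return root


root = ET.fromstring(XML_FIELD)
print(root.findtext("name"))  # None: the tag is really {http://example.com/crm}name

print(strip_namespaces(root).findtext("name"))  # Ann
```

Without the stripping step, a plain query such as name finds nothing because every element carries the namespace of its parent.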

tStatCatcher Statistics

Select this check box to gather the Job processing metadata at a Job level as well as at each component level.

Global Variables

ERROR_MESSAGE: the error message generated by the component when an error occurs. This is an After variable and it returns a string. This variable functions only if the Die on error check box is cleared, if the component has this check box.

NB_LINE: the number of rows processed. This is an After variable and it returns an integer.

A Flow variable functions during the execution of a component while an After variable functions after the execution of the component.

To fill a field or expression with a variable, press Ctrl + Space to access the variable list and choose the variable to use from it.

For further information about variables, see Talend Studio User Guide.

Usage

This component is an intermediate component. It requires an input component and an output component.

Log4j

If you are using a subscription-based version of the Studio, the activity of this component can be logged using the log4j feature. For more information on this feature, see Talend Studio User Guide.

For more information on the log4j logging levels, see the Apache documentation at http://logging.apache.org/log4j/1.2/apidocs/org/apache/log4j/Level.html.

Limitation

n/a

Scenario 1: Extracting XML data from a field in a database table

This three-component scenario reads the XML structure stored in a field of a database table and then extracts the data.

  1. Drop the following components from the Palette onto the design workspace: tMysqlInput, tExtractXMLField, and tFileOutputDelimited.

    Connect the three components using Main links.

  2. Double-click tMysqlInput to display its Basic settings view and define its properties.

  3. If you have already stored the input schema in the Repository tree view, select Repository first from the Property Type list and then from the Schema list to display the [Repository Content] dialog box where you can select the relevant metadata.

    For more information about storing schema metadata in the Repository tree view, see Talend Studio User Guide.

    If you have not stored the input schema in the Repository, select Built-in in the Property Type and Schema fields and enter the database connection and the data structure information manually. For more information about tMysqlInput properties, see tMysqlInput.

  4. In the Table Name field, enter the name of the table holding the XML data, customerdetails in this example.

    Click Guess Query to display the query corresponding to your schema.

  5. Double-click tExtractXMLField to display its Basic settings view and define its properties.

  6. Click Sync columns to retrieve the schema from the preceding component. You can click the three-dot button next to Edit schema to view/modify the schema.

    The Column field in the Mapping table will be automatically populated with the defined schema.

  7. In the XML field list, select the column from which you want to extract the XML data. In this example, the field holding the XML data is called CustomerDetails.

    In the Loop XPath query field, enter the node of the XML tree on which to loop to retrieve data.

    In the XPath query column, enter, between quotation marks, the node of the XML field holding the data you want to extract, CustomerName in this example.

  8. Double-click tFileOutputDelimited to display its Basic settings view and define its properties.

  9. In the File Name field, define or browse to the path of the output file you want to write the extracted data in.

    Click Sync columns to retrieve the schema from the preceding component. If needed, click the three-dot button next to Edit schema to view the schema.

  10. Save your Job and press F6 to execute it.

tExtractXMLField reads and extracts the client names under the CustomerName node of the CustomerDetails field of the defined database table.
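
The data flow of this scenario can be approximated outside the Studio with the short Python sketch below. It is not the Job itself: an in-memory SQLite database stands in for the MySQL table, a string buffer stands in for the output file, and the XML layout of the CustomerDetails column (a Customer element wrapping CustomerName), the sample names and the semicolon separator are assumptions made for illustration.

```python
import csv
import io
import sqlite3
import xml.etree.ElementTree as ET

# In-memory SQLite stands in for the MySQL table of the scenario.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customerdetails (id INTEGER, CustomerDetails TEXT)")
conn.executemany(
    "INSERT INTO customerdetails VALUES (?, ?)",
    [
        (1, "<Customer><CustomerName>Ann Lee</CustomerName></Customer>"),
        (2, "<Customer><CustomerName>Bob Ray</CustomerName></Customer>"),
    ],
)

output = io.StringIO()  # stands in for the output delimited file
writer = csv.writer(output, delimiter=";", lineterminator="\n")

for (xml_value,) in conn.execute(
    "SELECT CustomerDetails FROM customerdetails ORDER BY id"
):
    root = ET.fromstring(xml_value)     # parse the field directly in memory
    for node in root.iter("Customer"):  # loop node (assumed layout)
        writer.writerow([node.findtext("CustomerName")])

conn.close()
print(output.getvalue())
```

As in the Job, the XML held in the column is parsed in memory, without being written to a temporary file first.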

Scenario 2: Extracting correct and erroneous data from an XML field in a delimited file

This scenario describes a four-component Job that reads an XML structure from a delimited file, outputs the main data and rejects the erroneous data.

  1. Drop the following components from the Palette to the design workspace: tFileInputDelimited, tExtractXMLField, tFileOutputDelimited and tLogRow.

    Connect the first three components using Row > Main links.

    Connect tExtractXMLField to tLogRow using a Row > Reject link.

  2. Double-click tFileInputDelimited to open its Basic settings view and define the component properties.

  3. Select Built-in in the Schema list and fill in the file metadata manually in the corresponding fields.

    Click the three-dot button next to Edit schema to display a dialog box where you can define the structure of your data.

    Click the plus button to add as many columns as needed to your data structure. In this example, we have one column in the schema: xmlStr.

    Click OK to validate your changes and close the dialog box.

    Note

    If you have already stored the schema in the Metadata folder under File delimited, select Repository from the Schema list and click the three-dot button next to the field to display the [Repository Content] dialog box where you can select the relevant schema from the list. Click OK to close the dialog box and have the fields automatically filled in with the schema metadata.

    For more information about storing schema metadata in the Repository tree view, see Talend Studio User Guide.

  4. In the File Name field, click the three-dot button and browse to the input delimited file you want to process, CustomerDetails_Error in this example.

    This delimited file holds a number of simple XML lines separated by a double carriage return.

    Set the row and field separators used in the input file in the corresponding fields, double carriage return for the first and nothing for the second in this example.

    If needed, set Header, Footer and Limit. None is used in this example.

  5. In the design workspace, double-click tExtractXMLField to display its Basic settings view and define the component properties.

  6. Click Sync columns to retrieve the schema from the preceding component. You can click the three-dot button next to Edit schema to view/modify the schema.

    The Column field in the Mapping table will be automatically populated with the defined schema.

  7. In the XML field list, select the column from which you want to extract the XML data. In this example, the field holding the XML data is called xmlStr.

    In the Loop XPath query field, enter the node of the XML tree on which to loop to retrieve data.

  8. In the design workspace, double-click tFileOutputDelimited to open its Basic settings view and define the component properties.

  9. In the File Name field, define or browse to the output file you want to write the correct data in, CustomerNames_right.csv in this example.

    Click Sync columns to retrieve the schema of the preceding component. You can click the three-dot button next to Edit schema to view/modify the schema.

  10. In the design workspace, double-click tLogRow to display its Basic settings view and define the component properties.

    Click Sync Columns to retrieve the schema of the preceding component. For more information on this component, see tLogRow.

  11. Save your Job and press F6 to execute it.

tExtractXMLField reads the XML field and writes the client information whose XML structure is correct to the output delimited file, CustomerNames_right.csv. The erroneous data is displayed on the console of the Run view.
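
A rough Python equivalent of this Job, given as a sketch under stated assumptions, is shown below. The file content, the customer names and the Customer/CustomerName layout are hypothetical. The double line break plays the role of the row separator, each row holds the single xmlStr column, and lists stand in for the main output file and the reject flow sent to the console.

```python
import xml.etree.ElementTree as ET

# Hypothetical content of the input file: one XML document per row,
# rows separated by a double line break.
FILE_CONTENT = (
    "<Customer><CustomerName>Ann Lee</CustomerName></Customer>\n\n"
    "<Customer><CustomerName>Bob Ray</Customer>\n\n"
    "<Customer><CustomerName>Cid Poe</CustomerName></Customer>"
)

good_rows, rejected_rows = [], []
for xml_str in FILE_CONTENT.split("\n\n"):  # row separator
    try:
        root = ET.fromstring(xml_str)
        good_rows.append(root.findtext("CustomerName"))
    except ET.ParseError as err:
        # Malformed XML goes to the reject flow with its error text.
        rejected_rows.append((xml_str, str(err)))

print("main flow:", good_rows)
for xml_str, message in rejected_rows:
    print("reject:", xml_str, "|", message)
```

The second row is malformed because its CustomerName element is never closed, so it ends up in the reject list while the other two rows reach the main flow.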

tExtractXMLField in Talend Map/Reduce Jobs

Warning

The information in this section is only for users that have subscribed to one of the Talend solutions with Big Data and is not applicable to Talend Open Studio for Big Data users.

In a Talend Map/Reduce Job, tExtractXMLField, as well as the other Map/Reduce components preceding it, generates native Map/Reduce code. This section presents the specific properties of tExtractXMLField when it is used in that situation. For further information about a Talend Map/Reduce Job, see Talend Big Data Getting Started Guide.

Component family

XML

 

Basic settings

Property type

Either Built-In or Repository.

Click Edit schema to make changes to the schema. If the current schema is of the Repository type, three options are available:

  • View schema: choose this option to view the schema only.

  • Change to built-in property: choose this option to change the schema to Built-in for local changes.

  • Update repository connection: choose this option to change the schema stored in the repository and decide whether to propagate the changes to all the Jobs upon completion. If you just want to propagate the changes to the current Job, you can select No upon completion and choose this schema metadata again in the [Repository Content] window.

 

 

Built-In: No property data stored centrally.

 

 

Repository: Select the repository file where the properties are stored.

When this file is selected, the fields that follow are pre-filled using the fetched data.

 

Schema type and Edit Schema

A schema is a row description. It defines the number of fields (columns) to be processed and passed on to the next component. The schema is either Built-In or stored remotely in the Repository.

 

 

Built-In: You create and store the schema locally for this component only. Related topic: see Talend Studio User Guide.

 

 

Repository: You have already created the schema and stored it in the Repository. You can reuse it in various projects and Job designs. Related topic: see Talend Studio User Guide.

 

XML field

Name of the XML field to be processed.

Related topic: see Talend Studio User Guide.

 

Loop XPath query

Node of the XML tree on which the loop is based.

 

Mapping

Column: reflects the schema as defined by the Schema type field.

XPath Query: Enter the fields to be extracted from the structured input.

Get nodes: Select this check box to retrieve the XML content of all the current nodes specified in the XPath query list, or select the check box next to specific XML nodes to retrieve only the content of the selected nodes.

 

Die on error

Select this check box to stop the execution of the Job when an error occurs.

Clear the check box to skip any rows on error and complete the process for error-free rows. When errors are skipped, you can collect the rows on error using a Row > Reject link.

Advanced settings

Ignore the namespaces

Select this check box to ignore namespaces when reading and extracting the XML data.

Global Variables

ERROR_MESSAGE: the error message generated by the component when an error occurs. This is an After variable and it returns a string. This variable functions only if the Die on error check box is cleared, if the component has this check box.

A Flow variable functions during the execution of a component while an After variable functions after the execution of the component.

To fill a field or expression with a variable, press Ctrl + Space to access the variable list and choose the variable to use from it.

For further information about variables, see Talend Studio User Guide.

Usage in Map/Reduce Jobs

In a Talend Map/Reduce Job, this component is used as an intermediate step and other components used along with it must be Map/Reduce components, too. They generate native Map/Reduce code that can be executed directly in Hadoop.

For further information about a Talend Map/Reduce Job, see the sections describing how to create, convert and configure a Talend Map/Reduce Job of the Talend Big Data Getting Started Guide.

Note that in this documentation, unless otherwise explicitly stated, a scenario presents only Standard Jobs, that is to say traditional Talend data integration Jobs, and non Map/Reduce Jobs.

Related scenarios

No scenario is available for the Map/Reduce version of this component yet.

tExtractXMLField properties in Spark Batch Jobs

Component family

XML

 

Basic settings

Property type

Either Built-In or Repository.

Click Edit schema to make changes to the schema. If the current schema is of the Repository type, three options are available:

  • View schema: choose this option to view the schema only.

  • Change to built-in property: choose this option to change the schema to Built-in for local changes.

  • Update repository connection: choose this option to change the schema stored in the repository and decide whether to propagate the changes to all the Jobs upon completion. If you just want to propagate the changes to the current Job, you can select No upon completion and choose this schema metadata again in the [Repository Content] window.

 

 

Built-In: No property data stored centrally.

 

 

Repository: Select the repository file where the properties are stored.

When this file is selected, the fields that follow are pre-filled using the fetched data.

 

Schema type and Edit Schema

A schema is a row description. It defines the number of fields (columns) to be processed and passed on to the next component. The schema is either Built-In or stored remotely in the Repository.

 

 

Built-In: You create and store the schema locally for this component only. Related topic: see Talend Studio User Guide.

 

 

Repository: You have already created the schema and stored it in the Repository. You can reuse it in various projects and Job designs. Related topic: see Talend Studio User Guide.

 

XML field

Name of the XML field to be processed.

Related topic: see Talend Studio User Guide.

 

Loop XPath query

Node of the XML tree on which the loop is based.

 

Mapping

Column: reflects the schema as defined by the Schema type field.

XPath Query: Enter the fields to be extracted from the structured input.

Get nodes: Select this check box to retrieve the XML content of all the current nodes specified in the XPath query list, or select the check box next to specific XML nodes to retrieve only the content of the selected nodes.

 

Die on error

Select this check box to stop the execution of the Job when an error occurs.

Advanced settings

Ignore the namespaces

Select this check box to ignore namespaces when reading and extracting the XML data.

Usage in Spark Batch Jobs

In a Talend Spark Batch Job, this component is used as an intermediate step and other components used along with it must be Spark Batch components, too. They generate native Spark Batch code that can be executed directly in a Spark cluster.

This component, along with the Spark Batch component Palette it belongs to, appears only when you are creating a Spark Batch Job.

Note that in this documentation, unless otherwise explicitly stated, a scenario presents only Standard Jobs, that is to say traditional Talend data integration Jobs.

Spark Connection

You need to use the Spark Configuration tab in the Run view to define the connection to a given Spark cluster for the whole Job. In addition, since the Job expects its dependent jar files for execution, one and only one file-system-related component from the Storage family is required in the same Job so that Spark can use this component to connect to the file system to which the jar files dependent on the Job are transferred.

This connection is effective on a per-Job basis.

Log4j

If you are using a subscription-based version of the Studio, the activity of this component can be logged using the log4j feature. For more information on this feature, see Talend Studio User Guide.

For more information on the log4j logging levels, see the Apache documentation at http://logging.apache.org/log4j/1.2/apidocs/org/apache/log4j/Level.html.

Related scenarios

No scenario is available for the Spark Batch version of this component yet.

tExtractXMLField properties in Spark Streaming Jobs

Warning

The streaming version of this component is available in the Palette of the Studio on the condition that you have subscribed to Talend Real-Time Big Data Platform or Talend Data Fabric.

Component family

XML

 

Basic settings

Property type

Either Built-In or Repository.

Click Edit schema to make changes to the schema. If the current schema is of the Repository type, three options are available:

  • View schema: choose this option to view the schema only.

  • Change to built-in property: choose this option to change the schema to Built-in for local changes.

  • Update repository connection: choose this option to change the schema stored in the repository and decide whether to propagate the changes to all the Jobs upon completion. If you just want to propagate the changes to the current Job, you can select No upon completion and choose this schema metadata again in the [Repository Content] window.

 

 

Built-In: No property data stored centrally.

 

 

Repository: Select the repository file where the properties are stored.

When this file is selected, the fields that follow are pre-filled using the fetched data.

 

Schema type and Edit Schema

A schema is a row description. It defines the number of fields (columns) to be processed and passed on to the next component. The schema is either Built-In or stored remotely in the Repository.

 

 

Built-In: You create and store the schema locally for this component only. Related topic: see Talend Studio User Guide.

 

 

Repository: You have already created the schema and stored it in the Repository. You can reuse it in various projects and Job designs. Related topic: see Talend Studio User Guide.

 

XML field

Name of the XML field to be processed.

Related topic: see Talend Studio User Guide.

 

Loop XPath query

Node of the XML tree on which the loop is based.

 

Mapping

Column: reflects the schema as defined by the Schema type field.

XPath Query: Enter the fields to be extracted from the structured input.

Get nodes: Select this check box to retrieve the XML content of all the current nodes specified in the XPath query list, or select the check box next to specific XML nodes to retrieve only the content of the selected nodes.

 

Die on error

Select this check box to stop the execution of the Job when an error occurs.

Clear the check box to skip any rows on error and complete the process for error-free rows. When errors are skipped, you can collect the rows on error using a Row > Reject link.

Advanced settings

Ignore the namespaces

Select this check box to ignore namespaces when reading and extracting the XML data.

Usage in Spark Streaming Jobs

In a Talend Spark Streaming Job, this component is used as an intermediate step and other components used along with it must be Spark Streaming components, too. They generate native Spark Streaming code that can be executed directly in a Spark cluster.

This component, along with the Spark Streaming component Palette it belongs to, appears only when you are creating a Spark Streaming Job.

Note that in this documentation, unless otherwise explicitly stated, a scenario presents only Standard Jobs, that is to say traditional Talend data integration Jobs.

Spark Connection

You need to use the Spark Configuration tab in the Run view to define the connection to a given Spark cluster for the whole Job. In addition, since the Job expects its dependent jar files for execution, one and only one file-system-related component from the Storage family is required in the same Job so that Spark can use this component to connect to the file system to which the jar files dependent on the Job are transferred.

This connection is effective on a per-Job basis.

Log4j

If you are using a subscription-based version of the Studio, the activity of this component can be logged using the log4j feature. For more information on this feature, see Talend Studio User Guide.

For more information on the log4j logging levels, see the Apache documentation at http://logging.apache.org/log4j/1.2/apidocs/org/apache/log4j/Level.html.

Related scenarios

No scenario is available for the Spark Streaming version of this component yet.