tFileInputXML MapReduce properties

XML connectors

author
Talend Documentation Team
EnrichVersion
6.5
EnrichProdName
Talend Real-Time Big Data Platform
Talend Big Data
Talend Open Studio for Big Data
Talend MDM Platform
Talend Open Studio for MDM
Talend ESB
Talend Data Fabric
Talend Big Data Platform
Talend Data Services Platform
Talend Data Management Platform
Talend Open Studio for ESB
Talend Open Studio for Data Integration
Talend Data Integration
EnrichPlatform
Talend Studio

These properties are used to configure tFileInputXML running in the MapReduce Job framework.

The MapReduce tFileInputXML component belongs to the MapReduce family.

The component in this framework is available only if you have subscribed to one of the Talend solutions with Big Data.

Basic settings

Property type

Either Built-In or Repository.

 

Built-In: No property data stored centrally.

 

Repository: Select the repository file where the properties are stored.

The properties are stored centrally under the Hadoop Cluster node of the Repository tree.

The fields that follow are pre-filled using the fetched data.

For further information about the Hadoop Cluster node, see the Getting Started Guide.

Schema and Edit Schema

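A schema is a row description. It defines the number of fields (columns) to be processed and passed on to the next component. The schema is either Built-In or stored remotely in the Repository. When you create a Spark Job, avoid the reserved word line when naming the fields.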

Click Edit schema to make changes to the schema. If the current schema is of the Repository type, three options are available:

  • View schema: choose this option to view the schema only.

  • Change to built-in property: choose this option to change the schema to Built-in for local changes.

  • Update repository connection: choose this option to change the schema stored in the repository and decide whether to propagate the changes to all the Jobs upon completion. If you just want to propagate the changes to the current Job, you can select No upon completion and choose this schema metadata again in the [Repository Content] window.

 

Built-In: You create and store the schema locally for this component only. Related topic: see Talend Studio User Guide.

 

Repository: You have already created the schema and stored it in the Repository. You can reuse it in various projects and Job designs. Related topic: see Talend Studio User Guide.

Folder/File

Browse to, or enter the path pointing to the data to be used in the file system.

If the path you set points to a folder, for example /user/talend/in, this component reads all of the files stored in that folder. Sub-folders are ignored unless you set the property mapreduce.input.fileinputformat.input.dir.recursive to true in the Hadoop properties table in the Hadoop configuration tab.

If you want to specify more than one file or directory in this field, separate each path using a comma (,).

If the file to be read is compressed, enter the file name with its extension; tFileInputXML then automatically decompresses it at runtime. The supported compression formats and their corresponding extensions are:

  • DEFLATE: *.deflate

  • gzip: *.gz

  • bzip2: *.bz2

  • LZO: *.lzo

Note that you need to ensure you have properly configured the connection to the Hadoop distribution to be used in the Hadoop configuration tab in the Run view.
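The path and compression rules above can be illustrated with a short Python sketch. This is not Talend code: split_paths and compression_for are hypothetical helper names, and the table only restates the extensions listed above.

```python
import posixpath

# Extension-to-format table restating the list above (illustration only).
COMPRESSION_BY_EXTENSION = {
    ".deflate": "DEFLATE",
    ".gz": "gzip",
    ".bz2": "bzip2",
    ".lzo": "LZO",
}

def split_paths(field_value):
    """Split a comma-separated Folder/File value into individual paths."""
    return [p.strip() for p in field_value.split(",") if p.strip()]

def compression_for(path):
    """Return the compression format implied by the file extension, or None."""
    _, ext = posixpath.splitext(path)
    return COMPRESSION_BY_EXTENSION.get(ext.lower())

paths = split_paths("/user/talend/in/a.xml.gz,/user/talend/in/b.xml")
formats = [compression_for(p) for p in paths]
```

Here the first path is recognized as gzip from its extension, while the second has no compression extension and is treated as a plain file.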

Element to extract

Enter the element from which you need to read the contents and the child elements of the input XML data.

The element defined in this field is used as the root node of any XPath specified within this component. This element helps define the atomic units of the XML data to be used, so that however big the original document is or wherever the input is split, the rows within this element can be correctly distributed to the mapper tasks.

Note that any content outside this element is ignored and the child elements of this element cannot contain this element itself.
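The idea of treating one element as an atomic unit can be sketched in Python as follows. This is only an illustration under stated assumptions, not the component's actual implementation: extract_elements is a hypothetical helper, the sample document and the customer element name are invented, and the regular expression deliberately does not handle an element nested inside itself, which matches the restriction described above.

```python
import re
import xml.etree.ElementTree as ET

def extract_elements(xml_text, element_name):
    """Return each complete <element_name>...</element_name> fragment, parsed."""
    name = re.escape(element_name)
    # Match a start tag (with optional attributes) up to the nearest end tag.
    # Content outside these fragments is never looked at.
    pattern = re.compile(rf"<{name}(?:\s[^>]*)?>.*?</{name}>", re.DOTALL)
    return [ET.fromstring(m.group(0)) for m in pattern.finditer(xml_text)]

document = """<?xml version="1.0"?>
<customers>
  <header>ignored</header>
  <customer><id>1</id><name>Ann</name></customer>
  <customer><id>2</id><name>Bob</name></customer>
</customers>"""

records = extract_elements(document, "customer")
ids = [r.findtext("id") for r in records]
```

Each fragment is parsed independently with itself as the root, so it can be handed to a separate worker without needing the rest of the document.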

Loop XPath query

The node of the tree on which the loop is based.

Note its root is the element you have defined in the Element to extract field.

Mapping

Column: Columns to map. They reflect the schema as defined in the Schema type field.

XPath Query: Enter the fields to be extracted from the structured input.

Get nodes: Select this check box to retrieve the XML content of all current nodes specified in the XPath query list, or select the check box next to specific XML nodes to retrieve only the content of the selected nodes. These nodes are important when the output flow from this component needs to use the XML structure, for example, the Document data type.

For further information about the Document type, see Talend Studio User Guide.
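The relationship between the extracted element, the loop node and the column mapping can be sketched in Python. This is a hedged illustration only: the order document, the loop query and the column names are invented, and Python's ElementTree supports just a subset of XPath with paths relative to the current element, so the query syntax may differ from what you enter in the component.

```python
import xml.etree.ElementTree as ET

# One extracted element; <order> plays the role of the Element to extract.
order = ET.fromstring(
    "<order><id>7</id>"
    "<items>"
    "<item><sku>A1</sku><qty>2</qty></item>"
    "<item><sku>B4</sku><qty>1</qty></item>"
    "</items></order>"
)

# Hypothetical loop query, relative to the extracted element.
loop_xpath = "items/item"
# Hypothetical mapping: schema column -> query relative to the loop node.
mapping = {"sku": "sku", "qty": "qty"}

# One output row per loop node, one field per mapped column.
rows = [
    {column: node.findtext(xpath) for column, xpath in mapping.items()}
    for node in order.findall(loop_xpath)
]
```

The loop query decides how many rows are produced, and each mapping entry decides what goes into one column of each row.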

Die on error

Select the check box to stop the execution of the Job when an error occurs.

Clear the check box to skip any rows on error and complete the process for error-free rows. When errors are skipped, you can collect the rows on error using a Row > Reject link.
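The two behaviors can be sketched in Python as follows. This is an assumption-laden illustration rather than the code Talend generates: read_rows is a hypothetical function, and the integer conversion merely stands in for whatever step may fail on a row.

```python
def read_rows(raw_rows, die_on_error):
    """Convert raw rows and return (main_rows, rejected_rows)."""
    main, rejects = [], []
    for raw in raw_rows:
        try:
            main.append({"id": int(raw["id"])})
        except (KeyError, ValueError) as exc:
            if die_on_error:
                raise  # stop the whole run at the first bad row
            # Keep the bad row with its error message, as a reject flow would.
            rejects.append({"row": raw, "error": str(exc)})
    return main, rejects

main, rejects = read_rows(
    [{"id": "1"}, {"id": "x"}, {"id": "3"}], die_on_error=False
)
```

With die_on_error set to False, the two valid rows reach the main output and the invalid one is collected separately; with True, the same input would raise on the second row.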

Advanced settings

Ignore the namespaces

Select this check box to ignore namespaces.
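As a rough illustration of what ignoring namespaces means for a query, the following Python sketch strips namespace qualifiers before looking up an element. It is not the component's implementation: strip_namespaces is a hypothetical helper and the namespace URI is a placeholder.

```python
import xml.etree.ElementTree as ET

def strip_namespaces(element):
    """Remove the {uri} prefix ElementTree puts on namespaced tags, in place."""
    for node in element.iter():
        if isinstance(node.tag, str) and node.tag.startswith("{"):
            node.tag = node.tag.split("}", 1)[1]
    return element

doc = ET.fromstring(
    '<c:customer xmlns:c="http://example.com/c"><c:name>Ann</c:name></c:customer>'
)

# Before stripping, the unqualified name does not match the namespaced tag.
before = doc.findtext("name")
# After stripping, the same unqualified query finds the element.
after = strip_namespaces(doc).findtext("name")
```

Without the stripping step, every query would have to spell out the namespace of each element it targets.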

Custom encoding

You may encounter encoding issues when you process the stored data. In that situation, select this check box to display the Encoding list.

Select the encoding from the list or select Custom and define it manually. This field is compulsory for database data handling. The supported encodings depend on the JVM that you are using. For more information, see https://docs.oracle.com.

Global Variables

ERROR_MESSAGE: the error message generated by the component when an error occurs. This is an After variable and it returns a string. This variable functions only if the Die on error check box is cleared, if the component has this check box.

A Flow variable functions during the execution of a component while an After variable functions after the execution of the component.

To fill in a field or expression with a variable, press Ctrl + Space to access the variable list and choose the variable to use from it.

For further information about variables, see Talend Studio User Guide.

Usage

Usage rule

Because of the characteristics of the MapReduce framework, the Map/Reduce version of tFileInputXML does not support any of the following XML parsers: the DOM-based parsers, the SAX-based parsers and the streaming-based parsers.

In a Talend Map/Reduce Job, it is used as a start component and requires a transformation component as its output link. The other components used along with it must be Map/Reduce components, too. They generate native Map/Reduce code that can be executed directly in Hadoop.

Once a Map/Reduce Job is opened in the workspace, tFileInputXML as well as the MapReduce family appear in the Palette of the Studio.

Note that in this documentation, unless otherwise explicitly stated, a scenario presents only Standard Jobs, that is to say traditional Talend data integration Jobs, not Map/Reduce Jobs.

Hadoop Connection

You need to use the Hadoop Configuration tab in the Run view to define the connection to a given Hadoop distribution for the whole Job.

This connection is effective on a per-Job basis.