W3C XML - Import - 7.1

Talend Data Catalog Bridges

author
Talend Documentation Team
EnrichVersion
7.1
EnrichProdName
Talend Big Data Platform
Talend Data Fabric
Talend Data Management Platform
Talend Data Services Platform
Talend MDM Platform
Talend Real-Time Big Data Platform
EnrichPlatform
Talend Data Catalog

Note: This file format needs to be imported with the File System (CSV, Excel, XML, JSON, Avro, Parquet, ORC, COBOL Copybook), Apache Hadoop Distributed File System (HDFS Java API) or Amazon Web Services (AWS) S3 Storage bridges.

Bridge Specifications

Vendor: World Wide Web Consortium
Tool Name: XML
Tool Version: 1.0
Tool Web Site: http://www.w3.org/TR/2000/REC-xml-20001006
Supported Methodology: [File System] Data Store (NoSQL / Hierarchical, Physical Data Model) via XML File
Multi-Model Harvesting
Incremental Harvesting
Data Profiling
Remote Repository Browsing for Model Selection

SPECIFICATIONS
Tool: World Wide Web Consortium XML version 1.0 via XML File
See http://www.w3.org/TR/2000/REC-xml-20001006
Metadata: [File System] Data Store (NoSQL / Hierarchical, Physical Data Model)
Bridge: W3cXml version 11.0.0

OVERVIEW
This W3C XML import bridge is used in conjunction with other file import bridges (e.g. CSV, XLSX, JSON, Avro, Parquet) by all data lake / file crawler import bridges (e.g. File System, Amazon S3, Hadoop HDFS).

The purpose of this XML import bridge is to reverse engineer a model/schema from the content of an XML file when that XML was not formally defined by an XSD or a DTD.
Such XML files are commonly produced by IoT devices and uploaded into a data lake.
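
As an illustration of what reverse engineering a structure from content can mean, the following minimal Python sketch collects every element path found in a sample document, together with the attribute names seen on it. It is not the bridge's actual algorithm; the function name infer_structure and the sample data are hypothetical.

```python
import xml.etree.ElementTree as ET

def infer_structure(element, path="", schema=None):
    """Collect every element path with the attribute names seen on it."""
    if schema is None:
        schema = {}
    current = f"{path}/{element.tag}"
    # Merge attribute names across all occurrences of the same path.
    schema.setdefault(current, set()).update(element.attrib)
    for child in element:
        infer_structure(child, current, schema)
    return schema

# Hypothetical sample resembling device output with no XSD or DTD.
sample = """<?xml version="1.0" encoding="UTF-8"?>
<readings device="sensor-1">
  <reading ts="2020-01-01T00:00:00Z"><temp unit="C">21.5</temp></reading>
  <reading ts="2020-01-01T00:01:00Z"><temp unit="C">21.7</temp><humidity>40</humidity></reading>
</readings>"""

schema = infer_structure(ET.fromstring(sample.encode("utf-8")))
for path in sorted(schema):
    print(path, sorted(schema[path]))
```

Because the structure is derived from the instances alone, an element that appears in only some records (humidity in this sample) is still captured once it has been seen.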

Nevertheless, such XML files are expected to be fully W3C compliant, especially with respect to the XML text declaration, well-formed parsed entities, and character encoding of entities.
See W3C standards for more details:
https://www.w3.org/TR/xml/#sec-TextDecl
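
Before importing, it can help to confirm that a file at least parses as well-formed XML. The sketch below uses the Python standard library parser for that purpose. It only covers well-formedness as seen by that parser, not every aspect of W3C compliance, and the helper name check_well_formed is hypothetical.

```python
import xml.etree.ElementTree as ET

def check_well_formed(data: bytes):
    """Return (True, None) if data parses as well-formed XML, else (False, message)."""
    try:
        ET.fromstring(data)
    except ET.ParseError as exc:
        return False, str(exc)
    return True, None

good = b'<?xml version="1.0" encoding="UTF-8"?><root><item>1</item></root>'
mismatched = b'<root><item>1</root>'
# The declaration must be the very first thing in the file.
late_declaration = b' <?xml version="1.0"?><root/>'

for name, data in [("good", good), ("mismatched", mismatched), ("late_declaration", late_declaration)]:
    ok, message = check_well_formed(data)
    print(name, ok, message)
```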

Warning: for all other needs, use the dedicated XML-based import bridges, such as:
- other standard W3C XML import bridges (e.g. DTD, XSD, WSDL, OWL/RDF)
- tool specific XML import bridges (e.g. Erwin Data Modeler XML, Informatica PowerCenter XML)


Bridge Parameters

Parameter Name: File
Description: The bridge uses the XML file as input.
Type: FILE
Values: *.xml
Scope: Mandatory

Parameter Name: Miscellaneous
Description: Specify miscellaneous options, each identified by a -option followed by a value if required:

GENERAL OPTIONS
-m <Java Memory's maximum size>
1G by default on a 64-bit JRE, or as set in conf/conf.properties, e.g.
-m 8G
-m 2500M

-j <Java Runtime Environment command line options>
This option must be the last one in the Miscellaneous parameter as all the text after -j is passed "as is" to the JRE, e.g.
-j -Dname=value -Xms1G

-jre <Java Runtime Environment full path name>
It can be an absolute path to javaw.exe on Windows or a link/script path on Linux, e.g.
-jre "c:\Program Files\Java\jre1.8.0_211\bin\javaw.exe"

-v <Environment variable value>
None by default, e.g.
-v var1=value1 -v var2="value2 with spaces"

-model.name <model name>
Override the model name, e.g.
-model.name "My Model Name"

-prescript <script name>
The script must be located in the bin directory and have a .bat or .sh extension.
The script path must not include any parent directory symbol (..).
The script should return exit code 0 to indicate success, or another value to indicate failure.
For example:
-prescript "script.bat"
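
The naming rules stated for -prescript can be checked before the value is entered. The following is a minimal sketch under stated assumptions: the helper name is_valid_prescript is hypothetical, the extension check is case-insensitive by assumption, and the sketch does not verify that the script actually exists in the bin directory.

```python
def is_valid_prescript(name: str) -> bool:
    """Check a -prescript value against the documented naming rules."""
    # Treat both slash styles as separators so ".." is caught on any platform.
    parts = name.replace("\\", "/").split("/")
    if ".." in parts:
        return False
    # Assumption: the extension comparison ignores case.
    return name.lower().endswith((".bat", ".sh"))

for candidate in ("script.bat", "prepare.sh", "../script.sh", "script.py"):
    print(candidate, is_valid_prescript(candidate))
```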

FILE SYSTEM OPTIONS
-tps <Processing Thread Pool Size's maximum count>
1 by default, e.g.
-tps 10

-tl <Processing Time Limit duration>
No limits by default. Time can be specified in seconds, minutes, or hours, e.g.
-tl 3600s
-tl 60m
-tl 1h

-fl <Processing File Limit count>
No limits by default, e.g.
-fl 100

-hadoop <Hadoop configuration options>
None by default, e.g.
-hadoop key1=val1;key2=val2

-fresh.partition.models
Use to import the most recently modified files when processing partitions defined in the "Partitioned directories" parameter.

-subst <path> <new path>
Use to associate a root path part with a drive or another path, e.g.
-subst K: C:/test

-skip.download
Use to disable the downloading of dependencies and use only the download cache.

-disable.partitions.autodetection
Use this option to disable automatic partition detection (when the "Partition directories" option is empty).
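
The three -tl examples above (3600s, 60m, 1h) all describe the same limit. The sketch below shows that equivalence by converting each form to seconds. It only illustrates the documented value format; how the bridge parses the value internally is not documented here, and the name time_limit_seconds is hypothetical.

```python
import re

UNIT_SECONDS = {"s": 1, "m": 60, "h": 3600}

def time_limit_seconds(value: str) -> int:
    """Convert a value such as '3600s', '60m' or '1h' to seconds."""
    match = re.fullmatch(r"(\d+)([smh])", value)
    if match is None:
        raise ValueError(f"unsupported time limit: {value!r}")
    amount, unit = match.groups()
    return int(amount) * UNIT_SECONDS[unit]

for value in ("3600s", "60m", "1h"):
    print(value, time_limit_seconds(value))
```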

DELIMITED FILE OPTIONS
-delimited.no_header
By default, the bridge automatically tries to detect a header row while processing CSV files (based on the header column types). Use this option to disable header import (e.g. to hide sensitive data).

-delimited.top_rows_skip <number>
Number of rows to skip at the top of a delimited file while processing (0 by default), e.g.
-delimited.top_rows_skip 1

-delimited.extra_separators <comma separated separators>
Extra delimiters for delimited files, added to the default separators, e.g.
-delimited.extra_separators ~,||,|~

PARQUET FILE OPTIONS
-parquet.compressed.max.size=<value>
Ignore Parquet archives larger than the value defined by this option (the default is 10 000 000 bytes), e.g.
-parquet.compressed.max.size=10000000
Miscellaneous parameter type: STRING
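
Because all text after -j is passed to the JRE as is, a value for the Miscellaneous parameter is easiest to assemble with -j appended last. The sketch below is one way to do that. The function build_miscellaneous is a hypothetical helper, not part of the product, and the option values are taken from the examples above.

```python
def build_miscellaneous(options, jre_options=None):
    """Join option tokens into one string, placing -j last.

    options: list of (flag, value) pairs; value is None for flags without a value.
    jre_options: raw text to pass after -j, or None.
    """
    parts = []
    for flag, value in options:
        if flag == "-j":
            raise ValueError("pass JRE options through jre_options so -j stays last")
        parts.append(flag if value is None else f"{flag} {value}")
    if jre_options:
        parts.append(f"-j {jre_options}")
    return " ".join(parts)

misc = build_miscellaneous(
    [("-m", "8G"), ("-model.name", '"My Model Name"'), ("-skip.download", None)],
    jre_options="-Dname=value -Xms1G",
)
print(misc)
```

Keeping the JRE text in a separate argument makes it harder to place another option after -j by accident.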

 

Bridge Mapping

Mapping information is not available