Apache ORC File - Import - 7.1

Talend Data Catalog Bridges

Talend Documentation Team
Talend Big Data Platform
Talend Data Fabric
Talend Data Management Platform
Talend Data Services Platform
Talend MDM Platform
Talend Real-Time Big Data Platform
Talend Data Catalog

Note: This file format needs to be imported with the File System (CSV, Excel, XML, JSON, Avro, Parquet, ORC, COBOL Copybook), Apache Hadoop Distributed File System (HDFS Java API) or Amazon Web Services (AWS) S3 Storage bridges.

Bridge Specifications

Vendor Apache
Tool Name ORC File
Tool Version 1.5.2
Tool Web Site https://orc.apache.org/
Supported Methodology [File System] Data Store (Physical Data Model) via Java API on ORC File
Remote Repository Browsing for Model Selection
Data Profiling
Multi-Model Harvesting
Incremental Harvesting

Import tool: Apache ORC File 1.5.2 (https://orc.apache.org/)
Import interface: [File System] Data Store (Physical Data Model) via Java API on ORC File from Apache ORC File
Import bridge: 'ApacheOrc' 11.0.0

This bridge imports metadata from ORC files using a Java API.
Note that this bridge does not perform any data-driven metadata discovery; instead, it reads the schema definition at the header (top) of the ORC file.

This bridge detects the following standard ORC data types, as defined in https://orc.apache.org/docs/types.html:

Integer: boolean (1 bit), tinyint (8 bit), smallint (16 bit), int (32 bit), bigint (64 bit)
Floating point: float, double
String types: string, char, varchar
Binary blobs: binary
Date/time: timestamp, timestamp with local time zone, date
Compound types: struct, list, map, union
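
As an illustration of the type groups listed above, here is a minimal Python sketch that maps a type name to its category. The helper name orc_type_category and the table ORC_TYPE_CATEGORIES are hypothetical and not part of the bridge. The sketch only knows the names given in the list above and assumes that type parameters follow the base name in parentheses or angle brackets.

```python
# Hypothetical lookup table built only from the type names listed in this section.
ORC_TYPE_CATEGORIES = {
    "Integer": ("boolean", "tinyint", "smallint", "int", "bigint"),
    "Floating point": ("float", "double"),
    "String types": ("string", "char", "varchar"),
    "Binary blobs": ("binary",),
    "Date/time": ("timestamp", "timestamp with local time zone", "date"),
    "Compound types": ("struct", "list", "map", "union"),
}


def orc_type_category(type_name):
    """Return the category for a type name, or None if it is not in the list."""
    base = type_name.strip().lower()
    # Assumption: parameters follow the base name, e.g. "varchar(10)" or "struct<a:int>".
    for separator in ("(", "<"):
        base = base.split(separator, 1)[0]
    base = base.strip()
    for category, names in ORC_TYPE_CATEGORIES.items():
        if base in names:
            return category
    return None
```

A name that does not appear in the list returns None, so the caller decides how to handle it.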

Bridge Parameters

Parameter Name Description Type Values Default Scope
File Path to file to import FILE *.*   Mandatory
Miscellaneous Specify miscellaneous options identified with a -letter and value.

For example, -m 4G -f 100 -j -Dname=value -Xms1G

-m the maximum Java memory size, as a whole number (e.g. -m 4G or -m 2500M).
-v set environment variable(s) (e.g. -v var1=value -v var2="value with spaces").
-j the last option that is followed by Java command line options (e.g. -j -Dname=value -Xms1G).
-hadoop key1=val1;key2=val2 to manually set Hadoop configuration options
-tps 10 maximum threads pool size
-tl 3600s processing time limit, in seconds (s), minutes (m) or hours (h);
-fl 1000 processing files count limit;
-delimited.top_rows_skip 1 number of rows to skip while processing csv files
-delimited.extra_separators ~,||,|~ comma separated extra delimiters each of which will be used while processing csv files
-delimited.no_header by default, the bridge automatically tries to detect headers while processing CSV files (based on the header column types); use this option to disable header import (e.g. to hide sensitive data)
-fresh.partition.models - use to import latest modified files when processing partitions defined in Partitioned directories parameter
-subst K: C:/test - use to associate a root path part with a drive or another path.
-skip.download - use to disable dependencies downloading and use only download cache
-prescript [cmd] - runs a script command before bridge execution. Example: -prescript "script.bat"
The script must be located in the bin directory, and have a .bat or .sh extension.
The script path must not include any parent directory symbol (..)
The script should return exit code 0 to indicate success, or another value to indicate failure.
-disable.partitions.autodetection - use this option to disable automatic partition detection (when the "Partition directories" option is empty)
-parquet.compressed.max.size=10000000 the bridge ignores Parquet archives larger than the size set with this option; the default value is 10 000 000 bytes;
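
The value formats of the -hadoop and -tl options described above can be checked before launching an import. The following Python sketch is only an illustration of those formats. The function names parse_hadoop_options and parse_time_limit are hypothetical, and how the bridge itself parses these values is not documented here.

```python
import re

# Seconds per unit, for the s / m / h suffixes described for the -tl option.
_UNIT_SECONDS = {"s": 1, "m": 60, "h": 3600}


def parse_hadoop_options(value):
    """Split a 'key1=val1;key2=val2' string into a dictionary."""
    options = {}
    for pair in value.split(";"):
        if not pair.strip():
            continue  # tolerate a trailing semicolon
        key, separator, val = pair.partition("=")
        if not separator or not key.strip():
            raise ValueError("expected key=value, got: " + repr(pair))
        options[key.strip()] = val.strip()
    return options


def parse_time_limit(value):
    """Convert a time limit such as '3600s', '90m' or '2h' to seconds."""
    match = re.fullmatch(r"(\d+)([smh])", value.strip())
    if match is None:
        raise ValueError("expected a number followed by s, m or h: " + repr(value))
    return int(match.group(1)) * _UNIT_SECONDS[match.group(2)]
```

For example, parse_time_limit("2h") gives the same number of seconds as parse_time_limit("7200s"), which can help when comparing limits written in different units.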


Bridge Mapping

Meta Integration Repository (MIR)
(based on the OMG CWM standard)
"Apache ORC File"
File System (File)
Mapping Comments
Attribute Array Elementary Item, Field, Attribute, Array Field, Elementary Item, Fixed Width Field, Partition Field  
Name Name  
Position Position, Offset  
Class Array Element, Group Item, Array Group Item, Array Object, Element, Object, Sheet  
Name Name  
PropertyElementTypeScope UDPs  
Name Name  
Scope Scope  
PropertyType UDP  
DataType Data Type  
DesignLevel Design Level  
Name Name  
Position Position  
StoreModel Cobol File, Parquet File, Delimited File, Avro File, Json File, Collection, Orc File, Xml File, Excel File, File, Fixed Width File  
Name Name