requires Internet access to https://repo.maven.apache.org/maven2/ and/or other tool sites to download drivers into <TDC_HOME>/data/download/MIMB/. For more information on how to retrieve third-party drivers when the TDC server cannot access the Internet, see this article.
|Tool Name||Parquet File|
|Tool Version||Parquet 1.x|
|Tool Web Site||http://parquet.apache.org/|
|Supported Methodology||[File System] Data Store (NoSQL / Hierarchical, Physical Data Model) via Java API on PARQUET File|
|Remote Repository Browsing for Model Selection|
Tool: Apache / Parquet File version Parquet 1.x via Java API on PARQUET File
Metadata: [File System] Data Store (NoSQL / Hierarchical, Physical Data Model)
Component: Parquet version 11.0.0
This bridge requires internet access to https://repo.maven.apache.org/maven2/ (and exceptionally a few other tool sites)
in order to download the necessary third party software libraries into $HOME/data/download/MIMB/
- If https fails, the bridge then tries with http.
- If a proxy is used to access the internet, you must configure that proxy in the JRE (see the -j option in the Miscellaneous parameter).
- If the bridge does not have access to the internet, that directory can be copied from another server that does have internet access.
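The https-then-http fallback described in the list above can be pictured with a minimal Python sketch. This is only an illustration, not the bridge's implementation: the function names `candidate_urls` and `fetch_with_fallback` are hypothetical, and `fetch` stands for any download callable that raises `OSError` on failure.

```python
from urllib.parse import urlsplit, urlunsplit


def candidate_urls(url):
    """Return the URL as given, followed by an http variant if it was https."""
    parts = urlsplit(url)
    urls = [url]
    if parts.scheme == "https":
        urls.append(urlunsplit(parts._replace(scheme="http")))
    return urls


def fetch_with_fallback(url, fetch):
    """Try each candidate in order and return the first successful result.

    `fetch` is a hypothetical callable taking a URL and raising OSError
    when the download fails.
    """
    last_error = None
    for candidate in candidate_urls(url):
        try:
            return fetch(candidate)
        except OSError as exc:
            last_error = exc
    raise last_error
```

Passing the download function in as a parameter keeps the retry order separate from the network code, so the ordering can be checked without internet access.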
By running this bridge, you hereby acknowledge responsibility for the license terms and any potential security vulnerabilities from these downloaded third party software libraries.
This bridge imports metadata from Parquet files using a Java API.
Note that this bridge does not perform any data-driven metadata discovery; instead, it reads the schema definition in the footer (at the end) of the Parquet file. Therefore, the bridge needs to load the entire Parquet file to reach the schema definition at the end.
If the Parquet file is not compressed, there is no file size limit, as the bridge automatically skips the data portion to reach the footer (although this may take time on large Parquet files). However, if the Parquet file is compressed, the bridge must first download the entire file in order to uncompress it. In that case, a default file size limit of 10 MB applies (larger files are ignored); this limit can be increased in the Miscellaneous parameter.
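To illustrate why the schema sits at the end of the file, here is a minimal Python sketch that locates the footer of an unencrypted Parquet file. It relies on the Apache Parquet file layout (the file ends with a 4-byte little-endian footer length followed by the magic bytes `PAR1`), not on anything specific to this bridge, which uses a Java API. The function name `footer_span` is hypothetical.

```python
import io
import struct

MAGIC = b"PAR1"


def footer_span(stream):
    """Return (offset, length) of the footer metadata in a seekable binary stream."""
    stream.seek(0, io.SEEK_END)
    size = stream.tell()
    if size < 12:  # leading magic + 4-byte footer length + trailing magic
        raise ValueError("too small to be a Parquet file")
    stream.seek(size - 8)
    tail = stream.read(8)
    if tail[4:] != MAGIC:
        raise ValueError("missing PAR1 trailer")
    (length,) = struct.unpack("<I", tail[:4])
    offset = size - 8 - length
    if offset < 4:
        raise ValueError("footer length exceeds file size")
    return offset, length
```

The sketch assumes a seekable stream. On storage that can only be read sequentially, everything before the footer has to be read or skipped first, which is consistent with the behavior described above.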
This bridge detects the following standard Parquet data types, as defined in https://parquet.apache.org/documentation/latest:
BOOLEAN: 1 bit boolean
INT32: 32 bit signed ints
INT64: 64 bit signed ints
INT96: 96 bit signed ints
FLOAT: IEEE 32-bit floating point values
DOUBLE: IEEE 64-bit floating point values
BYTE_ARRAY: arbitrarily long byte arrays.
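For reference, the types listed above can be kept in a small lookup table. The following Python sketch is only an illustration; the names `PARQUET_TYPE_BITS` and `describe` are hypothetical and not part of the bridge.

```python
# Physical types listed above, mapped to their fixed width in bits.
# None marks a variable-length type.
PARQUET_TYPE_BITS = {
    "BOOLEAN": 1,
    "INT32": 32,
    "INT64": 64,
    "INT96": 96,
    "FLOAT": 32,
    "DOUBLE": 64,
    "BYTE_ARRAY": None,
}


def describe(type_name):
    """Return a short width description for one of the listed types."""
    bits = PARQUET_TYPE_BITS[type_name.upper()]
    return "variable length" if bits is None else f"{bits} bit"
```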
Please refer to the individual parameter's tool tips for more detailed examples.
|File||Path to file to import||FILE||*.*||Mandatory|
|Miscellaneous||Specify miscellaneous options identified with a -option followed by a value if required:
-m <Java Memory's maximum size>
1G by default on a 64-bit JRE, or as set in conf/conf.properties, e.g.
-j <Java Runtime Environment command line options>
This option must be the last one in the Miscellaneous parameter as all the text after -j is passed "as is" to the JRE, e.g.
-j -Dname=value -Xms1G
The following option must be set when a proxy is used to access the internet. This is critical for reaching https://repo.maven.apache.org/maven2/ (and exceptionally a few other tool sites) in order to download the necessary third party software libraries.
-j -Dhttp.proxyHost=127.0.0.1 -Dhttp.proxyPort=3128 -Dhttps.proxyHost=127.0.0.1 -Dhttps.proxyPort=3128 -Dhttp.proxyUser=user -Dhttp.proxyPassword=pass -Dhttps.proxyUser=user -Dhttps.proxyPassword=pass
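The proxy option string above is long and easy to mistype, so it can help to generate it. The following Python sketch is a hypothetical helper (the names `proxy_jre_options` and `append_jre_options` are assumptions); it only reproduces the property names shown in the example above.

```python
def proxy_jre_options(host, port, user=None, password=None):
    """Build the -j value with the http and https proxy properties shown above."""
    props = []
    for scheme in ("http", "https"):
        props.append(f"-D{scheme}.proxyHost={host}")
        props.append(f"-D{scheme}.proxyPort={port}")
    if user is not None:
        for scheme in ("http", "https"):
            props.append(f"-D{scheme}.proxyUser={user}")
            props.append(f"-D{scheme}.proxyPassword={password}")
    return "-j " + " ".join(props)


def append_jre_options(misc, jre_options):
    """Put the -j text last, since everything after -j is passed to the JRE as is."""
    return f"{misc} {jre_options}".strip()
```

Calling `proxy_jre_options("127.0.0.1", 3128, "user", "pass")` yields the same string as the example above.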
-jre <Java Runtime Environment full path name>
It can be an absolute path to javaw.exe on Windows or a link/script path on Linux, e.g.
-jre "c:\Program Files\Java\jre1.8.0_211\bin\javaw.exe"
-v <Environment variable value>
None by default, e.g.
-v var1=value1 -v var2="value2 with spaces"
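As a rough illustration of how such a value could be read, the Python sketch below collects the `-v` pairs from a Miscellaneous string. It assumes shell-like quoting rules, which may differ from what the bridge actually applies, and the function name `parse_env_options` is hypothetical.

```python
import shlex


def parse_env_options(misc):
    """Collect -v NAME=VALUE pairs, stopping at -j (its text goes to the JRE as is)."""
    tokens = shlex.split(misc)
    env = {}
    i = 0
    while i < len(tokens):
        if tokens[i] == "-j":
            break
        if tokens[i] == "-v" and i + 1 < len(tokens):
            name, _, value = tokens[i + 1].partition("=")
            env[name] = value
            i += 2
        else:
            i += 1
    return env
```

With the example above as input, the result maps `var1` to `value1` and `var2` to `value2 with spaces`.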
-model.name <model name>
Override the model name, e.g.
-model.name "My Model Name"
-prescript <script name>
The script must be located in the bin directory, and have .bat or .sh extension.
The script path must not include any parent directory symbol (..).
The script should return exit code 0 to indicate success, or another value to indicate failure.
-prescript "script.bat arg1 arg2"
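The stated rules for the script can be checked before the value is submitted. The Python sketch below is a hypothetical pre-check (the names `check_prescript` and `prescript_succeeded` are assumptions). It covers the extension, parent-directory, and exit-code rules, and does not verify that the script is actually located in the bin directory.

```python
import re

ALLOWED_EXTENSIONS = (".bat", ".sh")


def check_prescript(value):
    """Split a -prescript value into (script, args) after checking the stated rules."""
    parts = value.split()
    if not parts:
        raise ValueError("no script name given")
    script = parts[0]
    # Reject any parent directory symbol, with either slash style.
    if ".." in re.split(r"[\\/]", script):
        raise ValueError("parent directory symbol (..) is not allowed")
    if not script.lower().endswith(ALLOWED_EXTENSIONS):
        raise ValueError("script must have a .bat or .sh extension")
    return script, parts[1:]


def prescript_succeeded(exit_code):
    """Exit code 0 indicates success; any other value indicates failure."""
    return exit_code == 0
```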
Clears the cache before the import, and therefore will run a full import without incremental harvesting.
Warning: this is a system option managed by the application calling the bridge and should not be set by users.
FILE SYSTEM OPTIONS
-tps <Processing Thread Pool Size's maximum count>
1 by default, e.g.
-tl <Processing Time Limit duration>
No limits by default. Time can be specified in seconds, minutes, or hours, e.g.
-fl <Processing File Limit count>
No limits by default, e.g.
-hadoop <Hadoop configuration options>
None by default, e.g.
Use to import the latest modified files when processing partitions defined in the Partitioned directories parameter.
-subst <path> <new path>
Use to associate a root path part with a drive or another path, e.g.
-subst K: C:/test
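The effect of such a substitution can be sketched in Python as a prefix replacement on the path. This is an assumption about the intended behavior rather than the bridge's code: the function name `apply_subst` is hypothetical, the comparison is case-sensitive, and backslashes are normalized to forward slashes.

```python
def apply_subst(path, old_root, new_root):
    """Replace a leading old_root with new_root, using forward slashes throughout."""
    norm = path.replace("\\", "/")
    old = old_root.replace("\\", "/").rstrip("/")
    new = new_root.replace("\\", "/").rstrip("/")
    # Match only on a whole path component, so "K:" does not match "K:extra".
    if norm == old or norm.startswith(old + "/"):
        return new + norm[len(old):]
    return norm
```

With the example above, a path such as `K:/data/sales.parquet` (an illustrative file name) would be read from `C:/test/data/sales.parquet`.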
Use to disable downloading of dependencies and use only the download cache.
Use this option to disable automatic partition detection (when the "Partition directories" option is empty).
DELIMITED FILE OPTIONS
Delimited file header: by default, the bridge automatically tries to detect headers while processing CSV files (based on the header column types). Use this option to disable header import (e.g. to hide sensitive data).
Number of rows to skip while processing a delimited file (0 by default), e.g.
-delimited.extra_separators <comma separated separators>
Delimited file's extra delimiters (separators by default are ), e.g.
PARQUET FILE OPTIONS
Ignore compressed Parquet files whose size is larger than the value of this option (the default value is 10 000 000 bytes), e.g.
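The size rule can be pictured as a simple filter. The Python sketch below is a hypothetical illustration (the names `files_to_import` and `DEFAULT_LIMIT_BYTES` are assumptions). Following the description earlier in this document, it applies the limit to compressed files only.

```python
DEFAULT_LIMIT_BYTES = 10_000_000


def files_to_import(files, limit=DEFAULT_LIMIT_BYTES):
    """Keep paths whose file is uncompressed or within the size limit.

    `files` is an iterable of (path, size_in_bytes, is_compressed) tuples.
    """
    kept = []
    for path, size, compressed in files:
        if compressed and size > limit:
            continue  # a compressed file above the limit is ignored
        kept.append(path)
    return kept
```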
Mapping information is not available