Apache Hadoop Distributed File System (HDFS Java API) - Import - 7.1

Talend Data Catalog Bridges

Talend Documentation Team
Talend Big Data Platform
Talend Data Fabric
Talend Data Management Platform
Talend Data Services Platform
Talend MDM Platform
Talend Real-Time Big Data Platform
Talend Data Catalog

Bridge Requirements

This bridge:
  • requires Internet access to https://repo.maven.apache.org/maven2/ and/or other tool sites to download drivers into <TDC_HOME>/data/download/MIMB/. For more information on how to retrieve third-party drivers when the TDC server cannot access the Internet, see this article.

Bridge Specifications

Vendor: Apache
Tool Name: Hadoop Distributed File System (HDFS)
Tool Version: 2.5
Tool Web Site: http://hadoop.apache.org/docs/r1.2.1/hdfs_user_guide.html
Supported Methodology: [File System] Multi-Model, Data Store (NoSQL / Hierarchical, Physical Data Model) via Java API
Incremental Harvesting
Multi-Model Harvesting
Remote Repository Browsing for Model Selection
Data Profiling

Import tool: Apache Hadoop Distributed File System (HDFS) 2.5 (http://hadoop.apache.org/docs/r1.2.1/hdfs_user_guide.html)
Import interface: [File System] Multi-Model, Data Store (NoSQL / Hierarchical, Physical Data Model) via Java API from Apache Hadoop Distributed File System (HDFS Java API)
Import bridge: 'ApacheHDFS' 10.1.0

This bridge requires Internet access to https://repo.maven.apache.org/maven2/ (and, exceptionally, a few other tool sites)
in order to download the necessary third-party software libraries into $HOME/data/download/MIMB/
(this directory can be copied from another MIMB server that has Internet access).
By running this bridge, you hereby acknowledge responsibility for the license terms and any potential security vulnerabilities from these downloaded third party software libraries.

The bridge uses the Apache Hadoop HDFS Java library (JARs) to access the Hadoop file system.
The library JAR files are located in the /java/Hadoop directory.
You may specify a Configuration files directory; that is often sufficient, because the values for the other bridge parameters can be specified there.
This bridge supports the following file formats:
- Flat File (CSV)
- Open Office Excel (XLSX)
- COBOL Copybook
- JSON (JavaScript Object Notation)
- Apache Avro
- Apache Parquet
- Apache ORC

as well as the compressed versions of the above formats:
- ZIP (as a compression format, not as archive format)
- LZ4
- Snappy (as standard Snappy format, not as Hadoop native Snappy format)

Please refer to the individual parameters' tooltips for more detailed examples.

Bridge Parameters

Parameter Name Description Type Values Default Scope
Configuration files directory Directory containing core-site.xml and hdfs-site.xml for your environment.

It is an optional parameter that allows you to reuse configuration files you have and avoid specifying Hadoop connection and Kerberos security details manually using other parameters.

To specify the details manually, leave this parameter empty. If you specify a directory that does not contain the configuration files, the bridge exits with an error.

You can override the parameters available in the configuration files using the bridge parameters.
For example, you can override the fs.default.name file parameter using the NameNode URI bridge parameter.
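
As an illustration of the override behavior described above, here is a minimal Python sketch, not the bridge's own code. It assumes the standard Hadoop configuration layout (a configuration element containing property elements with name and value children). The function names are hypothetical, and treating a missing file as an error for both files is an assumption. The sketch also looks up fs.defaultFS, the newer Hadoop name for the deprecated fs.default.name.

```python
import os
import xml.etree.ElementTree as ET

# Files the parameter description says the directory contains.
REQUIRED_FILES = ("core-site.xml", "hdfs-site.xml")


def load_hadoop_properties(config_dir):
    """Return {name: value} merged from core-site.xml and hdfs-site.xml."""
    properties = {}
    for file_name in REQUIRED_FILES:
        path = os.path.join(config_dir, file_name)
        if not os.path.isfile(path):
            # Assumption: mirrors "the bridge exits with an error".
            raise FileNotFoundError(f"missing configuration file: {path}")
        root = ET.parse(path).getroot()
        for prop in root.iter("property"):
            name = prop.findtext("name")
            if name:
                properties[name.strip()] = (prop.findtext("value") or "").strip()
    return properties


def effective_namenode_uri(properties, namenode_uri_parameter=""):
    """A non-empty bridge parameter wins over the value from the files."""
    if namenode_uri_parameter:
        return namenode_uri_parameter
    return properties.get("fs.defaultFS") or properties.get("fs.default.name")
```

Calling effective_namenode_uri(props) returns the value from the files, while passing a second argument returns that argument instead.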
NameNode URI URI of the Hadoop NameNode, like hdfs://host:8020
To access the NameNode through the WebHDFS REST interface, specify the 'webhdfs' protocol, like webhdfs://host:8020
STRING   [web]hdfs://[server host]:[port]  
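
The expected [web]hdfs://[server host]:[port] shape can be checked before running an import. The following Python sketch is only an illustration: the function name is hypothetical, and rejecting any scheme other than hdfs or webhdfs is an assumption based on the two forms shown above.

```python
from urllib.parse import urlsplit

# Assumption: only the two schemes named in the parameter description.
ALLOWED_SCHEMES = {"hdfs", "webhdfs"}


def parse_namenode_uri(uri):
    """Split a NameNode URI into (scheme, host, port) or raise ValueError."""
    parts = urlsplit(uri)
    if parts.scheme not in ALLOWED_SCHEMES:
        raise ValueError(f"unsupported scheme in {uri!r}")
    if not parts.hostname:
        raise ValueError(f"missing host in {uri!r}")
    try:
        port = parts.port  # raises ValueError when the port is not numeric
    except ValueError as exc:
        raise ValueError(f"invalid port in {uri!r}") from exc
    return parts.scheme, parts.hostname, port
```

For example, parse_namenode_uri("hdfs://host:8020") returns ("hdfs", "host", 8020).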
Root directory Enter the directory containing metadata files, or select it using the browsing tool. The bridge provides a browsing depth of up to 3 levels. REPOSITORY_MODEL      
Include filter The include folder and file filter pattern relative to the root directory.
The pattern uses an extended, case-sensitive Unix glob expression syntax.
Here are some common examples:
*.* - include any file at the root level
*.csv - include only csv files at the root level
**.csv - include only csv files at any level
*.{csv,gz} - include only csv or gz files at the root level
dir\*.csv - include only csv files in the 'dir' folder
dir\**.csv - include only csv files under 'dir' folder at any level
dir\**.* - include any file under 'dir' folder at any level
f.csv - include only f.csv at the root level
**\f.csv - include only f.csv at any level
**dir\** - include all files under any 'dir' folder at any level
**dir1\dir2\** - include all files under any 'dir2' folder under any 'dir1' folder at any level
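
To make the pattern semantics in these examples concrete, here is a rough Python sketch that translates such a pattern into a regular expression. It is an approximation written for illustration, not the bridge's matcher. It assumes that * stays within one folder level, that ** crosses levels, that {a,b} lists alternatives, and that \ and / are interchangeable separators. The function names are hypothetical.

```python
import re


def glob_to_regex(pattern):
    """Translate an extended glob pattern into a compiled regex."""
    out = []
    in_group = False
    i = 0
    while i < len(pattern):
        c = pattern[i]
        if c == "*":
            if i + 1 < len(pattern) and pattern[i + 1] == "*":
                out.append(".*")          # '**' crosses folder levels
                i += 2
                continue
            out.append(r"[^/\\]*")        # '*' stays within one level
        elif c == "?":
            out.append(r"[^/\\]")
        elif c in "/\\":
            out.append(r"[/\\]")          # accept either separator
        elif c == "{":
            out.append("(?:")
            in_group = True
        elif c == "}" and in_group:
            out.append(")")
            in_group = False
        elif c == "," and in_group:
            out.append("|")
        else:
            out.append(re.escape(c))
        i += 1
    return re.compile("".join(out))


def matches(pattern, relative_path):
    """Case-sensitive match of a path relative to the root directory."""
    return glob_to_regex(pattern).fullmatch(relative_path) is not None
```

With this sketch, matches("*.csv", "dir/a.csv") is False while matches("**.csv", "dir/a.csv") is True, which mirrors the root-level versus any-level distinction in the examples. Edge cases, such as whether **\f.csv also matches f.csv at the root level, may differ in the bridge.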
Exclude filter The exclude folder and file filter pattern relative to the root directory.
The pattern uses the same syntax as the Include filter. See it for the syntax details and examples.
Files that match the exclude filter are skipped.
When both the include and exclude filters are empty, all folders and files under the Root directory are included.
When the include filter is empty and the exclude filter is not, all folders and files under the Root directory are included except those matching the exclude filter.
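
The way the two filters combine can be summarized in a few lines of Python. This is a sketch of the rules stated above, not the bridge's code. The function name is hypothetical, and the default matcher, fnmatch.fnmatchcase, is only a simplified stand-in: its * also crosses folder separators, unlike the syntax described for the Include filter.

```python
from fnmatch import fnmatchcase


def is_selected(relative_path, include="", exclude="", matches=fnmatchcase):
    """Apply the include/exclude rules to one path under the Root directory.

    `matches(path, pattern)` is a pluggable matcher; the default is a
    simplified stand-in for the bridge's glob syntax.
    """
    # Files that match the exclude filter are always skipped.
    if exclude and matches(relative_path, exclude):
        return False
    # An empty include filter means "everything not excluded".
    if not include:
        return True
    return matches(relative_path, include)
```

For example, is_selected("a/b.txt", exclude="*.csv") returns True, and is_selected("a/b.csv", exclude="*.csv") returns False.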
Partition directories Paths of file-based partition directories.
The bridge tries to detect partitions automatically. It can take a long time when partitions have a lot of files.
You can shortcut the detection process for some or all partitions by specifying them in this parameter.
Specify the partition directory path relative to the Root directory.
Use . to specify the root directory as the partitioned directory.
Separate multiple paths with the , (or ;) character.

ETL tools can read from and write to pattern-based partition directories.
For example, an ETL tool can read all *.csv files from a folder F. The ETL bridge represents it as the '*.csv' dataset in the 'F' folder (F/*.csv).
You can instruct this bridge to generate the matching dataset by specifying its name in square brackets after the folder name, like F[*.csv].
The same is true for application-specific partitions.
For example, an ETL tool can write files under folder F to partition sub-folders named using the 'getDate@[yyyyMMdd]' function expression.
The result is represented as the 'getDate@[yyyyMMdd]' dataset in the 'F' folder (F/getDate@[yyyyMMdd]).
Again, you can instruct this bridge to generate the matching dataset by specifying something like F/[getDate@[yyyyMMdd]].

You may specify additional information about the internal structure of a partitioned directory, using the [dataset name] and {partitioned column name} patterns, for the following cases:
For application partitions like:
use: zone/[po]/{region}/{year}/*.csv or
if partition column names are not important. They will be stitched by position.

For custom application partitions like:
use: zone/*/{year}/[data]/*.csv, zone/*/{year}/[log]/*.txt

For file based partitions like:

use: zone/mlcs.[dataset1]_data_document_{date}.csv,zone/mlcs.[dataset2]_data_document_{date}.xml
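
To show how the [dataset name] and {partitioned column name} markers can be read out of such a value, here is a small Python sketch. It is an illustration only and its function names are hypothetical. It handles just the simple, non-nested forms: a nested name such as [getDate@[yyyyMMdd]] or a value whose patterns contain commas inside braces would need a real parser.

```python
import re


def split_specs(value):
    """Split the parameter value on ',' or ';' into individual paths."""
    return [part.strip() for part in re.split(r"[,;]", value) if part.strip()]


def describe_spec(spec):
    """Extract the dataset name and partition column names from one path."""
    datasets = re.findall(r"\[([^\[\]]+)\]", spec)   # text inside [...]
    columns = re.findall(r"\{([^{}]+)\}", spec)      # text inside {...}
    return {
        "dataset": datasets[0] if datasets else None,
        "columns": columns,
    }
```

For example, describe_spec("zone/[po]/{region}/{year}/*.csv") returns {"dataset": "po", "columns": ["region", "year"]}.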
Partition file number Number of files to scan when analyzing partition directories. This parameter has no effect when the 'Partition directories' parameter is specified. NUMERIC      
Hadoop properties Custom Hadoop and HDFS configuration properties.

The bridge uses a default configuration to access a Hadoop distribution. If you need to use a custom configuration, specify its parameter values here.

For further information about the properties required by Hadoop and its related systems such as HDFS and Hive, see the documentation of the Hadoop distribution you are using or see Apache's Hadoop documentation on http://hadoop.apache.org/docs and then select the version of the documentation you want. For demonstration purposes, the links to some properties are listed below:
Typically, the HDFS-related properties can be found in the hdfs-default.xml file of your distribution, such as
Keytab file Full path to the Kerberos keytab file. The file is necessary to log into a Kerberos-enabled Hadoop system. It contains pairs of Kerberos principals and encrypted keys. You need to enter the Principal using the Principal user parameter.

The user that runs the bridge is not necessarily the one the Principal designates but must have the right to read the keytab file being used. For example, the user name you are using to run the bridge is UserA and the principal to be used is UserB; in this situation, ensure that UserA has the right to read the keytab file to be used.
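
A quick pre-flight check for the read permission described above can be sketched in Python as follows. The function name is hypothetical, and the check only tests file access for the current operating system user. It does not validate the keytab contents or the principal.

```python
import os


def keytab_is_readable(keytab_path):
    """Return True if the keytab exists and the current user can read it."""
    return os.path.isfile(keytab_path) and os.access(keytab_path, os.R_OK)
```

Run it as the same operating system user that runs the bridge (UserA in the example above).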
Principal User principal name. See the “Keytab file” parameter documentation for details. STRING      
Username User authentication name of HDFS. Sometimes referred to as proxy name.
The parameter is only used for Kerberos authentication.
It does not impact the user which runs the bridge.
HDFS encryption key provider (KMS) The location of the KMS proxy. For example, kms://http@localhost:16000/kms.
Specify the HDFS encryption key provider only when the HDFS transparent encryption has been enabled in your cluster. Leave the value empty otherwise.
For further information about the HDFS transparent encryption and its KMS proxy, see Transparent Encryption in HDFS at https://hadoop.apache.org/docs/current/hadoop-project-dist/hadoop-hdfs/TransparentEncryption.html.
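
The kms:// form embeds the transport and address of the KMS. As an illustration, the Python sketch below converts such a value to the plain URL it points at, for example to test connectivity by hand. The function name is hypothetical, and accepting only the http and https transports is an assumption.

```python
from urllib.parse import urlsplit


def kms_uri_to_url(kms_uri):
    """Convert kms://<transport>@<host>:<port>/<path> to a plain URL."""
    parts = urlsplit(kms_uri)
    if parts.scheme != "kms" or "@" not in parts.netloc:
        raise ValueError(f"not a KMS provider URI: {kms_uri!r}")
    transport, _, host_port = parts.netloc.partition("@")
    if transport not in ("http", "https"):
        raise ValueError(f"unexpected transport in {kms_uri!r}")
    return f"{transport}://{host_port}{parts.path}"
```

For example, kms_uri_to_url("kms://http@localhost:16000/kms") returns "http://localhost:16000/kms".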
Incremental import Specifies whether to import only the changes made in the source or to re-import everything (as specified in other parameters).

True - import only the changes made in the source.
False - import everything (as specified in other parameters).

An internal cache is maintained for each metadata source, which contains previously imported models. If this is the first import or if the internal cache has been deleted or corrupted, the bridge will behave as if this parameter is set to 'False'.
Miscellaneous Specify miscellaneous options identified with a -letter and value.

For example, -m 4G -f 100 -j -Dname=value -Xms1G

-m the maximum Java memory size whole number (e.g. -m 4G or -m 2500M ).
-v set environment variable(s) (e.g. -v var1=value -v var2="value with spaces").
-j the last option that is followed by Java command line options (e.g. -j -Dname=value -Xms1G).
-hadoop key1=val1;key2=val2 to manually set Hadoop configuration options
-tps 10 maximum thread pool size
-tl 3600s processing time limit in seconds (s), minutes (m), or hours (h)
-fl 1000 limit on the number of files processed
-delimited.top_rows_skip 1 number of rows to skip while processing CSV files
-delimited.extra_separators ~,||,|~ comma-separated extra delimiters, each of which will be used while processing CSV files
-delimited.no_header by default, the bridge automatically tries to detect headers while processing CSV files (based on the header column types); use this option to disable header import (for example, to hide sensitive data)
-fresh.partition.models - use to import the latest modified files when processing partitions defined in the 'Partition directories' parameter
-subst K: C:/test - use to associate a root path part with a drive or another path.
-skip.download - use to disable dependency downloading and use only the download cache
-prescript [cmd] - runs a script command before bridge execution. Example: -prescript "script.bat"
The script must be located in the bin directory, and have a .bat or .sh extension.
The script path must not include any parent directory symbol (..)
The script should return exit code 0 to indicate success, or another value to indicate failure.
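
Several of the option values above have a small syntax of their own. The Python sketch below shows one way to interpret three of them: the -tl time limit, the -hadoop key/value list, and the -prescript naming rules. It is an illustration under stated assumptions, not the bridge's parser. The function names are hypothetical, and requiring a unit suffix on the time limit is an assumption based on the 3600s example.

```python
import re

_UNIT_SECONDS = {"s": 1, "m": 60, "h": 3600}


def parse_time_limit(value):
    """Convert a -tl value such as '3600s', '90m' or '2h' to seconds."""
    match = re.fullmatch(r"(\d+)([smh])", value)
    if not match:
        raise ValueError(f"unrecognized time limit: {value!r}")
    return int(match.group(1)) * _UNIT_SECONDS[match.group(2)]


def parse_hadoop_pairs(value):
    """Convert a -hadoop value such as 'key1=val1;key2=val2' to a dict."""
    pairs = {}
    for item in value.split(";"):
        if not item:
            continue
        key, sep, val = item.partition("=")
        if not sep or not key:
            raise ValueError(f"expected key=value, got {item!r}")
        pairs[key] = val
    return pairs


def is_valid_prescript(name):
    """Check the documented -prescript rules: .bat/.sh, no '..' component."""
    if ".." in re.split(r"[/\\]", name):
        return False
    return name.lower().endswith((".bat", ".sh"))
```

For example, parse_time_limit("2h") returns 7200, and is_valid_prescript("../script.sh") returns False.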


Bridge Mapping

Mapping information is not available