Apache Hadoop HiveQL Script - Import - 7.1

Talend Data Catalog Bridges

Talend Documentation Team
Talend Big Data Platform
Talend Data Fabric
Talend Data Management Platform
Talend Data Services Platform
Talend MDM Platform
Talend Real-Time Big Data Platform
Talend Data Catalog

Bridge Requirements

This bridge:
  • requires Internet access to https://repo.maven.apache.org/maven2/ and/or other tool sites to download drivers into <TDC_HOME>/data/download/MIMB/. For more information on how to retrieve third-party drivers when the TDC server cannot access the Internet, see this article.

Bridge Specifications

Vendor Apache
Tool Name Hadoop Hive Database
Tool Version 0.13
Tool Web Site http://hive.apache.org/
Supported Methodology [Data Integration] Multi-Model, Data Store (Physical Data Model), (Source and Target Data Stores, Transformation Lineage, Expression Parsing) via SQL TXT File
Incremental Harvesting
Multi-Model Harvesting
Data Profiling
Remote Repository Browsing for Model Selection

Import tool: Apache Hadoop Hive Database 0.13 (http://hive.apache.org/)
Import interface: [Data Integration] Multi-Model, Data Store (Physical Data Model), (Source and Target Data Stores, Transformation Lineage, Expression Parsing) via SQL TXT File from Apache Hadoop Hive Database SQL DML (DI/ETL) Script (HiveQL)
Import bridge: 'SqlScriptApacheHiveQL' 11.0.0

This bridge requires Internet access to https://repo.maven.apache.org/maven2/ (and, exceptionally, a few other tool sites)
in order to download the necessary third-party software libraries into $HOME/data/download/MIMB/
(this directory can be copied from another MIMB server that has Internet access).
By running this bridge, you hereby acknowledge responsibility for the license terms and any potential security vulnerabilities from these downloaded third party software libraries.

WARNING: This database SQL script import bridge should only be used for external database SQL scripts that are scheduled on a regular basis, typically to load the database. Do not use this bridge for the DDL SQL scripts used to create (or update) the database schemas, packages, tables, views, stored procedures, etc. (as they heavily depend on each other). Instead, use the dedicated live database import via JDBC, which will generate a complete and detailed data flow lineage integrating all transformations with stored procedures, views, etc. (which might have been created by many such DDL SQL scripts).
The purpose of this HiveQL SQL script import bridge is to detect and parse all its embedded SQL statements in order to generate the exact scope (data models) of the involved source and target data stores, as well as the data flow lineage and impact analysis (data integration ETL/ELT model) between them.

Bridge Parameters

Parameter Name Description Type Values Default Scope
Directory Select a directory with the textual files that contain scripts to import STRING     Mandatory
Include filter The include folder and file filter pattern relative to the root directory.
The pattern uses extended Unix glob case-sensitive expression syntax.
Here are some common examples:
*.* - include any file at the root level
*.csv - include only csv files at the root level
**.csv - include only csv files at any level
*.{csv,gz} - include only csv or gz files at the root level
dir\*.csv - include only csv files in the 'dir' folder
dir\**.csv - include only csv files under 'dir' folder at any level
dir\**.* - include any file under 'dir' folder at any level
f.csv - include only f.csv under root level
**\f.csv - include only f.csv at any level
**dir\** - include all files under any 'dir' folder at any level
**dir1\dir2\** - include all files under any 'dir2' folder under any 'dir1' folder at any level
Exclude filter The exclude folder and file filter pattern relative to the root directory.
The pattern uses the same syntax as the Include filter. See it for the syntax details and examples.
Files that match the exclude filter are skipped.
When both the include and exclude filters are empty, all folders and files under the Root directory are included.
When the include filter is empty and the exclude filter is not, all folders and files under the Root directory are included except those matching the exclude filter.
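
The filter rules above can be approximated with a short Python sketch. It is an illustration only: the bridge's actual matcher is not documented here, and the function names glob_to_regex and is_included are hypothetical. The sketch assumes that '*' stays within one folder level, that '**' crosses folder levels, that '{a,b}' lists alternatives, and that both '\' and '/' act as folder separators. Under these assumptions, a pattern such as **\f.csv requires at least one parent folder, which may differ from the bridge's behavior.

```python
import re


def glob_to_regex(pattern: str) -> re.Pattern:
    """Translate an extended glob into a case-sensitive regex (approximation)."""
    # Assumption: '\' and '/' are interchangeable folder separators.
    pattern = pattern.replace("\\", "/")
    out = []
    i = 0
    while i < len(pattern):
        ch = pattern[i]
        if pattern.startswith("**", i):
            out.append(".*")  # '**' may cross folder boundaries
            i += 2
        elif ch == "*":
            out.append("[^/]*")  # '*' stays within one folder level
            i += 1
        elif ch == "{" and "}" in pattern[i:]:
            close = pattern.index("}", i)
            alternatives = pattern[i + 1:close].split(",")
            out.append("(?:" + "|".join(re.escape(a) for a in alternatives) + ")")
            i = close + 1
        else:
            out.append(re.escape(ch))  # everything else matches literally
            i += 1
    return re.compile("".join(out))


def is_included(rel_path: str, include: str = "", exclude: str = "") -> bool:
    """Apply the include filter first, then skip files matching the exclude filter."""
    rel_path = rel_path.replace("\\", "/")
    if include and not glob_to_regex(include).fullmatch(rel_path):
        return False
    if exclude and glob_to_regex(exclude).fullmatch(rel_path):
        return False
    return True


if __name__ == "__main__":
    for path in ["a.csv", "dir/a.csv", "dir/sub/a.csv"]:
        print(path, is_included(path, include="dir\\**.csv"))
```

An empty include filter accepts every file, so is_included("dir/a.txt") returns True, and is_included("dir/a.csv", exclude="**.csv") returns False.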
Hadoop configuration directory Directory containing copies of the core-site.xml and hdfs-site.xml files that are compatible with the remote cluster you are trying to access. DIRECTORY      
Incremental import Specifies whether to import only the changes made in the source or to re-import everything (as specified in other parameters).

True - import only the changes made in the source.
False - import everything (as specified in other parameters).

An internal cache is maintained for each metadata source, which contains previously imported models. If this is the first import or if the internal cache has been deleted or corrupted, the bridge will behave as if this parameter is set to 'False'.
Miscellaneous Specify miscellaneous options identified with a -letter and value.

For example, -s c:\values.txt -e UTF-16 -d schema

-s: path to a file that resolves Shell parameters in either Windows (%param%) or Linux (${param}, $1) format. This parameter can be used to define a path to the key/value pair yaml file. The path can be enclosed in double quotes if it contains spaces or any special characters. The records from the file are used to preprocess the scripts and replace the corresponding Shell parameters with the actual values. The key literals must not be decorated with the escape characters, and the matching rules are case sensitive. The colon character ':' is used as the key/value pair delimiter and must be escaped with a backslash '\' if it is part of the parameter name. For example, for the script 'INSERT INTO %SCHEMA1%.t1(c1) SELECT a from %SCHEMA2%.t2;' the parameter file can be organized as follows:
# common shell parameter map
SCHEMA1: actual_schema1
SCHEMA2: actual_schema2
# individual script maps
${table_name}: actual_table
${year_var}: 1993

If the bridge does not find the yaml file, it generates a new one and fills it with key/default value pairs.
The yaml file contains a "common shell parameter map" section and an "individual script maps" section. The bridge uses the common key/value pairs to replace shell parameters with their values in all scripts. The "individual script maps" section contains pairs for individual scripts.
-d: default schema. Specifies a schema name for objects that do not have one defined explicitly.
-e: encoding. This value will be used to load text from the specified script files. By default, UTF-8 will be used. Here are some other possible values: UTF-16, UTF-16BE, US-ASCII.
-m: the maximum Java memory size, as a whole number (e.g. -m 4G or -m 2500M).
-pppd: enables the DI/ETL post-processor processing of DI/ETL designs in order to create the design connections and connection data sets.
-j: must be the last option; it is followed by Java command line options (e.g. -j -Dname=value -Xms1G).
-cs * - create separate connections for all database schemas
-cs c1, c2 - create separate connections for all database schemas of 'c1' and 'c2' connections
-cs app1=c.s1 - create 'app1' connection for the 's1' schema in the 'c' connection
-prescript [cmd] - runs a script command before bridge execution. Example: -prescript "script.bat"
The script must be located in the bin directory, and have a .bat or .sh extension.
The script path must not include any parent directory symbol (..).
The script should return exit code 0 to indicate success, or another value to indicate failure.
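
As an illustration of the -s option described above, the following Python sketch shows one way the key/value file could be read and applied to a script. It is not the bridge's implementation. The function names parse_param_map and substitute are hypothetical, the file is treated as one flat map (the split between the common and individual script sections is ignored), and a bare key such as SCHEMA1 is assumed to match both the %SCHEMA1% and ${SCHEMA1} forms.

```python
import re


def parse_param_map(text: str) -> dict:
    """Read 'key: value' records, skipping blank lines and '#' comments."""
    params = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        # Split on the first colon that is not escaped with a backslash.
        parts = re.split(r"(?<!\\):", line, maxsplit=1)
        if len(parts) != 2:
            continue  # not a key/value record
        key = parts[0].replace("\\:", ":").strip()
        params[key] = parts[1].strip()
    return params


def substitute(script: str, params: dict) -> str:
    """Replace shell parameters with their values (case-sensitive)."""
    for key, value in params.items():
        if key.startswith(("%", "$")):
            tokens = [key]  # key already written in shell form
        else:
            # Assumption: a bare key matches both Windows and Linux forms.
            tokens = ["%" + key + "%", "${" + key + "}"]
        for token in tokens:
            script = script.replace(token, value)
    return script


if __name__ == "__main__":
    sample_file = """\
# common shell parameter map
SCHEMA1: actual_schema1
SCHEMA2: actual_schema2
# individual script maps
${table_name}: actual_table
${year_var}: 1993
"""
    params = parse_param_map(sample_file)
    print(substitute(
        "INSERT INTO %SCHEMA1%.t1(c1) SELECT a from %SCHEMA2%.t2;", params))
```

Because the replacement is a plain string match, a parameter written as %schema1% is left untouched when the file only defines SCHEMA1, which mirrors the case-sensitive matching rule stated for the -s option.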


Bridge Mapping

Mapping information is not available