Importing Hive or HDFS datasets on a multi-node cluster - 8.0

Talend Data Preparation User Guide

Version
8.0
Language
English
Product
Talend Big Data
Talend Big Data Platform
Talend Data Fabric
Talend Data Integration
Talend Data Management Platform
Talend Data Services Platform
Talend ESB
Talend MDM Platform
Talend Real-Time Big Data Platform
Module
Talend Data Preparation
Content
Data Quality and Preparation > Cleansing data
Last publication date
2024-03-26

To enable the import of Hive or HDFS datasets stored on a multi-node cluster, you must edit the Components Catalog configuration files.

Important: Make sure that the keytab file used to authenticate to HDFS is accessible to all the workers on the cluster.
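
For example, a minimal way to place the keytab at the same path on every worker node could look like the following sketch (the worker host names and keytab path below are placeholders):

    # Copy the keytab to the same location on each worker node (example hosts and path).
    for worker in worker1 worker2 worker3; do
      scp /path/to/the/keytab/keytab_file.keytab "$worker":/path/to/the/keytab/
    done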

Procedure

  1. Create a <components_catalog>/tcomp_gss.conf file, and add the following configuration parameters:
    com.sun.security.jgss.initiate {
        com.sun.security.auth.module.Krb5LoginModule required
        useTicketCache=false
        doNotPrompt=true
        useKeyTab=true
        keyTab="/path/to/the/keytab/keytab_file.keytab"
        principal="your@principalHere"
        debug=true;
    };
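    Before referencing them in this file, you can optionally check that the keytab and principal pair is valid with the standard Kerberos client tools, for example:
    # List the principals stored in the keytab, then request a ticket with it.
    klist -kt /path/to/the/keytab/keytab_file.keytab
    kinit -kt /path/to/the/keytab/keytab_file.keytab your@principalHere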
  2. In the <components_catalog>/start.sh file, set the following Java system properties on the THE_CMD line so that it references the previously created <components_catalog>/tcomp_gss.conf file:
    THE_CMD="$JAVA_BIN $SCRIPT_JAVA_OPTS -Djava.security.auth.login.config=/path/to/tcomp_gss.conf -Djava.security.krb5.debug=true
    -Djava.security.krb5.conf=/etc/krb5.conf -Djavax.security.auth.useSubjectCredsOnly=false -cp
    \"$APP_CLASSPATH\" $APP_CLASS $*"
  3. When importing your dataset in Talend Data Preparation, the JDBC URL used to connect to Hive must follow this model:
    jdbc:hive2://host:10000/default;principal=<your_principal>
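    For example, with a hypothetical HiveServer2 host and Hive service principal, the URL could look like this:
    jdbc:hive2://hadoop01.example.com:10000/default;principal=hive/hadoop01.example.com@EXAMPLE.COM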

Results

You can now import Hive or HDFS datasets stored on a multi-node cluster.