Importing Hive or HDFS datasets on a multi-node cluster - 6.5

Talend Data Preparation User Guide

To enable the import of Hive or HDFS datasets stored on a multi-node cluster, you must edit the Components Catalog configuration files.

Important: Make sure that the keytab file used to authenticate to HDFS is accessible to all the worker nodes of the cluster.
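
A quick way to check this on each worker node is to list the principals stored in the keytab and to request a ticket with it. The keytab path and principal below are the same placeholders used in the procedure; replace them with your own values:

  # List the principals contained in the keytab (the path is a placeholder):
  klist -kt /path/to/the/keytab/keytab_file.keytab

  # Verify that the keytab can actually obtain a Kerberos ticket for the principal:
  kinit -kt /path/to/the/keytab/keytab_file.keytab your@principalHere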

Procedure

  1. Create a <components_catalog>/tcomp_gss.conf file, and add the following configuration parameters:
    com.sun.security.jgss.initiate {
        com.sun.security.auth.module.Krb5LoginModule required
        useTicketCache=false
        doNotPrompt=true
        useKeyTab=true
        keyTab="/path/to/the/keytab/keytab_file.keytab"
        principal="your@principalHere"
        debug=true;
    };
  2. In the <components_catalog>/start.sh file, set the THE_CMD variable as follows, so that the java.security.auth.login.config property points to the <components_catalog>/tcomp_gss.conf file created in the previous step:
    THE_CMD="$JAVA_BIN $SCRIPT_JAVA_OPTS -Djava.security.auth.login.config=<components_catalog>/tcomp_gss.conf -Djava.security.krb5.debug=true -Djava.security.krb5.conf=/etc/krb5.conf -Djavax.security.auth.useSubjectCredsOnly=false -cp \"$APP_CLASSPATH\" $APP_CLASS $*"
  3. When importing your dataset in Talend Data Preparation, make sure that the JDBC URL used to connect to Hive follows this model (you can test the URL beforehand, as shown in the sketch after this procedure):
    jdbc:hive2://host:10000/default;principal=<your_principal>
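
Before running the import in Talend Data Preparation, you can check that the URL works from a shell with Beeline, the command-line client shipped with Hive. The host, port, database, and principal below are placeholders taken from the model above; adjust them to your cluster:

  # Obtain a Kerberos ticket with the keytab used by the Components Catalog
  # (path and principal are placeholders):
  kinit -kt /path/to/the/keytab/keytab_file.keytab your@principalHere

  # Open a Beeline session against the same JDBC URL:
  beeline -u "jdbc:hive2://host:10000/default;principal=<your_principal>"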

Results

You can now import Hive or HDFS datasets stored on a multi-node cluster.