
Importing Hive or HDFS datasets on a multi-node cluster

To enable the import of Hive or HDFS datasets stored on a multi-node cluster, you must edit the Components Catalog configuration files.

Important: Make sure that the keytab file used to authenticate to HDFS is accessible to all the workers on the cluster.
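A quick way to verify this on a given node is to attempt a keytab login with the Hadoop client libraries. The following is a minimal sketch, assuming hadoop-common is on the classpath; the principal and keytab path are placeholders that must match your environment:

    import java.io.IOException;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.security.UserGroupInformation;

    public class KeytabLoginCheck {
        public static void main(String[] args) throws IOException {
            // Enable Kerberos authentication for the Hadoop client.
            Configuration conf = new Configuration();
            conf.set("hadoop.security.authentication", "kerberos");
            UserGroupInformation.setConfiguration(conf);

            // Placeholder principal and keytab path: use the same values
            // as in the tcomp_gss.conf file described in the procedure below.
            UserGroupInformation.loginUserFromKeytab(
                "your@principalHere", "/path/to/the/keytab/keytab_file.keytab");

            System.out.println("Logged in as: " + UserGroupInformation.getLoginUser());
        }
    }

Run this check on each worker node; a login failure on any node indicates that the keytab is missing or unreadable there.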

Procedure

  1. Create a <components_catalog>/tcomp_gss.conf file, and add the following configuration parameters:
    com.sun.security.jgss.initiate {
    com.sun.security.auth.module.Krb5LoginModule required
    useTicketCache=false
    doNotPrompt=true
    useKeyTab=true
    keyTab="/path/to/the/keytab/keytab_file.keytab"
    principal="your@principalHere"
    debug=true;
    };
  2. In the <components_catalog>/start.sh file, set the following parameters so that they reference the <components_catalog>/tcomp_gss.conf file created in the previous step:
    THE_CMD="$JAVA_BIN $SCRIPT_JAVA_OPTS -Djava.security.auth.login.config=/path/to/tcomp_gss.conf -Djava.security.krb5.debug=true
    -Djava.security.krb5.conf="/etc/krb5.conf" -Djavax.security.auth.useSubjectCredsOnly=false -cp
    \"$APP_CLASSPATH\" $APP_CLASS $*"
  3. When importing your dataset in Talend Data Preparation, the JDBC URL used to connect to Hive must follow this format (a connection sketch is shown after this procedure):
    jdbc:hive2://host:10000/default;principal=<your_principal>
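To check the Kerberized connection outside of Talend Data Preparation, you can open the same URL with the Hive JDBC driver directly. This is a minimal sketch, assuming the hive-jdbc driver and its dependencies are on the classpath; the host, port, principal, and file paths are placeholders:

    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.ResultSet;
    import java.sql.Statement;

    public class HiveKerberosConnectionCheck {
        public static void main(String[] args) throws Exception {
            // Same Java system properties as set in start.sh.
            System.setProperty("java.security.auth.login.config", "/path/to/tcomp_gss.conf");
            System.setProperty("java.security.krb5.conf", "/etc/krb5.conf");
            System.setProperty("javax.security.auth.useSubjectCredsOnly", "false");

            // Load the Hive JDBC driver.
            Class.forName("org.apache.hive.jdbc.HiveDriver");

            // URL format from step 3; replace host and <your_principal> with your values.
            String url = "jdbc:hive2://host:10000/default;principal=<your_principal>";
            try (Connection conn = DriverManager.getConnection(url);
                 Statement stmt = conn.createStatement();
                 ResultSet rs = stmt.executeQuery("SHOW TABLES")) {
                while (rs.next()) {
                    System.out.println(rs.getString(1));
                }
            }
        }
    }

If this listing succeeds, the JAAS configuration, Kerberos settings, and JDBC URL are consistent, and the import from Talend Data Preparation can use the same URL.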

Results

You can now import Hive or HDFS datasets stored on a multi-node cluster.
