Importing Hive or HDFS datasets on a multi-node cluster - 8.0

Talend Data Preparation User Guide

Version
8.0
Language
English
Product
Talend Big Data
Talend Big Data Platform
Talend Data Fabric
Talend Data Integration
Talend Data Management Platform
Talend Data Services Platform
Talend ESB
Talend MDM Platform
Talend Real-Time Big Data Platform
Module
Talend Data Preparation
Content
Data Quality and Preparation > Cleansing data
Last publication date
2024-03-26

To enable the import of Hive or HDFS datasets stored on a multi-node cluster, you must edit the Components Catalog configuration files.

Important: Make sure that the keytab file used to authenticate to HDFS is accessible to all the workers on the cluster.
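
For example, a minimal way to place the keytab at the same path on every worker node could look like the following sketch (the worker host names and keytab path below are placeholders):

    # Copy the keytab to the same location on each worker node (example hosts and path).
    for worker in worker1 worker2 worker3; do
      scp /path/to/the/keytab/keytab_file.keytab "$worker":/path/to/the/keytab/
    done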

Procedure

  1. Create a <components_catalog>/tcomp_gss.conf file, and add the following configuration parameters:
    com.sun.security.jgss.initiate {
        com.sun.security.auth.module.Krb5LoginModule required
        useTicketCache=false
        doNotPrompt=true
        useKeyTab=true
        keyTab="/path/to/the/keytab/keytab_file.keytab"
        principal="your@principalHere"
        debug=true;
    };
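    Before referencing them in this file, you can optionally check that the keytab and principal pair is valid with the standard Kerberos client tools, for example:
    # List the principals stored in the keytab, then request a ticket with it.
    klist -kt /path/to/the/keytab/keytab_file.keytab
    kinit -kt /path/to/the/keytab/keytab_file.keytab your@principalHere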
  2. In the <components_catalog>/start.sh file, set the following Java system properties on the THE_CMD line so that it references the previously created <components_catalog>/tcomp_gss.conf file:
    THE_CMD="$JAVA_BIN $SCRIPT_JAVA_OPTS -Djava.security.auth.login.config=/path/to/tcomp_gss.conf -Djava.security.krb5.debug=true
    -Djava.security.krb5.conf=/etc/krb5.conf -Djavax.security.auth.useSubjectCredsOnly=false -cp
    \"$APP_CLASSPATH\" $APP_CLASS $*"
  3. When importing your dataset in Talend Data Preparation, the JDBC URL used to connect to Hive must follow this model:
    jdbc:hive2://host:10000/default;principal=<your_principal>
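    For example, with a hypothetical HiveServer2 host and Hive service principal, the URL could look like this:
    jdbc:hive2://hadoop01.example.com:10000/default;principal=hive/hadoop01.example.com@EXAMPLE.COM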

Results

You can now import Hive or HDFS datasets stored on a multi-node cluster.