Hive is one of the many databases that can be added to the list of data sources available for Talend Data Preparation.
The section Adding a new database type explains how to add new JDBC drivers to enrich the list of databases available from Talend Data Preparation. This example focuses on how to configure a direct connection from Talend Data Preparation to your Hive database. An additional configuration step lets you secure this connection with Kerberos.
Before you begin
Procedure
- In your Components Catalog installation folder, open the file config/settings.xml.

  By default, the Components Catalog installation folder is located at <TDP_installation_folder>/services/.
- Add the Cloudera repository to settings.xml.

  <settings>
    <profiles>
      <profile>
        <id>cloudera</id>
        <activation>
          <activeByDefault>true</activeByDefault>
        </activation>
        <repositories>
          <repository>
            <id>cloudera</id>
            <name>Cloudera repository</name>
            <url>https://repository.cloudera.com/artifactory/cloudera-repos/</url>
            <layout>default</layout>
          </repository>
        </repositories>
      </profile>
    </profiles>
  </settings>
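Once this repository is declared, each mvn: coordinate in the driver configuration is resolved against the standard Maven repository layout. The following sketch only illustrates that layout; the function name and the catalog's internal resolution logic are assumptions, not product code:

```python
# Illustration of the standard Maven repository layout used to resolve
# mvn:<group>/<artifact>/<version> coordinates. This is a sketch, not
# actual Components Catalog code.

def mvn_to_url(coordinate: str, repo: str) -> str:
    """Map an mvn: coordinate to its JAR URL under a Maven repository."""
    group, artifact, version = coordinate.removeprefix("mvn:").split("/")
    return (f"{repo.rstrip('/')}/{group.replace('.', '/')}/"
            f"{artifact}/{version}/{artifact}-{version}.jar")

url = mvn_to_url("mvn:org.apache.hive/hive-jdbc/3.1.2",
                 "https://repository.cloudera.com/artifactory/cloudera-repos/")
print(url)
```

This shows why the repository URL must be reachable from the Components Catalog server: every listed artifact is downloaded from it at driver load time.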
- Open <components_catalog_path>/config/jdbc_config.json and add the Hive driver.

  The compatible versions mentioned are subject to change; use the following code as an example only. For more information, see the Cloudera documentation.

  {
    "id": "Hive",
    "class": "org.apache.hive.jdbc.HiveDriver",
    "url": "jdbc:hive2://host:10000/default;principal=<your principal>",
    "paths": [
      {"path": "mvn:org.apache.hive/hive-jdbc/3.1.2"},
      {"path": "mvn:commons-el/commons-el/1.0"},
      {"path": "mvn:org.datanucleus/datanucleus-core/4.1.6"},
      {"path": "mvn:asm/asm-commons/5.0.1"},
      {"path": "mvn:tomcat/jasper-compiler/5.5.23"},
      {"path": "mvn:org.apache.derby/derby/10.14.1.0"},
      {"path": "mvn:jline/jline/2.12"},
      {"path": "mvn:org.apache.commons/commons-compress/1.11"},
      {"path": "mvn:com.fasterxml.jackson.core/jackson-annotations/2.9.5"},
      {"path": "mvn:org.apache.hive/hive-metastore/2.1.1-cdh6.1.1"},
      {"path": "mvn:org.apache.hive/hive-shims/0.23-2.1.1-cdh6.1.1"},
      {"path": "mvn:org.apache.hive/hive-shims/2.1.1-cdh6.1.1"},
      {"path": "mvn:joda-time/joda-time/2.9.9"},
      {"path": "mvn:org.codehaus.jackson/jackson-mapper-asl/1.9.13-cloudera.1"},
      {"path": "mvn:com.google.code.findbugs/jsr305/3.0.0"},
      {"path": "mvn:org.apache.zookeeper/zookeeper/3.4.6"},
      {"path": "mvn:commons-pool/commons-pool/1.6"},
      {"path": "mvn:org.apache.avro/avro/1.8.2-cdh6.1.1"},
      {"path": "mvn:com.twitter/parquet-hadoop-bundle/1.9.0-cdh6.1.1"},
      {"path": "mvn:com.sun.jersey/jersey-servlet/1.19"},
      {"path": "mvn:commons-dbcp/commons-dbcp/1.4"},
      {"path": "mvn:org.slf4j/slf4j-api/1.7.25"},
      {"path": "mvn:javax.servlet.jsp/jsp-api/2.1"},
      {"path": "mvn:com.codahale.metrics/metrics-jvm/3.1.5"},
      {"path": "mvn:com.thoughtworks.paranamer/paranamer/2.8"},
      {"path": "mvn:tomcat/jasper-runtime/5.5.23"},
      {"path": "mvn:com.fasterxml.jackson.core/jackson-databind/2.9.5"},
      {"path": "mvn:asm/asm-tree/5.0.4"},
      {"path": "mvn:com.codahale.metrics/metrics-core/3.2.1"},
      {"path": "mvn:com.sun.jersey/jersey-core/1.19"},
      {"path": "mvn:org.apache.hive/hive-service/2.1.1-cdh6.1.1"},
      {"path": "mvn:org.jamon/jamon-runtime/2.4.1"},
      {"path": "mvn:com.sun.jersey/jersey-server/1.19"},
      {"path": "mvn:org.apache.commons/commons-lang3/3.8.1"},
      {"path": "mvn:com.codahale.metrics/metrics-json/3.2.1"},
      {"path": "mvn:org.apache.commons/commons-configuration2/2.1.1"},
      {"path": "mvn:org.apache.hive/hive-common/2.1.1-cdh6.1.1"},
      {"path": "mvn:org.apache.curator/curator-client/4.0.0"},
      {"path": "mvn:org.apache.thrift/libfb303/0.9.3"},
      {"path": "mvn:org.apache.thrift/libthrift/0.9.3"},
      {"path": "mvn:net.sf.opencsv/opencsv/2.3"},
      {"path": "mvn:commons-lang/commons-lang/2.6"},
      {"path": "mvn:com.fasterxml.jackson.core/jackson-core/2.9.5"},
      {"path": "mvn:org.tukaani/xz/1.6"},
      {"path": "mvn:com.jolbox/bonecp/0.8.0.RELEASE"},
      {"path": "mvn:org.apache.httpcomponents/httpcore/4.4.10"},
      {"path": "mvn:org.apache.hive/hive-serde/2.1.1-cdh6.1.1"},
      {"path": "mvn:commons-cli/commons-cli/1.4"},
      {"path": "mvn:com.google.guava/guava/14.0.1"},
      {"path": "mvn:org.apache.httpcomponents/httpclient/4.5.6"},
      {"path": "mvn:commons-codec/commons-codec/1.11"},
      {"path": "mvn:org.apache.logging.log4j/log4j-1.2-api/2.17.1"},
      {"path": "mvn:org.apache.ant/ant/1.9.1"},
      {"path": "mvn:org.datanucleus/datanucleus-rdbms/4.1.7"},
      {"path": "mvn:javax.transaction/jta/1.1"},
      {"path": "mvn:commons-logging/commons-logging/1.2"},
      {"path": "mvn:javax.servlet/servlet-api/2.5"},
      {"path": "mvn:org.apache.ant/ant-launcher/1.9.1"},
      {"path": "mvn:net.sf.jpam/jpam/1.1"},
      {"path": "mvn:org.codehaus.jackson/jackson-core-asl/1.9.13"},
      {"path": "mvn:org.datanucleus/datanucleus-api-jdo/4.2.1"},
      {"path": "mvn:org.apache.hive.shims/hive-shims-common/2.1.1-cdh6.1.1"},
      {"path": "mvn:javax.jdo/jdo-api/3.0.1"},
      {"path": "mvn:org.xerial.snappy/snappy-java/1.1.7.1"},
      {"path": "mvn:org.apache.curator/curator-framework/4.0.0"},
      {"path": "mvn:asm/asm/5.0.4"},
      {"path": "mvn:org.apache.hadoop/hadoop-yarn-server-resourcemanager/3.0.0-cdh6.1.1"},
      {"path": "mvn:org.apache.hadoop/hadoop-yarn-server-applicationhistoryservice/3.0.0-cdh6.1.1"},
      {"path": "mvn:org.apache.hadoop/hadoop-annotations/3.0.0-cdh6.1.1"},
      {"path": "mvn:org.apache.hadoop/hadoop-yarn-common/3.0.0-cdh6.1.1"},
      {"path": "mvn:org.apache.hadoop/hadoop-yarn-api/3.0.0-cdh6.1.1"},
      {"path": "mvn:org.apache.hadoop/hadoop-yarn-server-common/3.0.0-cdh6.1.1"},
      {"path": "mvn:org.apache.hadoop/hadoop-yarn-server-web-proxy/3.0.0-cdh6.1.1"},
      {"path": "mvn:org.apache.hadoop/hadoop-common/3.0.0-cdh6.1.1"},
      {"path": "mvn:org.apache.hadoop/hadoop-auth/3.0.0-cdh6.1.1"},
      {"path": "mvn:org.apache.hive/hive-service-rpc/2.1.1-cdh6.1.1"},
      {"path": "mvn:com.fasterxml.woodstox/woodstox-core/5.1.0"},
      {"path": "mvn:org.codehaus.woodstox/stax2-api/3.1.4"}
    ]
  }
- Open <components_catalog_path>/start.sh and add the javax.security.auth.useSubjectCredsOnly=false system property to the launch command:

  THE_CMD="$JAVA_BIN $JAVA_OPTS -Djavax.security.auth.useSubjectCredsOnly=false -cp \"$APP_CLASSPATH\" $APP_CLASS $*"
- Open <components_catalog_path>/config/application.properties and configure the krb5.config property to point to the location of the krb5.conf file on the machine where the Components Catalog server is installed.

  Example:

  krb5.config=/etc/krb5.conf
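If the machine does not already have a krb5.conf, the file generally follows the standard MIT Kerberos layout. The realm and KDC host below are placeholders for your environment, not values from this procedure:

```ini
# Minimal krb5.conf sketch; EXAMPLE.COM and kdc.example.com are placeholders.
[libdefaults]
    default_realm = EXAMPLE.COM

[realms]
    EXAMPLE.COM = {
        kdc = kdc.example.com
        admin_server = kdc.example.com
    }
```

The default_realm must match the realm of the principal used in the JDBC URL.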
- Create a file named sun.conf in <components_catalog_path>/config/org/talend/daikon/sandbox/properties/.

  This file is needed to allow the Hive component to access specific system properties.

  Warning: If the directories in the path org/talend/daikon/sandbox/properties/ do not exist under <components_catalog_path>/config, create them.
- Add the following content in sun.conf.

  #
  # This file contains all Sun/Oracle specific system properties
  #
  java.runtime.name
  sun.boot.library.path
  java.vm.version
  java.vm.vendor
  java.vendor.url
  path.separator
  java.vm.name
  file.encoding.pkg
  sun.java.launcher
  user.country
  sun.os.patch.level
  java.vm.specification.name
  user.dir
  java.runtime.version
  java.awt.graphicsenv
  java.endorsed.dirs
  os.arch
  java.io.tmpdir
  line.separator
  java.vm.specification.vendor
  os.name
  sun.jnu.encoding
  java.library.path
  java.specification.name
  java.class.version
  sun.management.compiler
  os.version
  user.home
  user.timezone
  java.awt.printerjob
  idea.launcher.bin.path
  file.encoding
  java.specification.version
  java.class.path
  user.name
  java.vm.specification.version
  sun.java.command
  java.home
  sun.arch.data.model
  user.language
  java.specification.vendor
  java.vm.info
  java.version
  java.ext.dirs
  sun.boot.class.path
  java.vendor
  file.separator
  java.vendor.url.bug
  sun.io.unicode.encoding
  sun.cpu.endian
  sun.desktop
  sun.cpu.isalist
  java.security.krb5.conf
  sun.security.krb5.debug
  java.security.krb5.kdc
  java.security.krb5.realm
  java.security.auth.login.config
  javax.security.auth.useSubjectCredsOnly
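The file above is a whitelist of bare system property names, one per line, not key=value assignments. Assuming that format, a hypothetical helper (the function name and rules are illustrative, not product code) can catch entries that would not be read as property names:

```python
# Hypothetical check that sun.conf is a plain whitelist: one system property
# name per line, '#' comments, and no key=value assignments.

def check_sun_conf(text: str) -> list[str]:
    """Return the lines that do not look like bare property names."""
    bad = []
    for line in text.splitlines():
        stripped = line.strip()
        if not stripped or stripped.startswith("#"):
            continue                  # blank lines and comments are fine
        if "=" in stripped or " " in stripped:
            bad.append(stripped)      # whitelist entries carry no values
    return bad

sample = "# comment\njava.security.krb5.conf\nsun.security.krb5.debug=true\n"
print(check_sun_conf(sample))  # flags the '=true' entry
```

A property listed with a value would silently fail to whitelist anything, so this kind of check is cheap insurance before restarting the service.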
- Restart the Components Catalog service.
Results
In Talend Data Preparation, the Hive database is now available in the database dataset import form, in the Database type drop-down list.
When exporting a preparation made on data stored in your Hive database, you can choose to process the data on the Talend Data Preparation server.
For more information on how to import data from a database, see Adding a dataset from a database.