How to use Kerberos in Talend Studio with Big Data
This article describes how to configure Talend Studio to work with a Kerberos-enabled Hadoop cluster.
The Internet is an insecure place. Applications that send an unencrypted password over the network are extremely vulnerable. Companies use firewalls but these assume that "the bad guys" are on the outside, which can be a very bad assumption.
Kerberos is an authentication system that was created by MIT as a solution to these network security problems. It provides mutual protection to both ends of a transaction and the transaction itself. The Kerberos protocol uses strong cryptography so that a client can prove its identity to a server (and vice versa) across an insecure network connection. After a client and server have used Kerberos to prove their identity, they can also encrypt all of their communications to assure privacy and data integrity as they go about their business.
For more information on Kerberos, see MIT Kerberos Documentation.
Kerberos implementation
Kerberos is mature, architecturally sound and meets the requirements of modern distributed systems.
Concepts
The Kerberos protocol is named after Kerberos, the three-headed dog of Greek mythology. The three heads of the Kerberos protocol are:
- the Key Distribution Center (KDC)
- the client user
- the server with the desired service to access.
Key Distribution Center (KDC): A KDC is installed on the network to manage Kerberos security. It performs two service functions: the Authentication Service (AS) and the Ticket-Granting Service (TGS).
Authentication Service (AS): An AS is a network-accessible service that runs in the KDC and authenticates callers.
Ticket-Granting Service (TGS): A TGS grants access to specific services.
Workflow
The following steps describe how a secure session is established between the server and the client.
- The user issues a kinit command from the client to explicitly obtain the Kerberos tickets.
- Once successfully authenticated, the user is granted a Ticket-Granting Ticket (TGT), which is valid for the local domain (realm). The TGT has an expiration period and may be renewed throughout the user's logon session without re-entering the password. The AS sends the TGT, encrypted with a key that only the KDC can decrypt, together with a session key encrypted with a key derived from the user's password. The user then presents the TGT to the TGS portion of the KDC to request access to the service server. The TGS authenticates the user's TGT and creates a service ticket and a session key for both the client and the remote server.
- Once the client user has the client/server service ticket, the user can establish the session with the server service. The server decrypts the ticket, which reaches it indirectly from the TGS by way of the client, using the long-term key it shares with the KDC.
- The service ticket is then used to authenticate the client user and establish a service session between the server and client. After the ticket's lifetime expires, the service ticket must be renewed to continue using the service.
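The exchange described above can be sketched as a small simulation. This is only an illustration under stated assumptions: the realm EXAMPLE.COM, the password, and every function name are made up, the "encryption" is a toy keystream cipher built from the Python standard library, and authenticators and replay protection are left out. It is not how a real KDC is implemented.

```python
import hashlib
import hmac
import json
import os
import time

REALM = "EXAMPLE.COM"  # hypothetical realm name


def derive_key(password, salt):
    """Toy stand-in for Kerberos string-to-key: password -> long-term key."""
    return hashlib.pbkdf2_hmac("sha256", password.encode(), salt.encode(), 10_000)


def _keystream(key, nonce, length):
    """Expand key and nonce into `length` pseudo-random bytes (toy cipher)."""
    out = b""
    counter = 0
    while len(out) < length:
        out += hashlib.sha256(key + nonce + counter.to_bytes(4, "big")).digest()
        counter += 1
    return out[:length]


def seal(key, payload):
    """Encrypt and authenticate a dict. Illustration only, not real cryptography."""
    nonce = os.urandom(16)
    plain = json.dumps(payload).encode()
    cipher = bytes(a ^ b for a, b in zip(plain, _keystream(key, nonce, len(plain))))
    tag = hmac.new(key, nonce + cipher, hashlib.sha256).digest()
    return nonce + tag + cipher


def unseal(key, blob):
    """Reverse seal(); raises ValueError if the key is wrong or the data changed."""
    nonce, tag, cipher = blob[:16], blob[16:48], blob[48:]
    expected = hmac.new(key, nonce + cipher, hashlib.sha256).digest()
    if not hmac.compare_digest(tag, expected):
        raise ValueError("wrong key or tampered message")
    plain = bytes(a ^ b for a, b in zip(cipher, _keystream(key, nonce, len(cipher))))
    return json.loads(plain)


# Long-term keys held by the KDC (all values are made up for the demo).
user_key = derive_key("user-password", "kuser1@" + REALM)  # the user can derive it too
tgs_key = os.urandom(32)       # known only to the KDC
service_key = os.urandom(32)   # shared by the KDC and the service


def as_exchange(user):
    """AS: return the TGT (sealed for the KDC) and a session key sealed for the user."""
    session_key = os.urandom(32).hex()
    tgt = seal(tgs_key, {"user": user, "session_key": session_key,
                         "expires": time.time() + 10 * 3600})
    return tgt, seal(user_key, {"session_key": session_key})


def tgs_exchange(tgt, service):
    """TGS: validate the TGT, then issue a service ticket and a new session key."""
    granted = unseal(tgs_key, tgt)
    if granted["expires"] < time.time():
        raise ValueError("TGT expired")
    service_session = os.urandom(32).hex()
    ticket = seal(service_key, {"user": granted["user"], "service": service,
                                "session_key": service_session})
    for_client = seal(bytes.fromhex(granted["session_key"]),
                      {"session_key": service_session})
    return ticket, for_client


def service_accepts(ticket):
    """Service: decrypt the ticket with its own long-term key and read the user."""
    return unseal(service_key, ticket)["user"]


if __name__ == "__main__":
    tgt, sealed = as_exchange("kuser1")
    tgt_session = bytes.fromhex(unseal(user_key, sealed)["session_key"])
    ticket, sealed = tgs_exchange(tgt, "hdfs")
    shared = unseal(tgt_session, sealed)["session_key"]  # client/service session key
    print("service authenticated:", service_accepts(ticket))
```

Note that the client never decrypts the TGT or the service ticket. It only forwards them, which is why the server can trust a ticket that reaches it through the client.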
Installing and configuring Kerberos client
This section describes how to install the Kerberos client and set up users for secure interactions with your cluster. The examples use a Cloudera distribution of Hadoop.
Installing and configuring Kerberos client on Linux (Job server)
Procedure
Installing and configuring Kerberos client on Windows (Studio)
Procedure
Setting up Kerberos users
Procedure
In the following example, a principal for the Cloudera user is added.
Importing Hadoop metadata
Before connecting Talend Studio to the Hadoop cluster, it is good practice to import the Hadoop metadata, either with the wizard or manually.
Detailed information can be found in Centralizing a Hadoop Connection.
This procedure shows how to import Hadoop metadata for Cloudera using the Import Wizard.
Procedure
Connecting the Client (Studio/Job server) to the cluster
Connections to a Kerberos-enabled cluster can be created using either kinit or keytab. The following sections describe how to perform each setup, and test the connection by writing some data to the cluster.
Ensure that the client machine on which the Talend Studio is installed can recognize the host names of the nodes of the Hadoop cluster to be used. For this purpose, add the IP address/hostname mapping entries for the services of that Hadoop cluster in the hosts file of the client machine.
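As a rough aid for the hosts file step, the sketch below renders mapping entries and reports which required host names an existing hosts file does not yet declare. The IP addresses and node names are hypothetical placeholders, and both function names are invented for this example.

```python
from typing import Dict, List


def format_hosts_entries(nodes: Dict[str, List[str]]) -> List[str]:
    """Render one hosts-file line per IP address: 'IP<TAB>name1 name2'."""
    return [f"{ip}\t{' '.join(names)}" for ip, names in nodes.items()]


def missing_hostnames(hosts_text: str, required: List[str]) -> List[str]:
    """Return the required host names that no line of the hosts file declares."""
    declared = set()
    for line in hosts_text.splitlines():
        fields = line.split("#", 1)[0].split()  # drop comments, split on whitespace
        declared.update(fields[1:])             # the first field is the IP address
    return [name for name in required if name not in declared]


if __name__ == "__main__":
    # Hypothetical cluster nodes; replace with the real addresses and names.
    cluster = {
        "192.0.2.10": ["cdh-node1.example.com", "cdh-node1"],
        "192.0.2.11": ["cdh-node2.example.com", "cdh-node2"],
    }
    print("\n".join(format_hosts_entries(cluster)))
```

The hosts file is usually /etc/hosts on Linux and C:\Windows\System32\drivers\etc\hosts on Windows. Editing it normally requires administrator rights, so the sketch only produces the text to add.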
Connecting to the cluster using kinit
With the 'kinit' method, the user who executes the Job must be authenticated against Kerberos. Below is a sample setup.
Setting up a connection using kinit
Procedure
Results
The NameNode URI field and the NameNode principal field are filled in automatically from the connection information.
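Because the kinit method relies on a ticket that is already in the user's credential cache, it can be useful to check for one before launching a Job. The following is a minimal sketch that assumes the MIT Kerberos klist command is on the PATH; its -s option runs silently and reports the result through the exit status. Other klist implementations, such as the one built into Windows, may not accept this option. The function name is invented for this example.

```python
import shutil
import subprocess


def has_valid_ticket(klist: str = "klist") -> bool:
    """Return True if `klist -s` reports a readable, unexpired credential cache.

    Assumes MIT Kerberos semantics for -s (exit status 0 means valid tickets).
    Returns False when the klist command cannot be found.
    """
    exe = shutil.which(klist)
    if exe is None:
        return False
    result = subprocess.run(
        [exe, "-s"],
        stdin=subprocess.DEVNULL,
        stdout=subprocess.DEVNULL,
        stderr=subprocess.DEVNULL,
    )
    return result.returncode == 0


if __name__ == "__main__":
    if has_valid_ticket():
        print("Kerberos ticket found; the Job can authenticate.")
    else:
        print("No valid ticket; run kinit first.")
```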
Writing data to the secured Hadoop Cluster
In the following example, tFixedFlowInput is used to read records and tHDFSOutput is used to write records to the Hadoop cluster.
Reading Input Data
Procedure
Writing Data to Hadoop
Procedure
Connecting to the cluster using keytab
A keytab is a file containing pairs of Kerberos principals and encrypted keys. These keys are derived from the Kerberos password. You can use this file to log in to Kerberos without being prompted for a password. The most common use of keytab files is to allow scripts to authenticate to Kerberos without human interaction and without storing a password in a plain text file.
With a keytab, the user who executes the Talend Job does not have to run kinit. In addition, depending on the keytab content, the user who executes the Talend Job can impersonate another user. For this reason, a keytab should be stored and transferred securely, and access to it should be limited to the processes that require it.
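One way to limit access on a Linux Job server is to make the keytab readable by its owner only. The sketch below checks and tightens POSIX permission bits. It assumes a POSIX file system (Windows ACLs are not covered), and the function names are invented for this example.

```python
import os
import stat
import tempfile


def keytab_is_private(path: str) -> bool:
    """True if `path` is a regular file with no group or other permission bits."""
    mode = os.stat(path).st_mode
    return stat.S_ISREG(mode) and (mode & (stat.S_IRWXG | stat.S_IRWXO)) == 0


def restrict_to_owner(path: str) -> None:
    """Set the file mode to 0600: read and write for the owner, nothing else."""
    os.chmod(path, stat.S_IRUSR | stat.S_IWUSR)


if __name__ == "__main__":
    # Demo on a throwaway file standing in for a real keytab.
    fd, demo_path = tempfile.mkstemp(suffix=".keytab")
    os.close(fd)
    os.chmod(demo_path, 0o644)
    print("private before:", keytab_is_private(demo_path))
    restrict_to_owner(demo_path)
    print("private after:", keytab_is_private(demo_path))
    os.remove(demo_path)
```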
Below is a sample setup to generate a keytab and to use it in a Talend Job to load data.
Setting up a connection
Procedure
Writing data to the secured Hadoop Cluster
In the following example, tFixedFlowInput is used to read records and tHDFSOutput is used to write records to the Hadoop cluster.
Reading Input Data
Procedure
Setting up Cloudera
This setup creates a home directory for the kuser1 user in Hadoop HDFS. This is done using the Cloudera web-based administration tools.
More information about the Cloudera setup can be found in Mapping Kerberos Principals to Short Names in the Cloudera documentation.
More information about mapping principals to hdfs users can be found in Configuring the Mapping from Kerberos Principals to Short Names in the Cloudera documentation.
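To see why a principal such as kuser1 ends up with an HDFS home directory named after its short name, the sketch below imitates, in simplified form, the behavior of Hadoop's DEFAULT mapping rule: for a principal in the default realm, the short name is the first component of the principal. The realm EXAMPLE.COM and the function name are assumptions made for this example, and custom mapping rules are not modeled.

```python
def default_short_name(principal: str, default_realm: str) -> str:
    """Map a Kerberos principal to a short name, DEFAULT-rule style (simplified).

    'user@REALM' and 'user/instance@REALM' both map to 'user' when REALM is the
    default realm. Principals from other realms are rejected here, because they
    would need an explicit mapping rule.
    """
    name, sep, realm = principal.rpartition("@")
    if not sep:  # no realm given: treat the principal as local to the default realm
        name, realm = principal, default_realm
    if realm != default_realm:
        raise ValueError(f"no default mapping for realm {realm!r}")
    return name.split("/", 1)[0]


if __name__ == "__main__":
    # Hypothetical realm; kuser1 is the user from the setup above.
    user = default_short_name("kuser1@EXAMPLE.COM", "EXAMPLE.COM")
    print(f"short name: {user}, expected home directory: /user/{user}")
```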