How to use Kerberos in Talend Studio with Big Data v6.x

EnrichVersion
6.5
EnrichProdName
Talend Open Studio for Big Data
Talend Big Data
Talend Big Data Platform
Talend Data Fabric
Talend Real-Time Big Data Platform
task
Design and Development > Designing Jobs > Hadoop distributions
Design and Development > Third-party systems > Authentication components > Kerberos components
Data Governance > Third-party systems > Authentication components > Kerberos components
Data Quality and Preparation > Third-party systems > Authentication components > Kerberos components
EnrichPlatform
Talend Studio

How to use Kerberos in Talend Studio with Big Data v6.x

This article describes how to configure Talend Studio to work with a Kerberos-enabled Hadoop cluster.

The Internet is an insecure place. Applications that send an unencrypted password over the network are extremely vulnerable. Companies use firewalls, but firewalls assume that "the bad guys" are on the outside, which can be a very bad assumption.

Kerberos is an authentication system created by MIT as a solution to these network security problems. It provides mutual authentication for both ends of a transaction and protects the transaction itself. The Kerberos protocol uses strong cryptography so that a client can prove its identity to a server (and vice versa) across an insecure network connection. After a client and server have used Kerberos to prove their identity, they can also encrypt all of their communications to assure privacy and data integrity as they go about their business.

For more information on Kerberos, see MIT Kerberos Documentation.

Kerberos implementation

Kerberos is mature, architecturally sound and meets the requirements of modern distributed systems.

Concepts

The Kerberos protocol is named after the three-headed dog of Greek mythology, Kerberos. The three heads of the protocol are:

  • the Key Distribution Center (KDC)

  • the client user

  • the server with the desired service to access.

Key Distribution Center (KDC): A KDC is installed on the network to manage Kerberos security. It performs two service functions: the Authentication Service (AS) and the Ticket-Granting Service (TGS).

Authentication Service (AS): An AS is a network-accessible service which runs in the KDC, and which is used to authenticate callers.

Ticket-Granting Service (TGS): A TGS grants access to specific services.

Workflow

The following steps illustrate the workflow for establishing a secure session between the client and the server.

  1. The user issues a kinit command from the client to explicitly obtain the Kerberos tickets.

  2. Once successfully authenticated, the user is granted a Ticket-Granting Ticket (TGT), which is valid for the local domain (realm). The TGT has an expiration period and may be renewed throughout the user's logon session without re-entering the password. The AS sends back the TGT, encrypted with a key that only the KDC can decrypt, along with a session key encrypted with the user's password hash. The user then presents the TGT to the TGS portion of the KDC to request access to the service server. The TGS authenticates the user's TGT and creates a ticket and session key for both the client and the remote server.

  3. Once the client user has the client/server service ticket, the user can establish a session with the server's service. The server can decrypt the information coming indirectly from the TGS using its own long-term key shared with the KDC.

  4. The service ticket is then used to authenticate the client user and establish a service session between the server and client. After the ticket's lifetime is exceeded, the service ticket must be renewed to continue using the service.
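
For illustration, here is a sketch of the client's credential cache after this workflow completes, as displayed by the klist command; the principal names, realm, and timestamps are hypothetical:

    $ klist
    Ticket cache: FILE:/tmp/krb5cc_1000
    Default principal: kuser1@EXAMPLE.COM

    Valid starting       Expires              Service principal
    06/01/2018 10:00:00  06/01/2018 20:00:00  krbtgt/EXAMPLE.COM@EXAMPLE.COM
    06/01/2018 10:05:00  06/01/2018 20:00:00  nn/quickstart.cloudera@EXAMPLE.COM

The krbtgt entry is the TGT obtained in steps 1 and 2; the nn entry is a service ticket for the HDFS NameNode, obtained in steps 3 and 4.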

Installing and configuring Kerberos client

This section describes how to install the Kerberos client and set up users for secure interactions with the cluster. In the examples, a Cloudera distribution of Hadoop is used.

Installing and configuring Kerberos client on Linux (Job server)

This procedure shows how to install and configure the Kerberos client on Linux.

Procedure

  1. Run yum install krb5-workstation to install the Kerberos client.
  2. Update the security policies of the JRE to be used, applying the JCE policy patch from Oracle's Java download site.

    This update is necessary due to U.S. import control restrictions. For further information, consult the README.txt file included in the download.

    To make this update, back up the original JCE policy files (US_export_policy.jar and local_policy.jar in JAVA_HOME/lib/security) and then replace them with the corresponding policy files contained in the downloaded jce_policy-6.0.zip file.

  3. Configure Kerberos on the local machine by editing the /etc/krb5.conf file.

    A sample file can be found in this MIT configuration doc. More information about the parameters can be found in this MIT documentation.
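
For reference, a minimal /etc/krb5.conf might look like the following sketch; the realm CLOUDERA.COM and the KDC host quickstart.cloudera are assumptions that match the examples later in this article:

    [libdefaults]
        default_realm = CLOUDERA.COM

    [realms]
        CLOUDERA.COM = {
            kdc = quickstart.cloudera
            admin_server = quickstart.cloudera
        }

    [domain_realm]
        .cloudera = CLOUDERA.COM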

Installing and configuring Kerberos client on Windows (Studio)

This procedure shows how to install and configure the Kerberos client on Windows.

Procedure

  1. Download Kerberos for Windows release 4.0.1 from this MIT distribution site and install it.
  2. Update the security policies of the JRE to be used, applying the JCE policy patch from Oracle's Java download site.

    This update is necessary due to U.S. import control restrictions. For further information, consult the README.txt file included in the download.

    To make this update, back up the original JCE policy files (US_export_policy.jar and local_policy.jar in JAVA_HOME\lib\security) and then replace them with the corresponding policy files contained in the downloaded jce_policy-6.0.zip file.

  3. Configure Kerberos on a local machine by setting up the krb5.ini file.

    A sample file can be found in this MIT configuration doc.

    The file location can be specified with the system property java.security.krb5.conf. If this property is not set, Java tries to locate the file in the following locations, in order:

    1. %JAVA_HOME%/lib/security/krb5.conf

    2. %WINDOWS_ROOT%/krb5.ini

    Alternatively, the system properties java.security.krb5.realm and java.security.krb5.kdc can be used.
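
    For example, to point the JVM at a specific configuration file, or to set the realm and KDC directly, pass the properties on the command line; the path, host name, and class name below are hypothetical:

    java -Djava.security.krb5.conf=C:\kerberos\krb5.ini MyTalendJob
    java -Djava.security.krb5.realm=CLOUDERA.COM -Djava.security.krb5.kdc=quickstart.cloudera MyTalendJob

    Note that java.security.krb5.realm and java.security.krb5.kdc must be set together.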

Setting up Kerberos users

After Kerberos is installed and configured on the cluster, the Cluster Administrator needs to create client users.

Procedure

Set up client users as principals with long-term keys in the KDC, using the kadmin.local command-line interface.

In the following example, a principal for the cloudera user is added, as sketched below.
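
A minimal sketch of such a session, assuming the realm CLOUDERA.COM:

    [root@quickstart cloudera]# kadmin.local
    kadmin.local:  addprinc cloudera@CLOUDERA.COM
    Enter password for principal "cloudera@CLOUDERA.COM":
    Re-enter password for principal "cloudera@CLOUDERA.COM":
    Principal "cloudera@CLOUDERA.COM" created.
    kadmin.local:  quit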

Importing Hadoop metadata

Before connecting Talend Studio to the Hadoop cluster, it is good practice to import the Hadoop metadata, either using the wizard or manually.

Detailed information can be found in Centralizing a Hadoop Connection.

This procedure shows how to import Hadoop metadata for Cloudera using the Import Wizard.

Procedure

  1. Select the Retrieve configuration from Ambari or Cloudera option to open the Cloudera Manager Configuration window.
  2. Enter the Manager URI, click Connect, and then click Fetch.

Connecting the Client (Studio/Job server) to the cluster

When Talend Studio is used to request services from a Kerberos-enabled Hadoop distribution, you need to configure the connection to that Hadoop distribution in the Studio. This enables the Studio to use your Kerberos Ticket-Granting Ticket (TGT) to perform transactions with that secured Hadoop distribution.

Connections to a Kerberos-enabled cluster can be created using either kinit or keytab. The following sections describe how to perform each setup, and test the connection by writing some data to the cluster.

Connecting to the cluster using kinit

With the kinit method, the user who executes the Job must be authenticated against Kerberos. Below is a sample setup.

Setting up a connection using kinit

Procedure

  1. On the client machine, execute the kinit command to get a ticket-granting ticket (TGT). When prompted, enter the password you set when creating the principal of the client user.
    Tip: For further information about this command, see Obtaining tickets with kinit in the MIT Kerberos documentation.

    When the password is correct, a ticket is generated and stored on the client machine.
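
    A minimal sample session, assuming the cloudera principal created earlier in the CLOUDERA.COM realm:

    $ kinit cloudera@CLOUDERA.COM
    Password for cloudera@CLOUDERA.COM:

    You can verify the resulting ticket with the klist command.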

  2. In the Integration perspective of the Talend Studio, create an empty Job.
  3. Drop a tHDFSConnection component into the workspace; it will be used to connect the Studio to the Hadoop cluster.
  4. Double-click tHDFSConnection to open its Component view.
  5. Select Repository from the Property Type drop-down list and click the […] button to select the HDFS connection created in the earlier step.

Results

The NameNode URI and NameNode principal fields are automatically filled in from the connection information.

Writing data to the secured Hadoop Cluster

To confirm that your connection is configured correctly, test writing to the Hadoop cluster.

In the following example, tFixedFlowInput is used to generate records and tHDFSOutput is used to write them to the Hadoop cluster.

Reading Input Data

Procedure

  1. In the Job created in the previous section, drop tFixedFlowInput and tHDFSOutput components.
  2. Connect the tHDFSConnection you configured in the previous section to tFixedFlowInput using a Trigger > On Subjob Ok link.
  3. Connect tFixedFlowInput to tHDFSOutput using a Row > Main link.
  4. Double-click tFixedFlowInput to open its Component view.
  5. Select Use Inline Content and, in the Field Separator field, enter a semicolon (;).
  6. In the Content panel, paste the following records as test data.
    1; Texas
    2; California
    3; Illinois
    4; New York
    5; Florida
  7. Click Edit schema to define the schema of the input data.
  8. Click the [+] button twice to add two rows to the schema editor and rename them to id and name, respectively.
  9. Click OK to validate these changes and accept the propagation prompted by the pop-up dialog box.

Writing Data to Hadoop

Procedure

  1. Double-click tHDFSOutput to open its Component view.
  2. Select the Use an existing connection check box to reuse the connection to Hadoop you have defined in the previous section.
  3. In the File Name field, enter the path to, or browse to, the directory in HDFS in which you want to write the data. This directory is created if it does not already exist in HDFS.

    In this step, if you are able to navigate the HDFS file system, the secured connection you defined in the previous sections is working.

  4. From the Type list, select Text File.
  5. From the Action list, select Create.
  6. In the Field Separator field, enter \t.
  7. Press F6 to execute the Job, and view the result in Hue, Cloudera's web console for the HDFS service.

Connecting to the cluster using keytab

As an alternative to kinit, you can also use a keytab file to load data.

A keytab is a file containing pairs of Kerberos principals and encrypted keys derived from the Kerberos password. You can use this file to log in to Kerberos without being prompted for a password. The most common use of keytab files is to allow scripts to authenticate to Kerberos without human interaction and without storing a password in a plain-text file.

With a keytab, the user who executes the Talend Job doesn't have to run kinit. In addition, depending on the keytab content, the user who executes the Talend Job can impersonate another user. For this reason, keytab files should be transferred securely on the file system, and access should be limited to only those processes that require them.
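
For illustration, the principals stored in a keytab can be listed with the klist -k command; the path and principal shown here are the ones used later in this article:

    $ klist -k /home/kuser1/kuser1.keytab
    Keytab name: FILE:/home/kuser1/kuser1.keytab
    KVNO Principal
    ---- --------------------------------------------
       2 kuser1@CLOUDERA.COM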

Below is a sample setup to generate a keytab and to use it in a Talend Job to load data.

Setting up a connection

Procedure

  1. Create a keytab for the user kuser1. You can use the kadmin.local command-line interface to add principals and create keytab files, as shown below.
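
    A minimal sketch of such a session, assuming the realm CLOUDERA.COM and /tmp/kuser1.keytab as the output path:

    [root@quickstart cloudera]# kadmin.local
    kadmin.local:  addprinc kuser1@CLOUDERA.COM
    kadmin.local:  ktadd -k /tmp/kuser1.keytab kuser1@CLOUDERA.COM
    kadmin.local:  quit

    Note that ktadd randomizes the principal's keys by default, which invalidates the existing password; with kadmin.local, the -norandkey option keeps the current keys.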
  2. Retrieve the keytab file onto the client machine, either with the Talend tSCPGet component or manually with the UNIX cp command.

    In this example, the cp command is used to move the keytab file to the UNIX home directory of the kuser1 user, where that user has read access to the keytab file.

    [root@quickstart cloudera]# cp /tmp/kuser1.keytab /home/kuser1
  3. In the Integration perspective of the Talend Studio, create an empty Job.
  4. Drop a tHDFSConnection component into the workspace; it will be used to connect the Studio to the Hadoop cluster.
  5. Double-click tHDFSConnection to open its Component view.
  6. In the Version area, select the distribution to connect to.

    In this scenario, select Cloudera from the Distribution list and Cloudera CDH5.4 from the Hadoop version list.

  7. In the NameNode URI field, enter the location of the NameNode. If you are using WebHDFS, the location should be webhdfs://masternode:portnumber. If WebHDFS is secured with SSL, the scheme should be swebhdfs, and you need to use a tLibraryLoad component in the Job to load the library required by the secured WebHDFS.

    In this example, it is hdfs://quickstart.cloudera:8020.

  8. Select Use Kerberos authentication and enter nn/_HOST@EXAMPLE.COM in the NameNode principal field.
    Tip: You can find the NameNode principal value in the hdfs-site.xml file of the cluster you are using.
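
    For reference, the relevant property in hdfs-site.xml typically looks like the following; the realm shown is an example:

    <property>
      <name>dfs.namenode.kerberos.principal</name>
      <value>nn/_HOST@EXAMPLE.COM</value>
    </property>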
  9. Select the Use a keytab to authenticate check box.
  10. In the Principal field, enter the principal name of the keytab file.
    It is kuser1 in this example.
  11. In the Keytab field, enter the path to, or browse to, the keytab file retrieved earlier in this procedure.

Writing data to the secured Hadoop Cluster

To confirm that your connection is configured correctly, test writing to the Hadoop cluster.

In the following example, tFixedFlowInput is used to generate records and tHDFSOutput is used to write them to the Hadoop cluster.

Reading Input Data

Procedure

  1. In the Job created in the previous section, drop tFixedFlowInput and tHDFSOutput components.
  2. Connect the tHDFSConnection you configured in the previous section to tFixedFlowInput using a Trigger > On Subjob Ok link.
  3. Connect tFixedFlowInput to tHDFSOutput using a Row > Main link.
  4. Double-click tFixedFlowInput to open its Component view.
  5. Select Use Inline Content and, in the Field Separator field, enter a semicolon (;).
  6. In the Content panel, paste the following records as test data.
    1; Texas
    2; California
    3; Illinois
    4; New York
    5; Florida
  7. Click Edit schema to define the schema of the input data.
  8. Click the [+] button twice to add two rows to the schema editor and rename them to id and name, respectively.
  9. Click OK to validate these changes and accept the propagation prompted by the pop-up dialog box.

Setting up Cloudera

This setup creates a home directory for the kuser1 user in Hadoop HDFS, using the Cloudera web-based administration tools.

More information about the Cloudera setup can be found in Mapping Kerberos Principals to Short Names in the Cloudera documentation.

More information about mapping principals to hdfs users can be found in Configuring the Mapping from Kerberos Principals to Short Names in the Cloudera documentation.

Procedure

  1. Log in to Hue, create a new user called kuser1, and add default as the group.
  2. Using the File Browser, create a home directory for the kuser1 user in the /user directory.

    In this example, the directory /user/kuser1/ is used as the target directory.

  3. Log in to Cloudera Manager, click hdfs service > Configuration, and type Kerberos Realms in the search box to display two properties in the Service-Wide/Security category.
  4. In Additional Rules to Map Kerberos Principals to Short Names, enter the following rule:
    RULE:[1:$1@$0](kuser1@CLOUDERA.COM)s/.*/kuser1/

    In this rule, [1:$1@$0] rewrites a single-component principal into the form name@REALM, the parenthesized expression matches the principal kuser1@CLOUDERA.COM, and s/.*/kuser1/ replaces the match with the short name kuser1.
  5. Click Save Changes.

Writing Data to Hadoop

Procedure

  1. Double-click tHDFSOutput to open its Component view.
  2. Select the Use an existing connection check box to reuse the connection to Hadoop you have defined in the previous section.
  3. In the File Name field, enter the path "/user/kuser1/states". This directory is created if it does not already exist in HDFS.

    In this step, if you are able to navigate the HDFS file system, the secured connection you defined in the previous sections is working.

  4. From the Type list, select Text File.
  5. From the Action list, select Create.
  6. In the Field Separator field, enter \t.
  7. Press F6 to execute the Job, and view the result in Hue, Cloudera's web console for the HDFS service.