Hortonworks - Getting Started

author: Louis Frolio
EnrichVersion: 6.4, 6.3, 6.2, 6.1, 6.0
EnrichProdName: Talend Big Data Platform, Talend Data Fabric, Talend Real-Time Big Data Platform, Talend Open Studio for Big Data, Talend Big Data
task: Design and Development > Designing Jobs > Hadoop distributions > Hortonworks
EnrichPlatform: Talend Studio

Hortonworks - Getting Started

This article demonstrates how to get started with Hortonworks 2.4.

Prerequisites

  • You have installed and configured a Hortonworks Data Platform (HDP) 2.4 cluster.

    You can also use the Hortonworks Sandbox, a downloadable virtual machine (VM).

  • You have installed Talend Studio.
  • The dataset used in this article (pearsonData.csv) is Pearson’s Height Data, named for its creator Karl Pearson, who founded the discipline of mathematical statistics in the early 1900s.

    You can download the Pearson dataset here. Feel free to use your own data, being mindful that aspects of this article will need to be adjusted.

Installing Hortonworks Sandbox

The easiest way to install Hortonworks is to download the fully configured Hortonworks Sandbox. Delivered as a virtual machine, this environment provides a wealth of features from the Hadoop ecosystem.

The VM is available in two formats: VMware and VirtualBox.

Note: The steps demonstrated in this article aren’t limited to the HDP Sandbox. They can also be applied to a typical deployment of Hortonworks 2.4.

Performing the Hortonworks Post-Installation Steps

Create the OS user

You must create a standard Linux operating system (OS) user in the HDP environment. This user owns all data moving into and out of the HDP cluster.

A standard Linux account named puccini was created for this purpose and is referenced throughout this article.

Modify the hosts file

Many of the configuration files in the HDP 2.4 Sandbox reference the VM’s host name, not its IP address. To allow Talend to communicate with the HDP cluster, modify your local hosts file to map the cluster’s host name to its IP address.

  • On Windows, modify C:\Windows\System32\drivers\etc\hosts to include 192.168.132.128 sandbox.hortonworks.com # Your IP is most likely different.
    Note: The URL can be an IP address or any other mapped domain name.

The HDP environment is now configured and accessible to Talend.

Working With HDFS

Prerequisites

For the purposes of this article, only Talend Studio is required.

If you do not already have Talend Studio installed, follow one of the installation options available on the Talend website.

Once it is installed, start Talend Studio and create a new project, as described in the How to create a project documentation.

Create the cluster metadata - Hortonworks 2.4

Before you begin

  • You have opened Talend Studio.

Procedure

  1. In the Repository, expand Metadata and then right-click Hadoop Cluster.
  2. Select Create Hadoop Cluster from the contextual menu to open the [Hadoop Cluster Connection] wizard.
  3. Fill in generic information about this connection, such as Name, Purpose, and Description, and click Next.
  4. In the Distribution area, select Hortonworks as the distribution, V2.4.0 as the version, and select the option Retrieve configuration from Ambari or Cloudera to import the configuration information directly from that management service. Then, click Next.
  5. In the Enter Ambari credentials area, provide the required information.
    In this example:
    • Ambari URI (with port): http://sandbox.hortonworks.com:8080/
    • Username: maria_dev (standard Ambari account)
    • Password: maria_dev
  6. Click the Connect button to create the connection from the Studio to the Ambari manager, and then click the Fetch button to retrieve and list the configurations of the services running on the HDP cluster.
  7. In the Discovered clusters area, leave the default check boxes selected and click Finish.
    Note: The relevant configuration information is then filled in automatically in the next step.
  8. In Define the connection parameters, in the Authentication area, fill in the User name field with the Linux OS user created in an earlier step, puccini.
  9. Click Check Services, then close the Checking Hadoop Services dialog box.
    Note: Success is indicated by a 100% connection to both the NameNode and the Resource Manager. A rough programmatic equivalent of this check is sketched after this procedure.
  10. Click Finish to close the [Hadoop Cluster Connection] wizard.
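
For orientation only, the following is a minimal sketch, written against the plain Hadoop Java API, of the kind of connection parameters the wizard retrieves from Ambari and of what Check Services validates for the NameNode. This is not the code the Studio generates, and the host and ports (NameNode on 8020, ResourceManager on 8050) are assumptions based on the default HDP 2.4 Sandbox layout; adjust them to the values fetched for your cluster.

    // Minimal sketch of the parameters behind the cluster connection.
    // Host and ports are assumptions (HDP 2.4 Sandbox defaults); adjust as needed.
    import java.net.URI;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;

    public class CheckHdpConnection {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            conf.set("fs.defaultFS", "hdfs://sandbox.hortonworks.com:8020");          // NameNode URI
            conf.set("yarn.resourcemanager.address", "sandbox.hortonworks.com:8050"); // ResourceManager

            // Connect as the OS user created earlier (puccini); this is roughly
            // the NameNode part of what Check Services verifies.
            FileSystem fs = FileSystem.get(new URI(conf.get("fs.defaultFS")), conf, "puccini");
            System.out.println("Connected to " + fs.getUri());
            fs.close();
        }
    }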

Create HDFS Metadata - Hortonworks

Before you begin

  • You have created a cluster metadata connection.

Procedure

  1. In the Repository, right-click your cluster metadata, in this example Hortonworks24_Cluster, and click Create HDFS.
  2. Name your new connection and click Next.
  3. Click the Check button to verify that your connection is successful. A minimal equivalent of this check is sketched after this procedure.
    Note: The User name property is pre-filled with the value inherited from the Hadoop connection you selected in the previous steps, in this example puccini.
  4. Click Finish to complete the steps.
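
For reference, the Check performed on the HDFS connection corresponds roughly to the following minimal sketch: connect to HDFS as puccini and confirm that the user's home directory is reachable. The host and port are again assumptions for the HDP 2.4 Sandbox; adjust them to your cluster.

    // Rough equivalent of the HDFS metadata Check (assumed Sandbox host and port).
    import java.net.URI;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class CheckHdfsConnection {
        public static void main(String[] args) throws Exception {
            FileSystem fs = FileSystem.get(
                    new URI("hdfs://sandbox.hortonworks.com:8020"), new Configuration(), "puccini");

            Path home = fs.getHomeDirectory();   // typically /user/puccini
            System.out.println(home + " exists: " + fs.exists(home));
            fs.close();
        }
    }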

Write Data to HDFS - Hortonworks

Before you begin

  • You have created an HDFS connection object that leverages the cluster metadata connection created in the previous section.

Procedure

  1. In the Studio, create a new Standard Job and add a tHDFSPut component to the design space.
  2. In the Component view of the tHDFSPut component, configure its parameters as follows.
    • In the Property Type field, select Repository and then the HDFS connection object you created earlier.
    • Choose a local file to copy to HDFS and specify its destination path, as sketched below.
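
Conceptually, tHDFSPut copies a local file into HDFS, much like the following minimal sketch against the Hadoop Java API. The local path is hypothetical, and the host, port, and destination path are assumptions for the HDP 2.4 Sandbox; the Studio component generates its own code, so this is only meant to clarify the operation.

    // What tHDFSPut does in essence: copy a local file into HDFS.
    // Paths, host, and port are assumptions; adjust them to your environment.
    import java.net.URI;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class PutPearsonData {
        public static void main(String[] args) throws Exception {
            FileSystem fs = FileSystem.get(
                    new URI("hdfs://sandbox.hortonworks.com:8020"), new Configuration(), "puccini");

            Path localFile = new Path("C:/data/pearsonData.csv");        // local source (hypothetical path)
            Path hdfsTarget = new Path("/user/puccini/pearsonData.csv"); // HDFS destination

            fs.copyFromLocalFile(localFile, hdfsTarget);
            System.out.println("Copied: " + fs.exists(hdfsTarget));
            fs.close();
        }
    }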