Hortonworks - Getting Started
- You have installed and configured a Hortonworks Data Platform (HDP) 2.4 cluster.
You can also use the Hortonworks Sandbox, a downloadable virtual machine (VM).
- You have installed Talend Studio.
- The dataset used (pearsonData.csv) in this article is called Pearson’s Height Data,
named for its creator Karl Pearson who, in the early 1900’s, founded the Mathematical
You can download the Pearson dataset here. Feel free to use your own data, being mindful that aspects of this article will need to be adjusted.
Installing Hortonworks Sandbox
The easiest way to install Hortonworks is to download the Hortonworks Sandbox. Delivered as a virtual machine, this preconfigured environment provides many of the components that make up the Hadoop ecosystem.
The VM is available in two formats: VMware and VirtualBox.
Performing the Hortonworks Post-Installation Steps
Create the OS user
You must create a standard Linux operating system (OS) user in the HDP environment. This user is the owner of all inbound and outbound data from within the HDP cluster.
A standard Linux account named puccini was created and is referenced throughout this article.
Many of the configuration files within the HDP 2.4 Sandbox refer to the VM's domain name, not its IP address. So that Talend can reach the HDP cluster, modify your local hosts file to map the cluster's hostname to its IP address.
- On Windows, modify C:\Windows\System32\drivers\etc\hosts to include the following line:
192.168.132.128 sandbox.hortonworks.com # Your IP is most likely different.
Note: The URL can be an IP address or any other mapped domain name.
The HDP environment is now configured and accessible to Talend.
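As a quick sanity check of the mapping, the following minimal Python sketch parses hosts-file text and looks up the sandbox hostname. The function name `parse_hosts` and the sample text are illustrative assumptions, not part of Talend or HDP; the sample reuses the example entry from this article.

```python
def parse_hosts(text):
    """Return a {hostname: ip} mapping built from hosts-file text."""
    mapping = {}
    for line in text.splitlines():
        # Drop trailing comments and surrounding whitespace.
        entry = line.split("#", 1)[0].strip()
        if not entry:
            continue
        ip, *names = entry.split()
        for name in names:
            # Keep the first mapping found for each hostname.
            mapping.setdefault(name.lower(), ip)
    return mapping


# Sample text mirroring the entry shown in this article (your IP will differ).
sample = "192.168.132.128 sandbox.hortonworks.com # Your IP is most likely different.\n"
print(parse_hosts(sample).get("sandbox.hortonworks.com"))
```

To check your real file, read its contents and pass them to `parse_hosts`; a missing hostname returns `None` from `.get()`.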
Working With HDFS
For the purposes of this article, only Talend Studio is required.
If you do not have Talend Studio installed, use one of the installation options available on the Talend website.
Once installed, start your Talend Studio and create a new project, as described in the How to create a project documentation.
Create the cluster metadata - Hortonworks 2.4
Before you begin
- You have opened Talend Studio.
- In the Repository, expand Metadata and then right-click Hadoop Cluster.
- Select Create Hadoop Cluster from the contextual menu to open the Hadoop Cluster Connection wizard.
Fill in generic information about this connection, such as Name, Purpose and Description, and click Next.
In the Distribution area, select Hortonworks as the distribution and V2.4.0 as the version, and select the option Retrieve configuration from Ambari or Cloudera to import the configuration information directly from that management service.
Then, click Next.
In the Enter Ambari credentials area, provide the Ambari URI, username, and password.
In this example:
- Ambari URI (with port): http://sandbox.hortonworks.com:8080/
- Username: maria_dev (standard Ambari account)
- Password: maria_dev
- Click the Connect button to create the connection from the Studio to Ambari manager, and then click the Fetch button to retrieve and list the configurations of the services running on the HDP cluster.
In the Discovered clusters area, leave the default check boxes selected and click Finish.
Note: The relevant configuration information is then filled in automatically in the next step.
In Define the connection parameters, in the Authentication area, fill in the User name field with the Linux OS user created in an earlier step, in this example puccini.
Click Check Services, then close the Checking Hadoop Services dialog box.
Note: Success is indicated by a 100% connection to both the Namenode and Resource Manager.
- Click Finish to close the Hadoop Cluster Connection wizard.
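The Ambari connection used by the wizard can also be exercised outside the Studio. The sketch below only builds an authenticated request for Ambari's `/api/v1/clusters` REST endpoint, using the example URI and credentials from this article; it is not how the Studio retrieves the configuration internally, and the helper name `build_ambari_request` is an assumption made for illustration.

```python
import base64
import urllib.request


def build_ambari_request(base_uri, username, password):
    """Build a GET request for the Ambari cluster list using HTTP basic auth."""
    url = base_uri.rstrip("/") + "/api/v1/clusters"
    token = base64.b64encode(f"{username}:{password}".encode("utf-8")).decode("ascii")
    request = urllib.request.Request(url)
    request.add_header("Authorization", "Basic " + token)
    return request


# Example values taken from this article; adjust them to your environment.
request = build_ambari_request(
    "http://sandbox.hortonworks.com:8080/", "maria_dev", "maria_dev"
)
print(request.full_url)

# To query a running sandbox (requires network access to the VM):
# with urllib.request.urlopen(request, timeout=10) as response:
#     print(response.read().decode("utf-8"))
```

If the request succeeds, the JSON response lists the clusters managed by that Ambari instance, which should correspond to what the Discovered clusters area shows.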
Create HDFS Metadata - Hortonworks
Before you begin
- You have created a cluster metadata connection.
- In the Repository, right-click your cluster metadata, in this example Hortonworks24_Cluster, and click Create HDFS.
- Name your new connection and click Next.
Click the Check button to verify that your connection is successful.
Note: The User name property is automatically pre-filled with the value inherited from the Hadoop connection you selected in the previous steps, in this example puccini.
- Click Finish to complete the steps.
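If a connection check fails, it can help to confirm that the service ports are reachable from your machine at all. The following is a minimal sketch of a TCP reachability test, not a reproduction of the Studio's own checks. The port numbers in `SERVICES` (8020 for the NameNode, 8050 for the ResourceManager) are assumptions based on common HDP sandbox defaults; confirm them against your cluster configuration.

```python
import socket


def is_port_open(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port can be established."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and name-resolution failures.
        return False


# Assumed default ports for an HDP sandbox; verify against your cluster.
SERVICES = {"NameNode": 8020, "ResourceManager": 8050}


def check_services(host, services=SERVICES):
    """Return a {service name: reachable?} mapping for the given host."""
    return {name: is_port_open(host, port) for name, port in services.items()}


# Example call (requires network access to the VM):
# print(check_services("sandbox.hortonworks.com"))
```

A `False` result for a service usually points to a hosts-file, firewall, or VM networking problem rather than to the Talend metadata itself.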
Write Data to HDFS - Hortonworks
Before you begin
- You have created an HDFS connection object that uses the cluster repository connection you just created.
- In the Studio, create a new Standard Job and add a tHDFSPut component to the design space.
In the Component view of the tHDFSPut component, configure its parameters as follows.
- In the Property Type field, choose Repository and the HDFS connection object you just created.
- Choose a local file to copy to HDFS and specify its destination directory in HDFS.
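For comparison, a file can also be written to HDFS without the Studio through the WebHDFS REST API. The sketch below only constructs the URL for a WebHDFS `CREATE` operation; it does not describe how tHDFSPut works internally. The destination path `/user/puccini/pearsonData.csv` is a hypothetical example, and port 50070 is assumed as the NameNode HTTP port typical of Hadoop 2.x.

```python
from urllib.parse import quote, urlencode


def webhdfs_create_url(host, hdfs_path, user, port=50070, overwrite=False):
    """Build the WebHDFS URL that starts a file upload (op=CREATE).

    hdfs_path must be an absolute HDFS path, for example "/user/name/file.csv".
    """
    query = urlencode(
        {"op": "CREATE", "user.name": user, "overwrite": str(overwrite).lower()}
    )
    return f"http://{host}:{port}/webhdfs/v1{quote(hdfs_path)}?{query}"


# Hypothetical destination path; the user is the OS account from this article.
print(
    webhdfs_create_url(
        "sandbox.hortonworks.com", "/user/puccini/pearsonData.csv", "puccini"
    )
)
```

With WebHDFS, an HTTP PUT to this URL is answered with a redirect to a DataNode, and the file contents are then sent in a second PUT to the redirected location.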