Hortonworks - Getting Started
Prerequisites
- You have installed and configured a Hortonworks Data Platform (HDP) 2.4 cluster.
You can also use the Hortonworks Sandbox, a downloadable virtual machine (VM).
- You have installed Talend Studio.
- The dataset used in this article (pearsonData.csv) is Pearson’s Height Data, named for its creator, Karl Pearson, who founded the discipline of mathematical statistics in the early 1900s.
You can download the Pearson dataset here. You can also use your own data, but some steps in this article will then need to be adjusted to match it.
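If you substitute your own data, it helps to look at the header and the first few rows before defining a schema in Talend. The following is a minimal sketch using only the Python standard library. The function name preview_csv is hypothetical, and the sketch assumes a comma-delimited, UTF-8 file with a header row, which may not match your file.

```python
import csv
from itertools import islice


def preview_csv(path, rows=5):
    """Return (header, first_rows) for a comma-delimited text file.

    Assumes the first line is a header row; adjust if your file has none.
    """
    with open(path, newline="", encoding="utf-8") as handle:
        reader = csv.reader(handle)
        header = next(reader, [])               # empty list if the file is empty
        first_rows = list(islice(reader, rows)) # at most `rows` data rows
    return header, first_rows
```

For example, calling preview_csv("pearsonData.csv") returns the column names and up to five data rows, which you can compare against the columns you plan to declare in Talend Studio.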
Installing Hortonworks Sandbox
The easiest way to install Hortonworks is to download the Hortonworks Sandbox. Delivered as a virtual machine, this fully configured environment provides many of the features of the Hadoop ecosystem.
The VM is available in two formats: VMware and VirtualBox.
Performing the Hortonworks Post-Installation Steps
Create the OS user
You must create a standard Linux operating system (OS) user in the HDP environment. This user owns all data moved into and out of the HDP cluster.
This article uses a standard Linux account named puccini throughout.
Many of the configuration files within the HDP 2.4 Sandbox refer to the VM’s domain name, not its IP address. So that Talend can communicate with the HDP cluster, modify your local hosts file to map the cluster’s host name to its IP address.
- On Windows, modify C:\Windows\System32\drivers\etc\hosts to include
192.168.132.128 sandbox.hortonworks.com # Your IP is most likely different.
Note: The URL can be an IP address or any other mapped domain name.
The HDP environment is now configured and accessible to Talend.
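To confirm the hosts entry before moving on, you can check it from a script. The following is a minimal sketch using only the Python standard library. The helper names parse_hosts and resolves_to are hypothetical, and the IP address is the example value from this article, so yours will most likely differ.

```python
import socket


def parse_hosts(text):
    """Return a {hostname: ip} mapping parsed from hosts-file text."""
    mapping = {}
    for line in text.splitlines():
        # Drop any trailing comment, then surrounding whitespace.
        entry = line.split("#", 1)[0].strip()
        if not entry:
            continue
        ip, *names = entry.split()
        for name in names:
            # Keep the first mapping found for a name.
            mapping.setdefault(name.lower(), ip)
    return mapping


def resolves_to(hostname, expected_ip):
    """Return True if the OS resolver maps hostname to expected_ip."""
    try:
        return socket.gethostbyname(hostname) == expected_ip
    except OSError:
        # Covers socket.gaierror when the name cannot be resolved.
        return False


sample = "192.168.132.128 sandbox.hortonworks.com # Your IP is most likely different."
print(parse_hosts(sample)["sandbox.hortonworks.com"])  # prints 192.168.132.128
```

Once the hosts file has been saved, resolves_to("sandbox.hortonworks.com", "192.168.132.128") should return True on your machine. If it returns False, recheck the file path and the IP address of the VM.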
Working With HDFS
Prerequisites
For the purposes of this article, only Talend Studio is required.
If you do not already have Talend Studio installed, follow one of the installation options available on the Talend website.
Once it is installed, start Talend Studio and create a new project, as described in the How to create a project documentation.
Create the cluster metadata - Hortonworks 2.4
Before you begin
- You have opened Talend Studio.
Procedure
Create HDFS Metadata - Hortonworks
Before you begin
- You have created a cluster metadata connection.
Procedure
Write Data to HDFS - Hortonworks
Before you begin
- You have created an HDFS connection object that uses the cluster repository connection you just created.