Cloudera - Getting Started
Prerequisites
- You have installed and configured a Cloudera 5.7 cluster (CDH).
- You have installed Talend Studio.
- The dataset used in this article (pearsonData.csv) is called Pearson’s Height Data, named for its creator Karl Pearson who, in the early 1900s, founded the discipline of mathematical statistics.
The Pearson dataset can be downloaded here. Feel free to use your own data, but note that parts of this article will then need to be adjusted.
Installing the Cloudera Quickstart VM
The easiest way to get started is to install the Cloudera Quickstart VM, which is delivered as a fully configured environment. The Cloudera Quickstart VM is available in many formats, including VMware, VirtualBox, KVM and Docker images.
To install the Cloudera Quickstart VM, see the Downloads page of the Cloudera website.
Procedure
Results
Once you have a Cloudera Hadoop Distribution running, you can follow the steps in this article to connect to Cloudera (CDH) and load data into it.
Performing the Cloudera Post-Installation Steps
Configure hosts file
- Open a Terminal window in the Cloudera VM.
- Type the command hostname. In this example, the VM hostname is quickstart.cloudera.
- Type the command ifconfig to find out the IP address of the VM.
- On Windows, edit C:\Windows\System32\drivers\etc\hosts to include the following line:
192.168.164.133 quickstart.cloudera # Your IP is most likely different.
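The hosts-file step above can be sketched in Python. This is only an illustration of the line format and of an idempotent update, not part of the Talend or Cloudera tooling. The function names (hosts_entry, has_mapping, ensure_mapping) are hypothetical, and the IP address is the example value from this article, so substitute the one reported by ifconfig. The sketch works on the file's text rather than the file itself, because editing the real hosts file requires administrator rights.

```python
import ipaddress


def hosts_entry(ip: str, hostname: str) -> str:
    """Return a hosts-file line that maps ip to hostname."""
    ipaddress.ip_address(ip)  # raises ValueError if ip is not a valid address
    return f"{ip} {hostname}"


def has_mapping(hosts_text: str, hostname: str) -> bool:
    """Return True if a non-comment line already maps hostname."""
    for line in hosts_text.splitlines():
        # Drop anything after '#', then split into address and names.
        fields = line.split("#", 1)[0].split()
        if len(fields) >= 2 and hostname in fields[1:]:
            return True
    return False


def ensure_mapping(hosts_text: str, ip: str, hostname: str) -> str:
    """Return hosts_text with the mapping appended if it is missing."""
    if has_mapping(hosts_text, hostname):
        return hosts_text
    # Add a newline first when the existing text does not end with one.
    sep = "" if hosts_text.endswith("\n") or not hosts_text else "\n"
    return hosts_text + sep + hosts_entry(ip, hostname) + "\n"


if __name__ == "__main__":
    # Example values from this article; your VM's IP is most likely different.
    current = "127.0.0.1 localhost\n"
    print(ensure_mapping(current, "192.168.164.133", "quickstart.cloudera"), end="")
```

Calling ensure_mapping a second time on its own output leaves the text unchanged, which avoids duplicate entries if the step is repeated.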
Create an HDFS user
- Open a new browser tab in the Cloudera VM and click the Hue link to start the application.
- Log in to Hue with username cloudera and password cloudera.
- Click the administration icon in the upper right corner, and click Manage Users.
- Create a user with username puccini and password puccini. Leave the Create home directory option checked.
- Make sure that the user is a member of the default and hadoop groups, then click Add user to finish.
- Click the Manage HDFS icon in the upper right corner and navigate to the /user directory to check that the puccini user was created. You should see a directory called puccini.
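As an alternative to browsing in Hue, the same check can be sketched against the WebHDFS REST API. This is a minimal sketch under several assumptions that this article does not confirm: WebHDFS is enabled on the cluster, the NameNode web port is 50070 (the usual default for Hadoop 2.x), the hostname quickstart.cloudera resolves from your machine, and simple (non-Kerberos) authentication is in use. The function names are hypothetical.

```python
import json
from urllib.parse import quote, urlencode
from urllib.request import urlopen


def webhdfs_status_url(host: str, hdfs_path: str, user: str, port: int = 50070) -> str:
    """Build a WebHDFS GETFILESTATUS URL for hdfs_path (port is an assumption)."""
    query = urlencode({"op": "GETFILESTATUS", "user.name": user})
    return f"http://{host}:{port}/webhdfs/v1{quote(hdfs_path)}?{query}"


def is_directory(url: str, timeout: float = 10.0) -> bool:
    """Return True if the path behind url is an HDFS directory.

    urlopen raises urllib.error.HTTPError (404) when the path does not exist.
    """
    with urlopen(url, timeout=timeout) as resp:
        status = json.load(resp)["FileStatus"]
    return status["type"] == "DIRECTORY"


if __name__ == "__main__":
    # Values from this article; adjust the host and user to your environment.
    url = webhdfs_status_url("quickstart.cloudera", "/user/puccini", "puccini")
    print(url)
    print(is_directory(url))  # needs a running, reachable cluster
```

The URL builder is kept separate from the network call so that the request can be inspected, or pasted into a browser, before anything is sent to the cluster.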
Working With HDFS
Prerequisites
For the purposes of this article, only Talend Studio is required.
If you do not have Talend Studio installed, follow one of the installation options available on the Talend website.
Once it is installed, start Talend Studio and create a new project, as described in the How to create a project documentation.
Create the cluster metadata - Cloudera 5.7
Before you begin
- You have opened Talend Studio.
Procedure
Create HDFS Metadata - Cloudera
Before you begin
- You have created a cluster metadata connection.
Procedure
Write Data to HDFS - Cloudera
Before you begin
- You have created an HDFS connection object based on the cluster repository connection you just created.