Cloudera - Getting Started
- You have installed and configured a Cloudera 5.7 cluster (CDH).
- You have installed Talend Studio.
- The dataset used (pearsonData.csv) in this article is called Pearson’s Height Data,
named for its creator Karl Pearson who, in the early 1900’s, founded the Mathematical
The Pearson dataset can be downloaded here. Feel free to use your own data, being mindful that aspects of this article will need to be adjusted.
Installing the Cloudera Quickstart VM
The easiest way to get started is to install the Cloudera Quickstart VM which is delivered as a fully configured environment. The Cloudera Quickstart VM is available in many formats including VMWare, VirtualBox, KVM and Docker images.
To install the Cloudera Quickstart VM, see the Downloads page of the Cloudera website.
- For this article, you are going to use the Cloudera VMWare image.
- Once downloaded, unzip the image in a folder on your disk using 7-Zip.
- In VMWare, click cloudera-quickstart-vm-5.7.0-0-vmware.vmx to open the VMWare configuration file.
Configure the VM to use 8GB RAM and 4 CPUs.
Tip: If your workstation does not have enough CPUs, you can use 2 CPUs.
Start the VM.
Once started, you will be logged into the VM and the browser will open up the Welcome page as shown below.
The Cloudera website provides extensive instructions on how to get started with the Cloudera Quickstart VM. If you encounter any problem, please refer to the Cloudera online help and Getting Started.
Warning: Since you are using the Cloudera Quickstart VM on a workstation with only 16 GB of RAM, you are NOT going to launch Cloudera Manager. According to the documentation on the Cloudera website, shown below, Cloudera Manager may need more than 8 GB of RAM to run properly.
Please note that the steps demonstrated in this article below are not limited to the Cloudera Quickstart VM. They can also be applied to a typical deployment of a Cloudera Cluster in Amazon Web Services or on-premise with many nodes.
Once you have a Cloudera Hadoop Distribution running, you will be able to follow the steps demonstrated in this article to connect and load data into Cloudera (CDH).
Performing the Cloudera Post-Installation Steps
Configure hosts file
- Open a Terminal window in the Cloudera VM.
- Type the command hostname. As can be seen below, our VM hostname is quickstart.cloudera.
- Type the command ifconfig to find out the IP address of the VM.
- On Windows, modify C:\Windows\System32\drivers\etc\hosts to include
192.168.164.133 quickstart.cloudera # Your IP is most likely different.
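The hosts-file step above can be checked with a small script. The following is a minimal sketch, not part of the original procedure: the function names are hypothetical, and the sample text stands in for the contents of your own hosts file.

```python
def hosts_entry(ip: str, hostname: str) -> str:
    """Format one hosts-file line mapping an IP address to a hostname."""
    return f"{ip} {hostname}"


def has_mapping(hosts_text: str, hostname: str) -> bool:
    """Return True if a non-comment line in hosts_text names hostname."""
    for line in hosts_text.splitlines():
        # Drop any trailing comment, then split on whitespace.
        fields = line.split("#", 1)[0].split()
        # fields[0] is the IP address; the remaining fields are its names.
        if len(fields) >= 2 and hostname in fields[1:]:
            return True
    return False


# Illustrative hosts-file content; your IP address is most likely different.
sample = "127.0.0.1 localhost\n192.168.164.133 quickstart.cloudera # VM\n"
print(has_mapping(sample, "quickstart.cloudera"))
```

To run the check against the real file on Windows, read C:\Windows\System32\drivers\etc\hosts into a string and pass it as hosts_text.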
Create an HDFS user
- Open a new tab in the browser in the Cloudera VM, and click the Hue link (as shown below) to start the application.
- Log in to Hue with username cloudera and
- Click the administration icon in the upper right corner, and click Manage Users.
- Create a user with username puccini and password
puccini. Leave the option Create home
- Make sure that the user is part of the default and hadoop groups and click Add user to finish.
- Click the Manage HDFS icon in the upper right corner and navigate to the /user directory to check that the puccini user was created. You should see a directory called puccini.
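The existence of the /user/puccini directory can also be checked outside Hue through the WebHDFS REST API. The sketch below rests on assumptions this article does not state: that WebHDFS is enabled, that the NameNode web port is 50070 (the Hadoop 2 default), and that the cluster is not secured with Kerberos. The function names are hypothetical, and the sample response is abbreviated to the two fields the code reads.

```python
import json


def liststatus_url(host: str, path: str, port: int = 50070) -> str:
    """Build a WebHDFS LISTSTATUS URL (port 50070 is an assumed default)."""
    return f"http://{host}:{port}/webhdfs/v1{path}?op=LISTSTATUS"


def directory_names(liststatus_json: str) -> list:
    """Extract the names of the directories from a LISTSTATUS response body."""
    statuses = json.loads(liststatus_json)["FileStatuses"]["FileStatus"]
    return [s["pathSuffix"] for s in statuses if s["type"] == "DIRECTORY"]


# Abbreviated, illustrative response; a real one carries more fields per entry.
sample_response = json.dumps({
    "FileStatuses": {"FileStatus": [
        {"pathSuffix": "cloudera", "type": "DIRECTORY"},
        {"pathSuffix": "puccini", "type": "DIRECTORY"},
    ]}
})

print(liststatus_url("quickstart.cloudera", "/user"))
print("puccini" in directory_names(sample_response))
```

Fetching the printed URL with a browser or an HTTP client returns the JSON body to pass to directory_names.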
Working With HDFS
For the purposes of this article, only Talend Studio is required.
If you do not have Talend Studio installed, follow one of the installation options available on the Talend website.
Once installed, start your Talend Studio and create a new project, as described in the How to create a project documentation.
Create the cluster metadata - Cloudera 5.7
Before you begin
- You have opened Talend Studio.
- In the Repository, expand Metadata and then right-click Hadoop Cluster.
- Select Create Hadoop Cluster from the contextual menu to open the [Hadoop Cluster Connection] wizard.
Fill in generic information about this connection, such as
Name, Purpose and
Description and click
In the Distribution area, select Cloudera as the Distribution, Cloudera CDH5.7(YARN mode) as the Version, and then select the option Enter manually Hadoop services.
Note: You are not automatically retrieving the configuration from Cloudera Manager because you have not launched and configured the Cloudera Manager in your VM due to the high memory requirements when starting Cloudera Manager.
- Click Finish.
Define the connection parameters as below.
- Change localhost to quickstart.cloudera.
- Set the Staging directory to /user/puccini.
- Set the Username to
Click Check Services to test that the Studio can connect to Cloudera, then click Close.
Note: Success is indicated by a 100% connection to both the Namenode and Resource Manager.
- Click Finish to close the [Hadoop Cluster Connection] wizard.
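If Check Services reports a failure, it can help to confirm that the two services are reachable from the Studio machine at all. The following is a rough sketch rather than what the Studio does internally: it only tests that a TCP connection can be opened. The ports 8020 (NameNode) and 8032 (Resource Manager) are assumed defaults, so use the values shown in your own wizard; the function names are hypothetical.

```python
import socket


def can_connect(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port opens within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and name-resolution failures.
        return False


# Assumed default ports; replace them with the ports from the wizard.
SERVICES = {"NameNode": 8020, "ResourceManager": 8032}


def check_services(host: str, services: dict = SERVICES) -> dict:
    """Map each service name to whether its port accepts connections."""
    return {name: can_connect(host, port) for name, port in services.items()}
```

Calling check_services("quickstart.cloudera") returns a dictionary with one boolean per service. A False value for both usually points to the hosts-file entry or the VM network rather than to Hadoop itself.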
Create HDFS Metadata - Cloudera
Before you begin
- You have created a cluster metadata connection.
- In the Repository, right-click your cluster metadata, in this example Cloudera57_Cluster, and click Create HDFS.
- Name your new connection, in this example HDFS57, and click Next.
Click the Check button to verify that your connection is successful.
The check should return with a connection successful message as shown below.
- Click Finish to complete the steps.
Write Data to HDFS - Cloudera
Before you begin
- You have created an HDFS connection object leveraging the cluster repository connection you just created.
- In the Studio, create a new Standard Job.
- Drag the HDFS metadata from the Repository to the design space and select tHDFSPut.
In the Component view of the tHDFSPut, specify the file you want to write to HDFS.
Run the Job, then verify that the file has been written to HDFS.
Use Hue for the verification. If you get an error about winutils, see The missing winutils.exe program in the Big Data Jobs.
If you need to read data from HDFS, you can use the tHDFSInput component to pull data from HDFS and drop it on the local file system, or the tLogRow component to view the data in the Run Job panel.
Note: With the tHDFSInput component, you can also view the data with Data Viewer, by right-clicking the component.
In this example, add a tLogRow component and configure its parameters as follows.
- Header is set to 1.
- Field Separator is set to ','.
- The schema contains two columns named Father and Son. Both columns have the data type float and are not nullable.
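To make these settings concrete, the sketch below applies the same three parameters (one header row, a comma separator, two non-null float columns) to CSV text in plain Python. It only illustrates what the settings mean and is not the code the Studio generates. The function name is hypothetical and the sample rows are made-up values, not taken from pearsonData.csv.

```python
import csv
import io


def read_heights(text: str, header_rows: int = 1, separator: str = ",") -> list:
    """Parse CSV text into (father, son) float pairs, skipping header rows."""
    reader = csv.reader(io.StringIO(text), delimiter=separator)
    rows = []
    for index, record in enumerate(reader):
        # Skip the header rows and any blank lines.
        if index < header_rows or not record:
            continue
        # Unpacking raises ValueError unless there are exactly two fields.
        father, son = record
        # float("") raises ValueError, which enforces the not-null rule.
        rows.append((float(father), float(son)))
    return rows


# Made-up sample rows in the Father,Son layout described above.
sample = "Father,Son\n65.0,59.8\n63.3,63.2\n"
for father, son in read_heights(sample):
    print(father, son)
```

A row with a missing or non-numeric value stops the parse with a ValueError, which mirrors a schema where neither column may be null.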