Cloudera - Getting Started

author
Irshad Burtally
EnrichVersion
6.5
EnrichProdName
Talend Open Studio for Big Data
Talend Real-Time Big Data Platform
Talend Data Fabric
Talend Big Data
Talend Big Data Platform
task
Design and Development > Designing Jobs > Hadoop distributions > Cloudera
EnrichPlatform
Talend Studio

Cloudera - Getting Started

This article demonstrates how to get started with Cloudera 5.7.

Prerequisites

  • You have installed and configured a Cloudera 5.7 (CDH) cluster.
  • You have installed Talend Studio.
  • The dataset used in this article (pearsonData.csv) is called Pearson’s Height Data, named for its creator, Karl Pearson, who founded the discipline of mathematical statistics in the early 1900s.

    The Pearson dataset can be downloaded here. Feel free to use your own data, keeping in mind that some aspects of this article will need to be adjusted accordingly.

Installing the Cloudera Quickstart VM

The easiest way to get started is to install the Cloudera Quickstart VM, which is delivered as a fully configured environment. The Cloudera Quickstart VM is available in many formats, including VMware, VirtualBox, KVM, and Docker images.

To install the Cloudera Quickstart VM, see the Downloads page of the Cloudera website.

Procedure

  1. For this article, you are going to use the Cloudera VMware image.
  2. Once the image is downloaded, extract it to a folder on your disk using 7-Zip.
  3. In VMware, click File > Open to open the VMware configuration file, cloudera-quickstart-vm-5.7.0-0-vmware.vmx.
  4. Configure the VM to use 8 GB of RAM and 4 CPUs.
    Tip: If your machine does not have enough CPU cores, you can use 2 CPUs.
  5. Start the VM.
    Once started, you are automatically logged in to the VM, and the browser opens the Welcome page, as shown below.

    The Cloudera website provides extensive instructions on how to get started with the Cloudera Quickstart VM. If you encounter any problems, refer to the Cloudera online help and Getting Started guide.

    Warning: Because you are using the Cloudera Quickstart VM on a workstation with only 16 GB of RAM, you are NOT going to launch Cloudera Manager. According to the documentation on the Cloudera website, Cloudera Manager may need more than 8 GB of RAM to run properly.

    Note that the steps demonstrated in this article are not limited to the Cloudera Quickstart VM. They can also be applied to a typical multi-node deployment of a Cloudera cluster on Amazon Web Services or on-premises.

Results

Once you have a Cloudera Hadoop distribution (CDH) running, you can follow the steps demonstrated in this article to connect to it and load data into the cluster.

Performing the Cloudera Post-Installation Steps

There are two post-installation steps you need to perform before you can connect to the cluster from Talend Studio.

Configure the hosts file

  1. Open a Terminal window in the Cloudera VM.
  2. Type the command hostname.
    As shown below, the VM hostname in this example is quickstart.cloudera.
  3. Type the command ifconfig to find out the IP address of the VM.
  4. On Windows, edit C:\Windows\System32\drivers\etc\hosts and add an entry that maps the VM IP address to its hostname, for example: 192.168.164.133 quickstart.cloudera (your IP address will most likely be different).
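
    To confirm that name resolution works from your workstation, you can ping the VM by name. This is an optional sanity check, and the IP address shown assumes the example entry above.

      ping quickstart.cloudera
      # Expect replies from the VM IP address, for example 192.168.164.133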

Create an HDFS user

  1. Open a new tab in the browser in the Cloudera VM, and click the Hue link (as shown below) to start the application.
  2. Log in to Hue with the username cloudera and password cloudera.
  3. Click the administration icon in the upper right corner, and click Manage Users.
  4. Create a user with username puccini and password puccini. Leave the option Create home directory checked.
  5. Make sure that the user is part of the default and hadoop groups, and click Add user to finish.
  6. Click the Manage HDFS icon in the upper right corner, and then navigate to the /user directory to check that the puccini user was created. You should see a directory called puccini.
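
    Alternatively, you can run this check from a Terminal window in the VM using the standard HDFS command-line client. This is an optional verification step, not part of the procedure itself.

      # List the /user directory as the hdfs superuser
      sudo -u hdfs hdfs dfs -ls /user
      # You should see an entry for /user/puccini owned by puccini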

Working With HDFS

Prerequisites

For the purposes of this article, only Talend Studio is required.

If you do not already have Talend Studio installed, follow one of the installation options available on the Talend website.

Once installed, start your Talend Studio and create a new project, as described in the How to create a project documentation.

Create the cluster metadata - Cloudera 5.7

Before you begin

  • You have opened Talend Studio.

Procedure

  1. In the Repository, expand Metadata and then right-click Hadoop Cluster.
  2. Select Create Hadoop Cluster from the contextual menu to open the [Hadoop Cluster Connection] wizard.
  3. Fill in generic information about this connection, such as Name, Purpose and Description and click Next.
  4. In the Distribution area, select Cloudera as the Distribution, Cloudera CDH5.7(YARN mode) as the Version, and then select the option Enter manually Hadoop services.
    Note: You are not retrieving the configuration automatically from Cloudera Manager because, due to its high memory requirements, you have not launched and configured Cloudera Manager in your VM.
  5. Click Finish.
  6. Define the connection parameters as below (typical example values are listed after this procedure).
    • Change localhost to quickstart.cloudera.
    • Set the Staging directory to /user/puccini.
    • Set the Username to puccini.
  7. Click Check Services to test that the Studio can connect to Cloudera, then click Close.
    Note: Success is indicated by a 100% connection to both the Namenode and Resource Manager.
  8. Click Finish to close the [Hadoop Cluster Connection] wizard.
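
For reference, on a default CDH 5.7 Quickstart installation, the connection fields typically resolve to values like the following. The ports shown are the standard CDH defaults; confirm them against your own cluster configuration before relying on them.

  Namenode URI:               hdfs://quickstart.cloudera:8020
  Resource Manager:           quickstart.cloudera:8032
  Resource Manager Scheduler: quickstart.cloudera:8030
  Job History:                quickstart.cloudera:10020
  Staging directory:          /user/puccini
  Username:                   puccini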

Create HDFS Metadata - Cloudera

Before you begin

  • You have created a cluster metadata connection.

Procedure

  1. In the Repository, right-click your cluster metadata, in this example Cloudera57_Cluster, and click Create HDFS.
  2. Name your new connection, in this example HDFS57, and click Next.
  3. Click the Check button to verify that your connection is successful.

    The check should return a connection successful message, as shown below. If it fails, see the connectivity tip after this procedure.

  4. Click Finish to complete the steps.
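
If the check fails, a quick way to rule out a network problem is to verify that the NameNode port is reachable from your workstation. The example below assumes the default NameNode port 8020; on Windows, you may need to enable the Telnet client first.

  telnet quickstart.cloudera 8020
  # If the port is open, a blank session opens; otherwise the connection is refused or times out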

Write Data to HDFS - Cloudera

Before you begin

  • You have created an HDFS connection object that leverages the cluster repository connection you just created.

Procedure

  1. In the Studio, create a new Standard Job.
  2. Drag the HDFS metadata from the Repository to the design space and select tHDFSPut.
  3. In the Component view of the tHDFSPut, specify the local file that you want to write to HDFS.
  4. Run the Job, then verify that the file has been moved to HDFS.
    Use Hue for the verification, or the HDFS command line as shown after this list. If you get an error about winutils, see The missing winutils.exe program in the Big Data Jobs.
  5. If you need to read data from HDFS, you can use the tHDFSInput component to pull data from HDFS and write it to the local file system, or the tLogRow component to view the data in the Run Job panel.
    Note: With the tHDFSInput component, you can also view the data with Data Viewer, by right-clicking the component.
    In this example, add a tLogRow component, and configure the parameters of the tHDFSInput component as follows.
    • Header is set to 1.
    • Field Separator is set to ','.
    • The schema contains two columns, named Father and Son. Both columns are of type float and are not nullable.
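
If you prefer the command line to Hue, you can also inspect the uploaded file from a Terminal window in the VM. The path below assumes that pearsonData.csv was written to the puccini home directory; adjust it to match your tHDFSPut configuration.

  # List the user's home directory in HDFS
  hdfs dfs -ls /user/puccini
  # Print the first few lines of the uploaded file
  hdfs dfs -cat /user/puccini/pearsonData.csv | head -5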