MapR - Getting Started

author: Michael Verrilli
EnrichVersion: 6.5
EnrichProdName: Talend Open Studio for Big Data, Talend Big Data Platform, Talend Big Data, Talend Real-Time Big Data Platform, Talend Data Fabric
task: Design and Development > Designing Jobs > Hadoop distributions > MapR
EnrichPlatform: Talend Studio

This article demonstrates how to get started with MapR 5.1 (or later).

Prerequisites

  • You have installed and configured MapR 5.1 (or later).
  • You have installed Talend Studio.
  • The dataset used in this article (pearsonData.csv) is Pearson’s Height Data, named for its creator Karl Pearson, who founded the discipline of mathematical statistics in the early 1900s.

    You can download the Pearson dataset here. You can also use your own data, but you will then need to adjust some of the steps in this article.

Installing MapR Sandbox

The easiest way to install MapR is to download the MapR Sandbox. Delivered as a virtual machine, this fully configured environment provides many of the features of the Hadoop ecosystem.

The VM is available in two formats: VMware and VirtualBox.

You can find complete instructions on installing and setting up the sandbox here.

Note: The steps demonstrated in this article aren’t limited to the MapR Sandbox. They can also be applied to a typical deployment of MapR 5.1.

Performing the MapR Post-Installation Steps

Create the OS user

  1. At the VM console, press Alt+F2 and log in as root with the password mapr.
  2. Add the puccini user with the following command: adduser -G mapr --home /user/puccini puccini
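
To confirm that the account was created as intended, you can inspect the OS user database. The following is a minimal sketch, assuming a Python interpreter is available on the machine where you ran adduser; the describe_user helper is a hypothetical name introduced here for illustration and is not part of MapR or Talend.

```python
import grp
import pwd


def describe_user(name):
    """Return home directory and group names for an OS user, or None if absent."""
    try:
        entry = pwd.getpwnam(name)
    except KeyError:
        return None
    # The primary group comes from the passwd entry; fall back to the numeric
    # id if the group database has no name for it.
    try:
        primary = grp.getgrgid(entry.pw_gid).gr_name
    except KeyError:
        primary = str(entry.pw_gid)
    # Supplementary groups (such as the one passed with -G) list the user as a member.
    supplementary = sorted(g.gr_name for g in grp.getgrall() if name in g.gr_mem)
    return {"home": entry.pw_dir, "primary_group": primary, "groups": supplementary}


if __name__ == "__main__":
    info = describe_user("puccini")
    if info is None:
        print("puccini does not exist yet")
    else:
        print(info)
```

If the adduser command above succeeded, the result for puccini should report /user/puccini as the home directory and include mapr among the groups.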

Create an HDFS user

  1. The banner page of the VM console shows the URL for managing your sandbox. To view the banner page, press Alt+F1 at the console.
  2. Open a browser, then navigate to the URL provided and launch Hue. The default sandbox username and password are mapr/mapr.
  3. Click the administration icon in the upper right corner, and click Manage Users.
  4. Create a user with username puccini and password puccini. Leave the option Create home directory checked.
  5. Make sure that the user is part of the default group and click Add user to finish.
  6. To check that the puccini user has been created, click the Manage HDFS icon in the upper right corner and navigate to the /user directory. You should see a directory called puccini.

Working With HDFS

Prerequisites

For the purposes of this article, only Talend Studio is required.

If you do not have Talend Studio installed, follow one of the installation options available on the Talend website.

Once it is installed, start Talend Studio and create a new project, as described in the How to create a project documentation.

Create the cluster metadata - MapR 5.2

Before you begin

  • You have opened Talend Studio.
  • You have installed the MapR Client in order to connect to the MapR Sandbox from Talend.

    Follow the instructions for your client operating system here. When running the configure script, the default Sandbox cluster name is demo.mapr.com and the IP address can be found on the VM Console banner page (found by pressing Alt+F1 at the console).

    For example, to run the configure script on Windows:
    server\configure.bat -N demo.mapr.com -c -C 192.168.111.134:7222
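
Before continuing, it can help to confirm that the address passed to the configure script is reachable from the machine running Talend Studio. The following is a minimal sketch using only the Python standard library; the can_connect helper is a hypothetical name introduced here, and the IP address is the one from the example above, which you should replace with the address shown on your own banner page. A successful TCP connection only shows that the port is open, not that the cluster is correctly configured.

```python
import socket


def can_connect(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds within timeout seconds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, unreachable hosts, and timeouts.
        return False


if __name__ == "__main__":
    # Host and port taken from the configure example above; adjust to your sandbox.
    print(can_connect("192.168.111.134", 7222))
```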

Procedure

  1. In the Repository, expand Metadata and then right-click Hadoop Cluster.
  2. Select Create Hadoop Cluster from the contextual menu to open the [Hadoop Cluster Connection] wizard.
  3. Fill in generic information about this connection, such as Name, Purpose and Description and click Next.
  4. In the Distribution area, select MapR as the Distribution and MapR 5.2.0 (YARN mode), or another version as appropriate, as the Version. Then select the option Enter manually Hadoop services.
  5. Click Finish.
  6. Define the connection parameters as below.
    • Change localhost to maprfs:///.
    • Set Staging directory to /user/puccini.
    • Set User name to puccini and Group to default.
  7. Click Check Services to test that the Studio can connect to MapR.
    Note: At this point, Talend may prompt you to install the maprfs JAR file. Navigate to the appropriate JAR file under the MapR Client installation that you set up earlier, for example C:\opt\mapr\lib\maprfs-5.2.0-mapr.jar on Windows.
    Note: Success is indicated by a 100% connection to both the Namenode and Resource Manager.
  8. Click Close.
  9. Click Finish to close the [Hadoop Cluster Connection] wizard.

Create HDFS Metadata - MapR

Before you begin

  • You have created a cluster metadata connection.

Procedure

  1. In the Repository, right-click your cluster metadata, in this example MapR52_Cluster, and click Create HDFS.
  2. Name your new connection, in this example HDFS52, and click Next.
  3. Click the Check button to verify that your connection is successful.

    The check should return a message confirming that the connection is successful.

  4. Click Finish to complete the steps.

Write Data to HDFS - MapR

Before you begin

  • You have created an HDFS connection object based on the cluster repository connection you just created.

Procedure

  1. In the Studio, create a new Standard Job.
  2. Drag and drop the HDFS metadata, HDFS52, from the Repository to the design space and select tHDFSPut.
  3. In the Component view of the tHDFSPut, specify the file you want to upload.
  4. Run the Job, then verify that the file has been uploaded to HDFS.
    You can use Hue for the verification. If you get an error about winutils, see The missing winutils.exe program in the Big Data Jobs.
  5. If you need to retrieve data from HDFS, you can use the tHDFSInput component to pull the data from HDFS, and then either write it to the local file system or use the tLogRow component to view it in the Run Job panel.
    Note: With the tHDFSInput component, you can also view the data with Data Viewer, by right-clicking the component.
    In this example, add a tLogRow component and configure its parameters as follows.
    • Header is set to 1.
    • Field Separator is set to ','.
    • The schema contains two columns named Father and Son. Both columns have the data type float and are not nullable.
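
To make the effect of these settings concrete, the following minimal sketch reads comma-separated text the way the parameters above describe: it skips one header row, splits on ',', and converts both columns to float, failing on empty values. It is an illustration in plain Python, not the code that Talend generates; the read_father_son helper is a hypothetical name, and the sample values are made up rather than taken from pearsonData.csv.

```python
import csv
import io


def read_father_son(text, header_rows=1, separator=","):
    """Parse delimited text into (father, son) float pairs, skipping header rows."""
    reader = csv.reader(io.StringIO(text), delimiter=separator)
    rows = []
    for index, record in enumerate(reader):
        if index < header_rows or not record:
            continue  # skip the header row(s) and blank lines
        father, son = record  # exactly two columns are expected
        # float() raises ValueError on an empty field, mirroring "not null".
        rows.append((float(father), float(son)))
    return rows


if __name__ == "__main__":
    # Made-up values for illustration only.
    sample = "Father,Son\n65.0,59.8\n63.3,63.2\n"
    for father, son in read_father_son(sample):
        print(father, son)
```

If you use your own dataset, the header count, separator, and column list are the parts that need to change.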