MapR - Getting Started
- You have installed and configured MapR 5.1 (or later).
- You have installed Talend Studio.
- The dataset used in this article (pearsonData.csv) is called Pearson's Height Data, named for its creator Karl Pearson, who, in the early 1900s, helped found the discipline of mathematical statistics. You can download the Pearson dataset here. Feel free to use your own data, keeping in mind that parts of this article will need to be adjusted accordingly.
Installing MapR Sandbox
The easiest way to install MapR is to download the fully configured MapR Sandbox. Delivered as a virtual machine, this fully configured environment provides a wealth of features that are part of the Hadoop ecosystem.
The VM is available in two formats: VMware and VirtualBox.
You can find complete instructions on installing and setting up the sandbox here.
Performing the MapR Post-Installation Steps
Create the OS user
- At the VM console, press Alt+F2 and log in as root with the password mapr.
- Add the puccini user with the following command:
adduser -G mapr --home /user/puccini puccini
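If you want to confirm the account, you can check it from the same root session on the sandbox (the user, group, and home directory names are taken from the command above):

```shell
id puccini                  # the mapr group should appear in the user's groups
grep puccini /etc/passwd    # the home directory should be /user/puccini
```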
Create an HDFS user
- The VM console will identify the URL to manage your sandbox. To view the banner page, press Alt+F1 at the console.
- Open a browser, then navigate to the URL provided and launch Hue. The default sandbox username and password is mapr/mapr.
- Click the administration icon in the upper right corner, and click Manage.
- Create a user with username puccini and password puccini. Leave the option Create home directory checked.
- Make sure that the user is part of the default group, then click Add user to finish.
- Click the Manage HDFS icon in the upper right corner, then navigate to the /user directory to check that the puccini user was created. You should see a directory called puccini.
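As an alternative to the Hue file browser, you can check for the new home directory from a shell on the sandbox (assuming the hadoop client is on the PATH there):

```shell
hadoop fs -ls /user    # the listing should include a puccini directory
```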
Working With HDFS
For the purposes of this article, only Talend Studio is required.
If you do not have Talend installed already, follow one of the installation options available on the Talend website.
Once installed, start your Talend Studio and create a new project, as described in the How to create a project documentation.
Create the cluster metadata - MapR 5.2
Before you begin
- You have opened Talend Studio.
- You have installed the MapR Client in order to connect to the MapR Sandbox.
Follow the instructions for your client operating system here. When running the configure script, the default Sandbox cluster name is demo.mapr.com, and the IP address can be found on the VM console banner page (press Alt+F1 at the console). For example, on Windows you can run the configure script as follows:
server\configure.bat -N demo.mapr.com -c -C 192.168.111.134:7222
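On Linux or macOS, the equivalent script is configure.sh under the client installation (the path below assumes the default /opt/mapr install location, and the IP address is an example; substitute the one shown on your VM console banner):

```shell
/opt/mapr/server/configure.sh -N demo.mapr.com -c -C 192.168.111.134:7222
```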
- In the Repository, expand Metadata and then right-click Hadoop Cluster.
- Select Create Hadoop Cluster from the contextual menu to open the [Hadoop Cluster Connection] wizard.
- Fill in generic information about this connection, such as Name, Purpose, and Description, and click Next.
- In the Distribution area, select MapR as the Distribution, select MapR 5.2.0 (YARN mode) or another version as appropriate as the Version, and then select the option Enter manually Hadoop services.
- Click Finish.
Define the connection parameters as below.
- Change localhost to maprfs:///.
- Set Staging directory to /user/puccini.
- Set User name to puccini and Group to default.
- Click Check Services to test that the Studio can connect.
Note: At this point, Talend may prompt you to install the maprfs JAR file. You can navigate to the appropriate JAR file under the MapR Client installation you created above, for example on Windows: C:\opt\mapr\lib\maprfs-5.2.0-mapr.jar.
Note: Success is indicated by a 100% connection to both the NameNode and the Resource Manager.
- Click Close.
- Click Finish to close the [Hadoop Cluster Connection] wizard.
Create HDFS Metadata - MapR
Before you begin
- You have created a cluster metadata connection.
- In the Repository, right-click your cluster metadata, in this example MapR52_Cluster, and click Create HDFS.
- Name your new connection, in this example HDFS52, and click Next.
- Click the Check button to verify that your connection is successful. The check should return a connection successful message.
- Click Finish to complete the steps.
Write Data to HDFS - MapR
Before you begin
- You have created an HDFS connection object leveraging the cluster repository connection you just created.
- In the Studio, create a new Standard Job.
- Drag and drop the HDFS metadata, HDFS52, from the Repository to the design space and select tHDFSPut.
- In the Component view of the tHDFSPut, specify the file you want to upload.
- Run the Job, then verify that the file has been moved to HDFS.
Use Hue for the verification. If you get an error about winutils, see The missing winutils.exe program in the Big Data Jobs.
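If you prefer the command line to Hue, a configured MapR client can list the target directory. The file name below assumes you uploaded the Pearson dataset with tHDFSPut; adjust it to whatever file you chose:

```shell
hadoop fs -ls /user/puccini
hadoop fs -cat /user/puccini/pearsonData.csv | head -5
```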
If you need to retrieve data from HDFS, you can use the tHDFSInput component to pull data from HDFS and drop it on the local file system, or the tLogRow component to view the data in the Run Job panel.
Note: With the tHDFSInput component, you can also view the data with the Data Viewer by right-clicking the component. In this example, add a tLogRow component and configure its parameters as follows.
- Header is set to 1.
- Field Separator is set to ','.
- The schema contains two columns named Father and Son. Both columns have their data type set to float and are not nullable.
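The settings above can be sketched outside the Studio. The short script below builds a three-row sample in the same shape as pearsonData.csv (an assumption: two comma-separated float columns under a single header row; the values are made up) and parses it following the same conventions as the tLogRow configuration, skipping one header row (Header = 1) and splitting on ',' (Field Separator):

```shell
# Build a small sample in the pearsonData.csv format (hypothetical values).
cat > /tmp/pearson_sample.csv <<'EOF'
Father,Son
65.0,59.8
63.3,63.2
65.0,63.3
EOF

# Header = 1: skip one header row; Field Separator = ','.
# Print each Father/Son pair roughly the way tLogRow would, then a row count.
awk -F',' 'NR > 1 { printf "Father=%s Son=%s\n", $1, $2; n++ }
           END    { printf "rows=%d\n", n }' /tmp/pearson_sample.csv
```

Running this prints one line per data row followed by rows=3, which mirrors the two-column float schema the Job expects.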