Connecting a MapR distribution to the Talend Studio using cluster metadata

author
Frédérique Martin Sainte-Agathe
EnrichVersion
6.4
6.3
6.2
6.1
EnrichProdName
Talend Open Studio for Big Data
Talend Real-Time Big Data Platform
Talend Data Fabric
Talend Big Data
Talend Big Data Platform
task
Design and Development > Designing Jobs > Hadoop distributions > MapR
EnrichPlatform
Talend Studio


This article provides step-by-step instructions for connecting the Talend Studio to a MapR distribution. The connection will allow your Jobs and components to process your data in a MapR cluster using the Spark or MapReduce framework.

To run Big Data Jobs, the Talend Studio must be connected to a running Hadoop cluster. You can either configure the connection information in each individual component or store the configuration in metadata in the Repository and reuse it in components as needed.

We’ll take the second approach, which is the more efficient of the two because the connection is configured once and then reused.

The Solution section below walks through creating the cluster metadata.

Environment

This article was written and validated using Talend Studio 6.1 to connect to a MapR 5.0 cluster.

Solution

Configuring the Studio to use the MapR client:

MapR Hadoop cluster metadata is created manually. To do this, you need information about your cluster: the Namenode URI, plus either the Resource Manager address (if you’re using YARN) or the Job Tracker URI (if you’re using MapReduce v1). You may also need other information, such as the Job History or Resource Manager Scheduler location.

1. In Studio > Repository > Metadata, right-click Hadoop Cluster, then click Create Hadoop Cluster.

2. In the Name box, enter MapRCluster and click Next. The Hadoop Configuration Import Wizard opens.

3. In the Distribution list, select MapR, and in the Version list, select MapR 5.0.0 (YARN mode).

4. Select Enter manually Hadoop services and click Finish.

The Hadoop Cluster Connection window opens.

5. Confirm that the distribution information is correct.

A few values, such as the Namenode URI and Resource Manager address, are preconfigured.

Replace the localhost value with the IP address or DNS name of your cluster. If the cluster was configured with the default port values, the Namenode uses port 7222 and the Resource Manager uses port 8032.

6. Configure the connection as follows:

Namenode URI: maprfs:///

Resource Manager: <ClusterName>:8032

Resource Manager Scheduler: <ClusterName>:8030

Job History: <ClusterName>:10020

Staging directory: /var/mapr/cluster/yarn/rm/staging

User name: <UserName>

Group name: <GroupName>

7. Review your configuration.

8. Click Check Services to verify the connection to the cluster.

If the progress bars go up to 100% with no error message, you’re connected.

9. Click Finish. Your cluster metadata appears in Repository > Metadata > Hadoop Cluster.
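If Check Services in step 8 reports an error, it can help to confirm from the Studio machine that the cluster ports accept TCP connections at all. The following is a minimal Python sketch, not part of the Talend Studio. It assumes the default ports quoted in this article; the function names port_open and check_cluster are made up for illustration, and a successful TCP connection only shows that something is listening, not that the service is healthy.

```python
import socket

# Default ports quoted in this article; adjust them if your cluster differs.
DEFAULT_PORTS = {
    "Namenode": 7222,
    "Resource Manager": 8032,
    "Resource Manager Scheduler": 8030,
    "Job History": 10020,
}


def port_open(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and name-resolution failures.
        return False


def check_cluster(host, ports=DEFAULT_PORTS):
    """Map each service name to whether its port accepts TCP connections."""
    return {name: port_open(host, port) for name, port in ports.items()}
```

Calling check_cluster with the same IP address or DNS name that you entered in the Hadoop Cluster Connection window returns a dictionary with one True or False entry per service. A False entry usually points to a wrong host name, a non-default port, or a firewall between the Studio and the cluster.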

Now that you are successfully connected, you can reuse the metadata in Jobs and components to process your data using the Spark or MapReduce framework.
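For reference, the fields filled in at step 6 look like the standard Hadoop client properties that a YARN client would otherwise read from its configuration files. The sketch below shows that correspondence in Python. The mapping between Studio field names and property names is an assumption made here for illustration, not something the Studio documents in this article, and hadoop_properties is a hypothetical helper.

```python
def hadoop_properties(cluster_host,
                      staging_dir="/var/mapr/cluster/yarn/rm/staging"):
    """Return the step 6 values keyed by standard Hadoop property names.

    The field-to-property mapping is an assumption for illustration only.
    """
    return {
        # Namenode URI
        "fs.defaultFS": "maprfs:///",
        # Resource Manager
        "yarn.resourcemanager.address": f"{cluster_host}:8032",
        # Resource Manager Scheduler
        "yarn.resourcemanager.scheduler.address": f"{cluster_host}:8030",
        # Job History
        "mapreduce.jobhistory.address": f"{cluster_host}:10020",
        # Staging directory
        "yarn.app.mapreduce.am.staging-dir": staging_dir,
    }
```

Comparing this dictionary with the values in the cluster's own configuration files is one way to spot a mismatch, for example a Resource Manager that listens on a non-default port.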

Related articles

MapR: Tips for starting with a MapR 5.0.0 sandbox