Setting reusable Hadoop properties - 6.1

Talend Big Data Studio User Guide

When setting up a Hadoop connection, you can define a set of common Hadoop properties that will be reused by its child connections to each individual Hadoop element such as Hive, HDFS or HBase.

For example, suppose the HDFS High Availability (HA) feature has been defined in the hdfs-site.xml file of the Hadoop cluster you need to use. To enable this feature in the Studio, you must set the corresponding properties in the connection wizard. These properties can also be set in a specific Hadoop-related component; that approach is explained in the following article: https://help.talend.com/display/KB/Enabling+the+HDFS+High+Availability+feature+in+the+Studio. This section presents only the connection wizard approach.

Prerequisites:

  • The Hadoop distribution you need to use has been launched, and you have the proper access permissions to that distribution and its Oozie.

  • The High Availability properties to be set in the Studio have been defined in the hdfs-site.xml file of the cluster to be used.

In this example, the High Availability properties are:

<property>  
  <name>dfs.nameservices</name>  
  <value>nameservice1</value>
</property>
<property>
  <name>dfs.client.failover.proxy.provider.nameservice1</name>
  <value>org.apache.hadoop.hdfs.server.namenode.ha.ConfiguredFailoverProxyProvider</value>
</property>
<property>
  <name>dfs.ha.namenodes.nameservice1</name>
  <value>namenode90,namenode96</value>
</property>
<property>
  <name>dfs.namenode.rpc-address.nameservice1.namenode90</name>
  <value>hdp-ha:8020</value>
</property>
<property>
  <name>dfs.namenode.rpc-address.nameservice1.namenode96</name>
  <value>hdp-ha2:8020</value>
</property>

The values of these properties are for demonstration purposes only.
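If the cluster's hdfs-site.xml file is at hand, the name/value pairs to enter in the Studio can be pulled out of it with a short script instead of being copied by eye. The following is a minimal sketch in Python, not a Studio feature: the function name extract_ha_properties and the list of name prefixes used to recognize HA-related properties are assumptions made for illustration, and the embedded XML simply repeats the demonstration values above, wrapped in the <configuration> root element that a real hdfs-site.xml file uses.

```python
import xml.etree.ElementTree as ET

# Demonstration values from this section, wrapped in the <configuration>
# root element found in a real hdfs-site.xml file.
HDFS_SITE_XML = """
<configuration>
  <property>
    <name>dfs.nameservices</name>
    <value>nameservice1</value>
  </property>
  <property>
    <name>dfs.client.failover.proxy.provider.nameservice1</name>
    <value>org.apache.hadoop.hdfs.server.namenode.ha.ConfiguredFailoverProxyProvider</value>
  </property>
  <property>
    <name>dfs.ha.namenodes.nameservice1</name>
    <value>namenode90,namenode96</value>
  </property>
  <property>
    <name>dfs.namenode.rpc-address.nameservice1.namenode90</name>
    <value>hdp-ha:8020</value>
  </property>
  <property>
    <name>dfs.namenode.rpc-address.nameservice1.namenode96</name>
    <value>hdp-ha2:8020</value>
  </property>
</configuration>
"""

# Assumption: these prefixes are enough to pick out the HA-related
# properties used in this example; adjust them for your own cluster.
HA_PREFIXES = (
    "dfs.nameservices",
    "dfs.client.failover.proxy.provider.",
    "dfs.ha.namenodes.",
    "dfs.namenode.rpc-address.",
)


def extract_ha_properties(xml_text):
    """Return (name, value) pairs for the HA-related <property> elements."""
    root = ET.fromstring(xml_text.strip())
    rows = []
    for prop in root.iter("property"):
        name = (prop.findtext("name") or "").strip()
        value = (prop.findtext("value") or "").strip()
        if name.startswith(HA_PREFIXES):
            rows.append((name, value))
    return rows


rows = extract_ha_properties(HDFS_SITE_XML)
for name, value in rows:
    # One line per row to add to the Hadoop properties table.
    print(f"{name}\t{value}")
```

Each printed line corresponds to one row (property name and value) to be added to the Hadoop properties table in the wizard.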

To set these properties in the Hadoop connection, open the [Hadoop Cluster Connection] wizard from the Hadoop cluster node of the Repository. For further information about how to access this wizard, see Centralizing a Hadoop connection.

  1. If you have not already done so, configure the connection to the Hadoop cluster to be used, as explained in the previous sections.

  2. Click the [...] button next to Hadoop properties to open the Hadoop properties table.

  3. Add the above-listed High Availability properties to this table.

  4. Click OK to validate the changes. These properties are then listed next to the [...] button.

  5. Click the Check services button to verify the connection.

    A dialog box opens to show the progress of the check and the connection status. If it shows that the connection has failed, review and update the connection information you have defined in the connection wizard.

  6. Click Finish to validate the connection.

    When you then create a child connection from this Hadoop connection, for example to Hive, these High Availability properties are inherited by the child connection as read-only parent properties.

This way, these properties are automatically reused by any child connection of this Hadoop connection.

The image above shows these properties inherited in the Hive connection wizard. For further information about how to access the Hive connection wizard as presented in this section, see Centralizing Hive metadata.
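Because the High Availability property names depend on one another (the nameservice name appears in the other property names, and each NameNode ID needs its own RPC address), a typing mistake in the properties table can be hard to spot. The following Python sketch checks a set of name/value pairs for such gaps before you click Check services. It is an illustration only, not something the Studio runs: the helper name find_missing_ha_properties is hypothetical, and the dictionary repeats the demonstration values used in this section.

```python
# Name/value pairs as entered in the Hadoop properties table
# (demonstration values from this section).
ha_properties = {
    "dfs.nameservices": "nameservice1",
    "dfs.client.failover.proxy.provider.nameservice1":
        "org.apache.hadoop.hdfs.server.namenode.ha.ConfiguredFailoverProxyProvider",
    "dfs.ha.namenodes.nameservice1": "namenode90,namenode96",
    "dfs.namenode.rpc-address.nameservice1.namenode90": "hdp-ha:8020",
    "dfs.namenode.rpc-address.nameservice1.namenode96": "hdp-ha2:8020",
}


def split_list(value):
    """Split a comma-separated property value into trimmed, non-empty items."""
    return [item.strip() for item in value.split(",") if item.strip()]


def find_missing_ha_properties(props):
    """Return the property names that the HA naming pattern implies but that are absent."""
    services = split_list(props.get("dfs.nameservices", ""))
    if not services:
        return ["dfs.nameservices"]

    missing = []
    for ns in services:
        # Each nameservice needs a failover proxy provider.
        provider_key = f"dfs.client.failover.proxy.provider.{ns}"
        if provider_key not in props:
            missing.append(provider_key)

        # Each nameservice needs a list of NameNode IDs.
        namenodes_key = f"dfs.ha.namenodes.{ns}"
        namenodes = split_list(props.get(namenodes_key, ""))
        if not namenodes:
            missing.append(namenodes_key)

        # Each NameNode ID needs its own RPC address.
        for nn in namenodes:
            rpc_key = f"dfs.namenode.rpc-address.{ns}.{nn}"
            if rpc_key not in props:
                missing.append(rpc_key)
    return missing


missing = find_missing_ha_properties(ha_properties)
print("Missing properties:", missing)
```

An empty list means that every NameNode declared for the nameservice has an RPC address and a failover proxy provider is defined; any name in the list is a property still to be added to the table.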