Enabling the HDFS High Availability feature in the Studio

EnrichVersion
6.4
EnrichProdName
Talend Real-Time Big Data Platform
Talend Data Fabric
Talend Big Data Platform
Talend Big Data
Talend Open Studio for Big Data
task
Data Governance > Third-party systems > File components (Integration) > HDFS components
Data Quality and Preparation > Third-party systems > File components (Integration) > HDFS components
Design and Development > Third-party systems > File components (Integration) > HDFS components
EnrichPlatform
Talend Studio

Enabling the HDFS High Availability feature in the Studio

The HDFS High Availability feature addresses the single point of failure in a typical Hadoop cluster, the NameNode.

This article describes how to enable the HDFS High Availability (HA) feature in your Talend Studio with Big Data.

Environment:

  • The Studio can be any of the Talend solutions with Big Data.

  • The Hadoop cluster used along with the Studio must support the HDFS HA feature. For further information, see the documentation of the Hadoop distribution you are using.

  • In the cluster to be used, the properties required by the HDFS High Availability feature must have been set in the hdfs-site.xml file by the administrator.

Finding the properties to be set

You need to find the relevant properties in the hdfs-site.xml file of the Hadoop cluster in order to replicate them in the Studio.
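
If you have command-line access to a node of the cluster, you can also look up individual values with the hdfs getconf command (a quick check only, assuming the Hadoop client tools are installed and configured on that node); otherwise, open the hdfs-site.xml file directly from the Hadoop configuration directory of the cluster.

    # Print the value of a single configuration key, for example the nameservice name:
    hdfs getconf -confKey dfs.nameservices
    # Once you know the nameservice name (nameservice1 in the example below), query the derived keys as well:
    hdfs getconf -confKey dfs.ha.namenodes.nameservice1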

Procedure

  1. Find the dfs.nameservices property in the hdfs-site.xml file.

    For example, this property might read:

    <property>
      <name>dfs.nameservices</name>
      <value>nameservice1</value>
    </property>

    The value of this property is fundamental: it defines the name of the nameservice and is used to build the names of the other properties required by the HA feature. Therefore, you need this value, nameservice1 in this example, to find the other properties to be replicated. (The resulting naming pattern is summarized after this procedure.)

  2. Use the nameservice1 value to find the following properties. Note that the value nameservice1 is used here for demonstration purposes only.
    <property>
      <name>dfs.client.failover.proxy.provider.nameservice1</name>
      <value>org.apache.hadoop.hdfs.server.namenode.ha.ConfiguredFailoverProxyProvider</value>
    </property>
    <property>
      <name>dfs.ha.namenodes.nameservice1</name>
      <value>namenode90,namenode96</value>
    </property>

    The value of the dfs.ha.namenodes.nameservice1 property, namenode90,namenode96 in this example, defines the IDs of the NameNodes in this nameservice. The IDs are separated by a comma (,).

  3. Use the NameNode IDs defined in the dfs.ha.namenodes.nameservice1 property to find the following properties:
    <property>
      <name>dfs.namenode.rpc-address.nameservice1.namenode90</name>
      <value>cdh4ha:8020</value>
    </property>
    <property>
      <name>dfs.namenode.rpc-address.nameservice1.namenode96</name>
      <value>cdh4ha2:8020</value>
    </property>

    These properties define the RPC address of each NameNode in the nameservice.
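
To summarize the pattern followed in this procedure, the names of the properties to look for in hdfs-site.xml are built from the nameservice name (nameservice1 in this example) and from the NameNode IDs (namenode90 and namenode96 in this example):

    dfs.nameservices
    dfs.client.failover.proxy.provider.<nameservice>
    dfs.ha.namenodes.<nameservice>
    dfs.namenode.rpc-address.<nameservice>.<NameNode ID>   (one property per NameNode ID)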

Results

You have found all the properties to be replicated. You now need to set them in the Hadoop properties table provided in the Studio.

Setting properties in the Studio

The Hadoop properties table is available in many components, in the Hadoop configuration view, and in the metadata wizards that create a connection to a Hadoop cluster.

In this article, we take tHDFSConnection as an example to demonstrate how to set the properties mentioned above.

Procedure

  1. After dropping a tHDFSConnection component onto the Job design workspace, double-click this component to open its Component view.
  2. In the Basic settings view of this component, configure the connection to the HDFS system to be used.
  3. Under the Hadoop properties table, click the [+] button five times to add five rows.
  4. In the Property column, enter the name of each of the properties mentioned above, one per newly added row.
  5. In the Value column, enter the value corresponding to each property, as illustrated in the example after this procedure.
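
With the example values used in this article, the completed Hadoop properties table would contain the following rows. In the Studio, entries in this table are typically typed as double-quoted strings; adapt the property names and values to your own cluster.

    Property                                            Value
    "dfs.nameservices"                                  "nameservice1"
    "dfs.client.failover.proxy.provider.nameservice1"   "org.apache.hadoop.hdfs.server.namenode.ha.ConfiguredFailoverProxyProvider"
    "dfs.ha.namenodes.nameservice1"                     "namenode90,namenode96"
    "dfs.namenode.rpc-address.nameservice1.namenode90"  "cdh4ha:8020"
    "dfs.namenode.rpc-address.nameservice1.namenode96"  "cdh4ha2:8020"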

Results

These properties are then taken into account at runtime when this component is used to connect to the Hadoop cluster.