Getting the data from the HDFS - 7.0

Author: Talend Documentation Team
Version: 7.0
Products: Talend Big Data, Talend Big Data Platform, Talend Data Fabric, Talend Open Studio for Big Data, Talend Real-Time Big Data Platform
Platform: Talend Studio

Procedure

  1. Double-click tHDFSGet to define the component in its Basic settings view.
  2. Select, for example, Apache 0.20.2 from the Hadoop version list.
  3. In the NameNode URI, Username, and Group fields, enter the connection parameters for HDFS. If you are using WebHDFS, the location should be webhdfs://masternode:portnumber. If this WebHDFS is secured with SSL, the scheme should be swebhdfs, and you need to use a tLibraryLoad component in the Job to load the library required by the secured WebHDFS.
  4. In the HDFS directory field, enter the location in HDFS where the loaded file is stored. In this example, it is /testFile.
  5. Next to the Local directory field, click the three-dot [...] button to browse to the folder in which to store the files extracted from HDFS. In this scenario, the directory is C:/hadoopfiles/getFile/.
  6. Click the Overwrite file field to open the drop-down list.
  7. From the menu, select always.
  8. In the Files area, click the plus button to add a row in which you define the file to be extracted.
  9. In the File mask column, replace newLine between the quotation marks with *.txt, and leave the New name column as it is. This extracts all the .txt files from the specified directory in HDFS without changing their names. In this example, the file is in.txt.
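
The effect of the File mask and New name columns can be sketched in a few lines of Python. This is only an illustration, not the component's implementation: it assumes the mask behaves like a shell-style glob, and the function plan_downloads and the sample directory listing are hypothetical names introduced for this example.

```python
from fnmatch import fnmatchcase
from pathlib import PurePosixPath


def plan_downloads(hdfs_paths, hdfs_dir, file_mask, new_name=""):
    """Return (source path, local file name) pairs for files matching the mask.

    Only files directly inside hdfs_dir are considered. An empty new_name
    keeps the original file name, mirroring an untouched New name column.
    """
    directory = PurePosixPath(hdfs_dir)
    plan = []
    for raw in hdfs_paths:
        path = PurePosixPath(raw)
        if path.parent != directory:
            continue  # skip files outside the configured HDFS directory
        if fnmatchcase(path.name, file_mask):
            plan.append((str(path), new_name or path.name))
    return plan


# Hypothetical listing; only /testFile and in.txt come from the scenario.
listing = ["/testFile/in.txt", "/testFile/notes.csv", "/other/in.txt"]
print(plan_downloads(listing, "/testFile", "*.txt"))
# prints [('/testFile/in.txt', 'in.txt')]
```

With the mask *.txt and no new name, only in.txt in /testFile is selected, and it keeps its name when written to the local directory.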