Adding a dataset from HDFS - 6.5

Talend Data Preparation User Guide

You can access data stored on HDFS (Hadoop Distributed File System) directly from the Talend Data Preparation interface and import it as a dataset.


  1. In the Datasets view of the Talend Data Preparation homepage, click the white arrow next to the Add Dataset button.
  2. Select From HDFS.

    The Add an HDFS dataset form opens.

  3. In the Dataset name field, enter the name you want to give your dataset.
  4. In the User name field, enter your Linux user name.

    This user must have read permission on the file that you want to import.

  5. To enable Kerberos authentication, select the Use Kerberos check box.
  6. In the Principal field, enter the name of the service principal.
  7. In the Keytab file field, enter the location of your keytab file.

    The keytab file must be accessible to the Spark Job Server.

    You can manually configure Talend Data Preparation to display a default value in those fields.

  8. In the Format field, select the format that corresponds to the file that you want to import.

    For HDFS files, Talend Data Preparation supports CSV, AVRO and PARQUET.

    If you choose CSV, select the record delimiter and field delimiter used for the file you want to import.

  9. In the Path field, enter the complete URL of your file in the Hadoop cluster.
  10. Click the Add Dataset button.
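
Before filling in the Path and Format fields, it can help to sanity-check the URL you plan to enter. The following Python snippet is a minimal sketch, not part of Talend Data Preparation. It assumes that the complete URL follows the usual hdfs://host:port/path shape and that the file extension reflects the file format, which may not hold on your cluster. The function name check_hdfs_url and the host name in the example are hypothetical.

```python
from pathlib import PurePosixPath
from urllib.parse import urlparse

# Formats that the guide lists as supported for HDFS files.
# Assumption: the file extension matches the actual file format.
SUPPORTED_FORMATS = {".csv": "CSV", ".avro": "AVRO", ".parquet": "PARQUET"}


def check_hdfs_url(url: str) -> str:
    """Return the likely Format value for a complete HDFS file URL.

    Raises ValueError if the URL does not look like hdfs://host[:port]/path
    or if the extension is not one of the three supported formats.
    """
    parts = urlparse(url)
    if parts.scheme != "hdfs":
        raise ValueError(f"expected an hdfs:// URL, got scheme {parts.scheme!r}")
    if not parts.hostname:
        raise ValueError("URL has no NameNode host")
    if not parts.path or parts.path == "/":
        raise ValueError("URL has no file path")
    suffix = PurePosixPath(parts.path).suffix.lower()
    if suffix not in SUPPORTED_FORMATS:
        raise ValueError(f"cannot infer a supported format from extension {suffix!r}")
    return SUPPORTED_FORMATS[suffix]


# Hypothetical host and path, for illustration only.
print(check_hdfs_url("hdfs://namenode.example.com:8020/user/alice/customers.csv"))  # prints CSV
```

A path without the hdfs:// scheme and host, such as /user/alice/customers.csv, is rejected by this check. Whether the form itself accepts such a path is not covered here.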


The data extracted from the cluster opens directly in the grid, and you can start working on your preparation.

The data remains stored in the cluster and does not leave it; Talend Data Preparation only retrieves a sample on demand.

Your dataset is now available in the Datasets view of the application home page.
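
To illustrate what working on a sample means for a CSV file with a custom field delimiter, the Python snippet below reads only the first few records of a delimited stream and stops. It is a sketch of the general idea under stated assumptions, not a description of how Talend Data Preparation or the Spark Job Server retrieve samples. The function name sample_rows, the semicolon delimiter, and the data are all made up for the example.

```python
import csv
import io
from itertools import islice


def sample_rows(stream, field_delimiter=";", sample_size=3):
    """Read at most sample_size records from a text stream, then stop.

    Python's csv reader recognizes common record delimiters (\\n and \\r\\n)
    on its own, so only the field delimiter is configurable here.
    """
    reader = csv.reader(stream, delimiter=field_delimiter)
    return list(islice(reader, sample_size))


# Stand-in for a large file stored on the cluster; the data is invented.
data = io.StringIO("id;name\n1;Ada\n2;Grace\n3;Edsger\n4;Barbara\n")
print(sample_rows(data))
# prints [['id', 'name'], ['1', 'Ada'], ['2', 'Grace']]
```

If the wrong field delimiter is passed, each record comes back as a single unsplit field, which is a quick way to spot a mismatch with the delimiters selected in the form.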