Importing data from the cluster - 7.1

Talend Data Preparation Quick Examples

Talend Big Data
Talend Big Data Platform
Talend Data Fabric
Talend Data Integration
Talend Data Management Platform
Talend Data Services Platform
Talend ESB
Talend MDM Platform
Talend Real-Time Big Data Platform
Talend Data Preparation
Data Quality and Preparation > Cleansing data
You will access your data stored on HDFS (Hadoop Distributed File System) directly from the Talend Data Preparation interface and import it as a dataset.


  1. In the Datasets view of the Talend Data Preparation homepage, click the white arrow next to the Add Dataset button.
  2. Select HDFS.

    The Add a HDFS dataset form opens.

  3. In the Dataset name field, enter the name you want to give your dataset, HDFS_dataset in this example.
  4. In the User name field, enter the name of the Linux user on the cluster.

    This user must have read permissions on the file that you want to import.

  5. For this example, leave the Use Kerberos check box unselected.

    If you choose to authenticate via Kerberos, enter your principal and the path to your keytab file.

    The keytab file must be accessible to the Spark Job Server.

    You can manually configure Talend Data Preparation to display a default value in those fields.

  6. In the Format field, select the format in which your data is stored in the cluster, .csv in this example.
  7. In the Path field, enter the complete URL of your file in the Hadoop cluster.
  8. Click Add Dataset.
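Before filling in the form, you may want to confirm that the path is correct and that the cluster user can actually read the file. The following sketch builds request URLs for Hadoop's standard WebHDFS REST API; the host name, port, user, and file path are placeholders for this example, not values prescribed by Talend Data Preparation.

```python
# Sketch: verify the HDFS path and read access before importing the dataset,
# using Hadoop's WebHDFS REST API. Host, port, user, and path below are
# hypothetical example values.
from urllib.parse import quote, urlencode

def webhdfs_url(host: str, port: int, path: str, op: str, user: str) -> str:
    """Build a WebHDFS v1 URL for the given operation (e.g. GETFILESTATUS, OPEN)."""
    query = urlencode({"op": op, "user.name": user})
    return f"http://{host}:{port}/webhdfs/v1{quote(path)}?{query}"

# Check that the file exists and which user/permissions it carries:
status_url = webhdfs_url("namenode.example.com", 9870,
                         "/user/talend/HDFS_dataset.csv",
                         "GETFILESTATUS", "talend")
print(status_url)
# Against a live cluster, an HTTP GET on this URL (for example with
# urllib.request.urlopen) returns the file's owner, group, and permissions.
```

An HTTP 403 or an `AccessControlException` in the response at this point would explain a failed import: the user entered in the form lacks read permissions on the file.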


The data extracted from the cluster opens directly in the grid, and you can start working on your preparation.

The data remains stored in the cluster and never leaves it; Talend Data Preparation only retrieves a sample on demand.

Your dataset is now available in the Datasets view of the application home page.