Importing data from the cluster

Preparing an HDFS-based dataset

You can access data stored on HDFS (Hadoop Distributed File System) directly from the Talend Data Preparation interface and import it as a dataset.

Procedure

  1. In the Datasets view of the Talend Data Preparation homepage, click the white arrow next to the Add Dataset button.
  2. Select From HDFS.

    The Add a HDFS dataset form opens.

  3. In the Dataset name field, enter the name you want to give your dataset, HDFS_dataset in this example.
  4. In the User name field, enter the name of the Linux user on the cluster.

    This user must have read permission on the file that you want to import.

  5. For this example, leave the Use Kerberos check box unselected.

    If you choose to authenticate via Kerberos, enter your principal and the path to your keytab file.

    The keytab file must be accessible by the Spark Job Server.

    You can manually configure Talend Data Preparation to display a default value in those fields.

  6. In the Format field, select the format in which your data was stored in the cluster, .csv in this case.
  7. In the Path field, enter the complete URL of your file in the Hadoop cluster.
  8. Click Add Dataset.
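
The values entered in the form above can be checked before they are submitted. The following Python sketch is only an illustration and is not part of Talend Data Preparation: the HdfsDatasetForm class, the validate function, the etl_user name, and the example URL are all hypothetical, and the check assumes that a complete URL uses the hdfs:// scheme with a host and a file path.

```python
from dataclasses import dataclass
from typing import List, Optional
from urllib.parse import urlparse


@dataclass
class HdfsDatasetForm:
    """Hypothetical model of the form fields; not a Talend API."""
    dataset_name: str
    user_name: str
    file_format: str
    path: str
    use_kerberos: bool = False
    principal: Optional[str] = None
    keytab_path: Optional[str] = None


def validate(form: HdfsDatasetForm) -> List[str]:
    """Return a list of problems; an empty list means the values look complete."""
    problems = []
    if not form.dataset_name.strip():
        problems.append("Dataset name is empty")
    if not form.user_name.strip():
        problems.append("User name is empty")
    # Kerberos authentication needs both a principal and a keytab path.
    if form.use_kerberos and not (form.principal and form.keytab_path):
        problems.append("Kerberos requires a principal and a keytab path")
    # Assumption: a complete URL has the hdfs:// scheme, a host, and a file path.
    url = urlparse(form.path)
    if url.scheme != "hdfs" or not url.netloc or url.path in ("", "/"):
        problems.append("Path is not a complete hdfs:// URL")
    return problems


example = HdfsDatasetForm(
    dataset_name="HDFS_dataset",
    user_name="etl_user",  # hypothetical Linux user on the cluster
    file_format="csv",
    path="hdfs://namenode.example.com:8020/data/customers.csv",  # hypothetical URL
)
print(validate(example))
```

A check of this kind only catches incomplete input. It cannot confirm that the user has read permission on the file or that the keytab file is accessible by the Spark Job Server.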

Results

The data extracted from the cluster opens directly in the grid and you can start working on your preparation.

The data remains stored in the cluster and does not leave it. Talend Data Preparation only retrieves a sample on demand.

Your dataset is now available in the Datasets view of the application home page.
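
The idea of retrieving only a sample, mentioned above, can be illustrated with a short Python sketch. This is not how Talend Data Preparation is implemented: the sample_rows function is hypothetical, and an in-memory stream stands in for the file on the cluster. It only shows that a reader can stop after a fixed number of rows instead of loading the whole file.

```python
import csv
import io
from itertools import islice
from typing import Iterable, List


def sample_rows(lines: Iterable[str], limit: int) -> List[List[str]]:
    """Parse at most `limit` CSV rows and stop, without parsing the rest."""
    return list(islice(csv.reader(lines), limit))


# Stand-in for a remote file: an in-memory stream, not a real HDFS read.
remote_file = io.StringIO("id,name\n1,Ada\n2,Grace\n3,Linus\n")
print(sample_rows(remote_file, 2))
```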