Exporting your preparation to the cluster

Preparing an HDFS-based dataset

Now that you are done preparing your data, you will export it back to the cluster, this time as a Parquet file.

Note that the cluster to which you export your cleansed data must be the same cluster from which you originally imported the data.

Procedure

  1. Click the Export button in the application header bar.
  2. Select the All data radio button so that the whole dataset is processed, and not just the sample you worked on.
  3. Select the HDFS file radio button to export your data to the Hadoop cluster.

  4. Select the Parquet format.
  5. In the Output path field, enter the complete URL of the location on the cluster where you want to save the exported file.

    You can manually configure Talend Data Preparation to display a default value in the Output path field.

  6. Select Specified Kerberos as the authentication method.
  7. Specify your principal and the path to your keytab file.

    If you choose Default Kerberos instead, the principal and the keytab file path are the ones entered in the Talend Data Preparation configuration file.

    In either case, the path must point to a keytab file that is accessible to all the workers on the cluster.

    If you are not using Kerberos, select Simple authentication.

  8. Click Confirm.

    Your export starts in the background and is processed directly on the cluster.

    Note that actions that affect only a single row or individual cells are skipped during the export process. If your preparation contains such actions, a warning is displayed before the export starts.

  9. Click the Export history button in the application header bar to check the status of the export.

    Among other information, the export history shows whether the export was successful.
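
Step 5 asks for a complete URL rather than a bare path. As a rough illustration of what "complete" means here, the following Python sketch checks that a value has a scheme, a host, and a file path before you paste it into the Output path field. It assumes that the cluster is addressed with an hdfs:// URL; the host name, port, and path are hypothetical, and check_output_path is not part of Talend Data Preparation.

```python
from urllib.parse import urlparse


def check_output_path(url: str) -> str:
    """Return the file path part of a complete HDFS URL, or raise ValueError.

    Assumption: the cluster is addressed as hdfs://<host>[:<port>]/<path>.
    """
    parts = urlparse(url)
    if parts.scheme != "hdfs":
        raise ValueError(f"expected an hdfs:// URL, got scheme {parts.scheme!r}")
    if not parts.hostname:
        raise ValueError("the URL must name a host on the cluster")
    if not parts.path or parts.path == "/":
        raise ValueError("the URL must include a file path on the cluster")
    return parts.path


# Hypothetical host, port, and path, for illustration only.
print(check_output_path("hdfs://namenode.example.com:8020/user/demo/cleansed.parquet"))
```

A value such as /user/demo/cleansed.parquet, with no scheme or host, is rejected by this check because it is a path and not a complete URL.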

Results

Your data has been processed and saved as a Parquet file, without leaving the cluster.
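
If you want a quick sanity check of the result outside the application, one option is to copy the exported file to a local machine (for example with hdfs dfs -get) and confirm that it carries the Parquet marker bytes. The sketch below relies on the fact that an unencrypted Parquet file starts and ends with the 4-byte sequence PAR1. The function name and the local file name are hypothetical, and the check only inspects the markers, not the data itself.

```python
from pathlib import Path

PARQUET_MAGIC = b"PAR1"  # marker at the start and end of an unencrypted Parquet file


def looks_like_parquet(path) -> bool:
    """Return True if the local file starts and ends with the Parquet marker."""
    data_path = Path(path)
    # Leading marker + 4-byte footer length + trailing marker = 12 bytes minimum.
    if data_path.stat().st_size < 12:
        return False
    with data_path.open("rb") as handle:
        head = handle.read(4)
        handle.seek(-4, 2)  # move to 4 bytes before the end of the file
        tail = handle.read(4)
    return head == PARQUET_MAGIC and tail == PARQUET_MAGIC


if __name__ == "__main__":
    # Hypothetical local copy of the exported file.
    local_copy = Path("cleansed.parquet")
    if local_copy.exists():
        print(looks_like_parquet(local_copy))
```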