Exporting a preparation made on an HDFS dataset

When you have finished preparing a dataset extracted from HDFS, you can export it directly back to the cluster or download it as a local file.

Note that the cluster to which you export your cleansed data must be the same cluster from which you originally imported the data.

Procedure

Click the Export button in the application header bar.
If the result of your preparation is larger than your current sample size, 10 000 rows by default, select an export option:
- If you select Current sample, only the sample you have been working on is exported, as a local file: file with separator, XLSX, or Tableau file.
- If you select All data, all the preparation steps you have performed on your sample are also applied to the rest of the dataset, and the HDFS export is enabled.
Select HDFS.
In the Format field, select the output format for your data.

For HDFS files, Talend Data Preparation supports files with separator, AVRO and PARQUET.

If you choose file with separator, select the delimiter to use for the output file.
In the Output path field, enter the complete URL of the location on the cluster where you want to save the exported file.
If you chose to authenticate via a custom keytab, enter your principal and the path to your keytab file.

The path must point to a keytab file that is accessible to all the workers on the cluster.
Click Confirm.
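
The Output path field expects a complete URL rather than a bare path. As a minimal sketch, assuming that a complete URL uses the hdfs scheme followed by a host name and an absolute file path, a value could be checked before it is entered. The function name, the host, and the path below are hypothetical and are not part of Talend Data Preparation.

```python
from urllib.parse import urlparse


def is_complete_hdfs_url(output_path: str) -> bool:
    """Return True if output_path looks like a complete HDFS URL.

    Assumption: a complete URL has the hdfs scheme, a host name,
    and an absolute path that is more than just "/".
    """
    parsed = urlparse(output_path)
    return (
        parsed.scheme == "hdfs"
        and bool(parsed.hostname)
        and parsed.path.startswith("/")
        and len(parsed.path) > 1
    )


# Hypothetical host and path, for illustration only.
print(is_complete_hdfs_url("hdfs://namenode.example.com:8020/user/talend/export.csv"))
# A bare path without scheme and host is rejected by this check.
print(is_complete_hdfs_url("/user/talend/export.csv"))
```

This check only looks at the shape of the URL. It does not verify that the location exists on the cluster or that you have write access to it.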

Note that if a preparation contains actions that only affect a single row or individual cells, those actions are skipped during the export process. For example, the Make as header and Delete Row functions do not work in a Big Data context. A warning is displayed before the export if your preparation contains such actions.
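
The skipping behavior described in this note can be pictured as a filter over the preparation steps. The sketch below is illustrative only: the Step type, its scope field, the scope values, and the sample column action are hypothetical names chosen for the example, not the Talend Data Preparation data model.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Step:
    name: str
    scope: str  # hypothetical values: "column", "row", "cell"


def split_for_full_export(steps):
    """Separate steps applied to the whole dataset from those skipped."""
    skipped_scopes = {"row", "cell"}
    applied = [s for s in steps if s.scope not in skipped_scopes]
    skipped = [s for s in steps if s.scope in skipped_scopes]
    return applied, skipped


# Hypothetical preparation mixing a column action with row actions.
steps = [
    Step("Change to upper case", "column"),
    Step("Delete Row", "row"),
    Step("Make as header", "row"),
]
applied, skipped = split_for_full_export(steps)
if skipped:
    # Mirrors the idea of warning before the export starts.
    print("Warning, skipped actions:", ", ".join(s.name for s in skipped))
```

In this model, only the column-level step is applied to the full dataset, and the two row-level steps are reported up front instead of being dropped silently.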

Results

If you chose to export your sample as a local file, the download of the output file starts immediately.

In the case of a full export, whether as a local file or to the cluster, the export operation starts in the background. You can check the status of the export and download your output file on the Export history page. For more information, see The export history page.

The whole operation is processed directly on the Hadoop cluster.

The export process triggers a refresh of the data fetched from the cluster, so the output always reflects the data as it is on the cluster at export time.
