When you have finished preparing a dataset extracted from HDFS, you can export it directly back to the cluster or download it as a local file.
Note that the cluster to which you export your cleansed data must be the same cluster from which you originally imported the data.
- Click the Export button in the application header.
- If the result of your preparation is larger than your current sample size (10,000 rows by default), select an export option:
- If you select Current sample, only the sample you have been working on is exported, as a local CSV, XLSX or Tableau file.
- If you select All data, all the preparation steps you have performed on your sample are also applied to the rest of the dataset, and the HDFS export is enabled.
- Select HDFS file.
- In the Format field, select the output format.
For HDFS files, Talend Data Preparation supports CSV, AVRO and PARQUET.
If you choose CSV, select the delimiter to use for the output file.
- In the Path field, enter the complete URL of your preferred location on the cluster to save the exported file.
- If you chose to authenticate via Kerberos, enter your principal and the path to your keytab file.
The path must point to a keytab file that is accessible to all the workers on the cluster.
- Click Confirm.
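Because the Path field expects a complete URL rather than a bare file path, it can help to check the value before you paste it in. The following is a minimal sketch only: it assumes the location uses the `hdfs://` scheme with a NameNode host, which this page does not state explicitly, and the function name, host name, port and file path are all hypothetical placeholders.

```python
from urllib.parse import urlparse


def check_hdfs_export_path(url: str) -> list:
    """Return a list of problems found in an export URL (empty if none).

    Assumption: a complete URL looks like hdfs://<host>[:<port>]/<path>.
    """
    problems = []
    parts = urlparse(url)
    if parts.scheme != "hdfs":
        problems.append("scheme should be 'hdfs'")
    if not parts.hostname:
        problems.append("missing NameNode host")
    if not parts.path or parts.path == "/":
        problems.append("missing file path on the cluster")
    return problems


# Placeholder values, not taken from the product documentation.
print(check_hdfs_export_path("hdfs://namenode.example.com:8020/user/prep/output.csv"))
print(check_hdfs_export_path("/user/prep/output.csv"))
```

The first call reports no problems, while the second reports a missing scheme and host, which is the typical mistake when only the file path is entered.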
Note that actions in a preparation that only affect a single row or individual cells are skipped during the export process. For example, the Make as header and Delete Row functions do not work in a Big Data context. If your preparation contains such actions, a warning is displayed before the export.
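To illustrate the idea, here is a small sketch of how a list of steps could be split into those applied to the full dataset and those skipped. It is not how Talend Data Preparation implements this: the `Action` class, the `scope` values and the first action name are hypothetical and serve only to show the principle.

```python
from dataclasses import dataclass


@dataclass
class Action:
    name: str
    scope: str  # hypothetical values: "column", "dataset", "row", "cell"


def split_for_full_export(actions):
    """Separate actions that can run on the full dataset from those skipped."""
    skipped_scopes = {"row", "cell"}
    applied = [a for a in actions if a.scope not in skipped_scopes]
    skipped = [a for a in actions if a.scope in skipped_scopes]
    return applied, skipped


steps = [
    Action("Change to upper case", "column"),  # illustrative column-level step
    Action("Delete Row", "row"),
    Action("Make as header", "row"),
]
applied, skipped = split_for_full_export(steps)
if skipped:
    # Mirrors the warning shown before the export.
    print("Warning: skipped during export:", ", ".join(a.name for a in skipped))
```

Row-level and cell-level steps depend on a specific position in the sample, which has no stable equivalent once the data is processed across the cluster, so only the column-level step is kept here.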
If you chose to export your sample as a local file, the download of the output file starts immediately.
In the case of a full export, whether as a local file or to the cluster, the export operation runs in the background. You can check the status of the export and download your output file on the Export history page. For more information, see The export history page.
The whole operation is processed directly on the Hadoop cluster.
The export process triggers a refresh of the data fetched from the cluster, guaranteeing that the data in the output is always up to date.