When you have finished preparing your dataset extracted from HDFS, you can export
it back directly to the cluster, or download it as a local file.
Note that the cluster to which you export your cleansed data must be the same cluster
from which you imported the data in the first place.
Procedure
-
Click the Export button in the application header
bar.
-
If the result of your preparation is larger than your current sample size (10,000
rows by default), select an export option:
- If you select Current sample, only the sample you
have been working on will be exported, as a local file with separator,
XLSX, or Tableau file.
- If you select All data, all the preparation
steps you have performed on your sample will be applied to the rest of the
dataset as well, and the HDFS export will be enabled.
-
Select HDFS.
-
In the Format field, select the output format for your
data.
For HDFS files, Talend Data Preparation supports files
with separator, AVRO and
PARQUET.
If you choose file with separator, select the delimiter to use for the output
file.
-
In the Output path field, enter the complete URL to your
preferred location on the cluster to save the exported file.
-
If you chose to authenticate via a custom keytab, enter your principal and the
path to your keytab file.
The path must point to a keytab file that is accessible
to all the workers on the cluster.
-
Click Confirm.
Note that if a preparation contains actions that only affect a single row or
cell, they will be skipped during the export process. The Make
as header or Delete Row functions, for
example, do not work in a Big Data context. A warning will be displayed
before the export if your preparation contains such actions.
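The output path entered in the procedure above is a complete HDFS URL. As a minimal sketch of how such a URL breaks down, here is a hypothetical example parsed with Python's standard library; the host name, port, and file path are assumptions for illustration, not values from the product:

```python
from urllib.parse import urlparse

# Hypothetical output path: scheme, NameNode host:port, and file location
# on the cluster are all made-up example values.
output_url = "hdfs://namenode.example.com:8020/user/talend/exports/cleansed_data.avro"

parts = urlparse(output_url)
print(parts.scheme)  # the file system scheme: hdfs
print(parts.netloc)  # the NameNode address: namenode.example.com:8020
print(parts.path)    # the location of the exported file on the cluster
```

Checking the URL this way before pasting it into the Output path field can catch a missing scheme or port early.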
Results
If you chose to export your sample as a local file, the download of the output file
starts directly.
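If the sample was exported as a file with separator, a quick way to check that the delimiter you selected was applied is to read the downloaded file back with Python's standard csv module. This is a minimal sketch assuming a semicolon delimiter and made-up column names; substitute the delimiter and file you actually exported:

```python
import csv
import io

# Simulated content of a "file with separator" export using a semicolon
# delimiter; the columns and values are illustrative, not real export data.
exported = "id;name;country\n1;Alice;FR\n2;Bob;US\n"

# DictReader splits each line on the chosen delimiter and maps values
# to the header row's column names.
reader = csv.DictReader(io.StringIO(exported), delimiter=";")
rows = list(reader)
print(rows[0]["name"])  # Alice
```

In practice you would open the downloaded file with `open(path, newline="")` instead of the in-memory string used here.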
In the case of a full export, whether it is as a local file or to the cluster, the
export operation starts in the background. You can check the status of the export,
and download your output file in the Export history page. For
more information, see The export history page.
The whole operation is processed directly on the Hadoop cluster.
The export process triggers a refresh of the data fetched from the cluster,
guaranteeing that the data displayed in the output is always up to date.