You can access data stored on HDFS (Hadoop Distributed File System) directly from the
Talend Data Preparation
interface and import it as a dataset.
Procedure
-
In the Datasets view of the Talend Data Preparation homepage,
click the white arrow next to the Add Dataset
button.
-
Select HDFS.
The Add an HDFS dataset form opens.
-
In the Dataset name field, enter the name you want to
give your dataset.
-
In the User name field, enter your Linux user name.
This user must have read permissions on the file that you want to import; a quick way
to verify this is sketched below.
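A minimal sketch of that permission check, assuming the hdfs command-line client is
available on a machine with access to the cluster; the path is only a placeholder to
replace with your own file:

# Sketch: list the file on HDFS to see its owner and permission bits.
# Run this as the same Linux user that you entered in the User name field.
import subprocess

path = "/user/jdoe/datasets/clients.csv"  # placeholder file location
subprocess.run(["hdfs", "dfs", "-ls", path], check=True)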
-
To enable Kerberos authentication, select the Use
Kerberos check box.
-
In the Principal field, enter your Kerberos principal.
-
In the Keytab file
field, enter the location of your keytab file.
You can manually configure Talend Data Preparation to
display default values in these fields.
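If you want to verify the principal and keytab pair before filling in these fields,
here is a minimal sketch, assuming the kinit and klist tools are installed and using
placeholder values:

# Sketch: obtain a Kerberos ticket with the keytab to confirm the pair is valid.
import subprocess

principal = "tdp_user@EXAMPLE.COM"            # placeholder principal
keytab = "/etc/security/keytabs/tdp.keytab"   # placeholder keytab location

subprocess.run(["kinit", "-kt", keytab, principal], check=True)
print(subprocess.run(["klist"], capture_output=True, text=True).stdout)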
-
In the Format field, select the format that corresponds
to the file that you want to import.
For HDFS files,
Talend Data Preparation supports
CSV,
AVRO and
PARQUET.
Warning:
Talend Data Preparation
does not support the import of PARQUET
files that contain data with the INT96
type. We recommend adjusting your source file if that is the case.
If you choose CSV,
select the record delimiter, field delimiter, text enclosure, escape character, and
encoding of the file that you want to import.
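To check whether a PARQUET source file contains INT96 columns before importing it,
here is a minimal sketch using the pyarrow library; the file name is a placeholder:

# Sketch: list columns stored with the INT96 physical type (typically legacy timestamps).
import pyarrow.parquet as pq

path = "source_file.parquet"  # placeholder; point this at a copy of your file
schema = pq.ParquetFile(path).schema  # low-level Parquet schema keeps physical types

int96_columns = [name for i, name in enumerate(schema.names)
                 if schema.column(i).physical_type == "INT96"]
print(int96_columns)  # adjust these columns in the source file before importing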
-
In the Path field, enter the complete URL of your file
in the Hadoop cluster.
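The complete URL is typically of the form hdfs://&lt;namenode-host&gt;:&lt;port&gt;/path/to/file.
As a minimal sketch, you can confirm that the file is reachable at that location,
assuming pyarrow is built with HDFS support (libhdfs) and using placeholder connection
details:

# Sketch: check that the file referenced by the URL exists and report its size.
from pyarrow import fs

# Placeholder connection details; they must match the host and port in the URL.
hdfs = fs.HadoopFileSystem(host="namenode.example.com", port=8020, user="jdoe")
info = hdfs.get_file_info("/user/jdoe/datasets/clients.csv")
print(info.type, info.size)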
-
Click the Add Dataset button.
Results
The data extracted from the cluster opens directly in the grid, and you can start
working on your preparation.
The data remains stored in the cluster and never leaves it; Talend Data Preparation only
retrieves a sample on demand.
Your dataset is now available in the Datasets view of the
application home page.