You can access your data stored on HDFS (Hadoop Distributed File System) directly from the
Talend Data Preparation
interface and import it in the form of a dataset.
Procedure
-
In the Datasets view of the Talend Data Preparation
homepage, click the white arrow next to the Add Dataset
button.
-
Select HDFS.
The Add an HDFS dataset form opens.
-
In the Dataset name field, enter the name you want to
give your dataset, HDFS_dataset in this example.
-
In the User name field, enter the name of the Linux user
on the cluster.
This user must have read permissions on the file that you want to
import.
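For example, you can check that this user can read the file with the HDFS command-line client. This is only a sketch: the path below is a placeholder, not a value from your cluster.
    # Run on the cluster as the user entered in the form; the path is an example.
    hdfs dfs -ls /user/talend/customers.csv
    # The listing must grant read (r) permission to this user, its group, or others.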
-
For this example, leave the Use Kerberos check box
unselected.
If you choose to authenticate via Kerberos, enter your principal and the path
to your keytab file.
The keytab file must be accessible to the Spark Job Server.
You can manually configure Talend Data Preparation to display a
default value in those
fields.
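If you do authenticate via Kerberos, you can verify beforehand that the principal and keytab file match, for example with kinit. The principal and keytab path below are placeholders for illustration only.
    # Request a ticket using the keytab; a silent, successful return means the pair is valid.
    kinit -kt /etc/security/keytabs/talend.keytab talend-user@EXAMPLE.COM
    # Display the obtained ticket to confirm.
    klist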
-
In the Format field, select the format in which your
data is stored in the cluster, .csv in this case.
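As a reminder, a .csv file is plain text with one record per line and comma-separated fields, usually preceded by a header row, for example (column names invented for illustration):
    id,name,country
    1,Alice,France
    2,Bob,Germany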
-
In the Path field, enter the complete URL of your file
in the Hadoop cluster.
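A complete HDFS URL typically combines the hdfs:// scheme, the NameNode host and port, and the absolute path to the file. The value below is only an illustration; substitute your own NameNode address and file path.
    hdfs://namenode.example.com:8020/user/talend/customers.csv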
-
Click Add Dataset.
Results
The data extracted from the cluster opens directly in the grid and you can start
working on your preparation.
The data is still stored in the cluster and never leaves it; Talend Data Preparation only
retrieves a sample on demand.
Your dataset is now available in the Datasets view of the
application homepage.