In the Repository, setting up the metadata of a file stored in HDFS allows you to directly reuse its schema in a related Big Data component without having to define each related parameter manually.
Since the movies.csv file you need to process has been stored in the HDFS system being used, you can retrieve its schema to set up its metadata in the Repository.
The schema of the directors.txt file can also be retrieved, but it is intentionally left out of the retrieval procedure explained below because, in this scenario, the directors.txt file is used to demonstrate how to manually define a schema in a Job.
Before you begin
You have launched your Talend Studio and opened the Integration perspective.
The source files movies.csv and directors.txt have been uploaded into HDFS as explained in Uploading files to HDFS.
The connection to the Hadoop cluster to be used and the connection to the HDFS system of this cluster have been set up from the Hadoop cluster node in the Repository.
The Hadoop cluster to be used has been properly configured and is running, and you have the proper access permissions to that distribution and to the HDFS folder to be used.
Ensure that the client machine on which the Talend Studio is installed can recognize the host names of the nodes of the Hadoop cluster to be used. For this purpose, add the IP address/hostname mapping entries for the services of that Hadoop cluster in the hosts file of the client machine.
For example, if the host name of the Hadoop Namenode server is talend-cdh550.weave.local, and its IP address is 192.168.x.x, the mapping entry reads 192.168.x.x talend-cdh550.weave.local.
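The mapping check described above can be sketched in Python. This is only an illustrative helper, not part of Talend Studio: the function names parse_hosts and missing_hosts are hypothetical, the sample text stands in for the hosts file of the client machine, and 192.168.x.x is kept as the placeholder used in the example rather than a real address.

```python
def parse_hosts(text):
    """Return a {hostname: address} dict built from hosts-file text."""
    mapping = {}
    for raw in text.splitlines():
        # Drop trailing comments, then skip lines that are empty.
        line = raw.split("#", 1)[0].strip()
        if not line:
            continue
        # First field is the address; every following field is a host name.
        address, *names = line.split()
        for name in names:
            mapping[name] = address
    return mapping


def missing_hosts(text, required):
    """List the required host names that have no mapping entry."""
    known = parse_hosts(text)
    return [host for host in required if host not in known]


# Sample content only; 192.168.x.x is the placeholder from the text above.
SAMPLE_HOSTS = """
127.0.0.1 localhost
192.168.x.x talend-cdh550.weave.local  # Hadoop Namenode
"""

if __name__ == "__main__":
    print(missing_hosts(SAMPLE_HOSTS, ["talend-cdh550.weave.local"]))
```

To check a real client machine, you would pass the content of its hosts file and the host names of your cluster services instead of the sample values.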
- Expand the Hadoop cluster node under Metadata in the Repository tree view.
Expand the Hadoop connection you have created and then the HDFS folder under it.
In this example, it is the my_cdh Hadoop connection.
Right-click the HDFS connection in this HDFS folder and, from the contextual menu, select Retrieve schema.
In this scenario, this HDFS connection is named cdh_hdfs.
A Schema wizard is displayed, allowing you to browse to files in HDFS.
Expand the file tree to show the movies.csv file, from which you need to retrieve the schema, and select it.
In this scenario, the movies.csv file is stored in the following directory: /user/ychen/input_data.
Click Next to display the retrieved schema in the wizard.
The schema of the movie data is displayed in the wizard and the first row of the data is automatically used as the column names.
If the first row of your data is not used as the column names, review the Header configuration you set when creating the HDFS connection, as explained in Setting up connection to HDFS.
- Click Finish to validate these changes.
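The effect of the Header setting on the retrieved schema can be illustrated with a small Python sketch. It does not reproduce how the wizard works internally: preview_schema is a hypothetical helper, the sample rows and column names are invented because the content of movies.csv is not shown here, and the Column0, Column1 fallback names are a choice made for this sketch.

```python
import csv
import io

# Hypothetical sample; the actual columns of movies.csv are not given in this section.
SAMPLE = "movieID,title,releaseYear\n1,Example Movie,2000\n"


def preview_schema(text, header_rows=1, delimiter=","):
    """Return (column_names, data_rows) for delimited text.

    With header_rows >= 1, the last header row supplies the column names.
    With header_rows == 0, placeholder names are generated and every row is data.
    """
    rows = list(csv.reader(io.StringIO(text), delimiter=delimiter))
    if header_rows >= 1 and rows:
        names = rows[header_rows - 1]
        data = rows[header_rows:]
    else:
        width = len(rows[0]) if rows else 0
        names = [f"Column{i}" for i in range(width)]
        data = rows
    return names, data


if __name__ == "__main__":
    print(preview_schema(SAMPLE))                 # first row becomes the column names
    print(preview_schema(SAMPLE, header_rows=0))  # first row is treated as data
```

Comparing the two calls shows why a Header value that does not match the file leaves the first row in the data instead of in the schema.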