Configuring the input data

The tFileInputDelimited components are configured to load data from DBFS into the Job.

Before you begin

The source files, movies.csv and directors.txt, have been uploaded into DBFS as explained in Uploading files to DBFS (Databricks File System).
The metadata of the movies.csv file has been set up under the File delimited node in the Repository.

If you have not done so, see Preparing the movies metadata to create the metadata.

Procedure

Expand the File delimited node under the Metadata node in the Repository, then expand the movies file connection node and its child node to display the movies schema metadata node.
Double-click this schema metadata node to open its wizard.
Click the button to export the schema to a local directory.
Double-click the movie tFileInputDelimited component to open its Component view.
Ensure that the Define a storage configuration component check box is clear. This allows the component to read data directly from the file system of the Spark cluster, which is to be defined later in the Spark configuration tab. In this scenario, this file system is DBFS.
Click Edit schema to open the schema editor, then click the button to import the schema of the movie data that you exported previously from the File delimited metadata in the Repository.
In the Folder/File field, enter the path pointing to the movie data stored in DBFS.
In the Header field, enter 1 without any quotation marks. This allows the component to recognize the first row of the data as the header row.
Double-click the director tFileInputDelimited component to open its Component view.
Ensure that the Define a storage configuration component check box is clear for the same reason as explained in the previous steps.
Click the [...] button next to Edit schema to open the schema editor.
Click the [+] button twice to add two rows and in the Column column, rename them to ID and Name, respectively.
Click OK to validate these changes and accept the propagation prompted by the pop-up dialog box.
In the Folder/File field, enter the directory where the director data is stored. As explained in Uploading files to DBFS (Databricks File System), this data has been written to /FileStore/ychen/movie_library/directors.txt.
In the Field separator field, enter a comma (,) as this is the separator used by the director data.

Results

The input components are now configured to load the movie data and the director data to the Job.
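
The effect of the Header and Field separator settings can be illustrated outside the Studio with a minimal plain-Python sketch. This is not the code that the Job generates; it only mimics the behavior described above. The read_delimited helper, the sample rows, the movie column names, and the use of a comma as the movie separator are all assumptions made for illustration. Only the ID and Name columns of the director schema come from this procedure.

```python
import csv
import io


def read_delimited(text, columns, field_separator, header=0):
    """Parse delimited text, skip `header` leading rows, and map each
    remaining row onto the schema column names (hypothetical helper)."""
    reader = csv.reader(io.StringIO(text), delimiter=field_separator)
    rows = list(reader)[header:]  # Header = 1 skips the first row
    return [dict(zip(columns, row)) for row in rows if row]


# Hypothetical sample content; the real files are stored in DBFS.
directors_text = "1,Director A\n2,Director B\n"
movies_text = "movieID,title,directorID\n10,Movie X,1\n11,Movie Y,2\n"

# Director data: schema ID and Name, comma separator, no header row.
directors = read_delimited(directors_text, ["ID", "Name"], field_separator=",")

# Movie data: first row treated as a header and skipped (Header = 1).
# The column names and the comma separator here are assumptions.
movies = read_delimited(
    movies_text,
    ["movieID", "title", "directorID"],
    field_separator=",",
    header=1,
)

print(directors)
print(movies)
```

In this sketch, the schema supplies the column names and the header row is simply skipped, which is why the director data, which has no header row, is read with header left at 0.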
