Before you begin
The source files, movies.csv and directors.txt, have been uploaded to DBFS as explained in Uploading files to DBFS (Databricks File System).
The metadata of the movies.csv file has been set up under the File delimited node in the Repository.
If you have not done so, see Preparing the movies metadata to create the metadata.
Double-click tDBFSConnection to open its Component view.
- In the Endpoint field, enter the URL address of your Azure Databricks workspace. This URL can be found in the Overview blade of your Databricks workspace page on your Azure portal. For example, this URL could look like https://adb-$workspaceId.$random.azuredatabricks.net.
- Click the [...] button next to the Token field to enter the authentication token generated for your Databricks user account. You can generate or find this token on the User settings page of your Databricks workspace. For further information, see Personal access tokens from the Azure documentation.
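The Endpoint and Token pair configured above is the same information needed to call the DBFS REST API directly. The following sketch shows how the two values combine into an authenticated request; it is an illustration of the credentials, not of tDBFSConnection's internals, and the workspace URL and token shown are placeholders.

```python
def build_dbfs_list_request(endpoint: str, token: str, path: str = "/"):
    """Build the URL, headers, and parameters for a DBFS 'list' API call."""
    # The DBFS API lives under /api/2.0/dbfs on the workspace endpoint.
    url = endpoint.rstrip("/") + "/api/2.0/dbfs/list"
    # The personal access token is sent as a Bearer token.
    headers = {"Authorization": "Bearer " + token}
    params = {"path": path}
    return url, headers, params

# Placeholder workspace URL and token, in the same shape as the real ones.
url, headers, params = build_dbfs_list_request(
    "https://adb-1234567890123456.7.azuredatabricks.net",
    "dapiXXXXXXXXXXXXXXXX",
)
print(url)  # https://adb-1234567890123456.7.azuredatabricks.net/api/2.0/dbfs/list
```

If this request succeeds against your workspace, the same Endpoint and Token values will work in tDBFSConnection.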
Double-click tDBFSGet to open its Component view.
- Select Use an existing connection to use the connection information defined in tDBFSConnection.
- In the DBFS directory field, enter the path to the directory in DBFS in which the files about movies and their directors are stored.
- In the Local directory field, enter the path to, or browse to, the folder in which the files downloaded from DBFS are to be stored. This directory is created on the fly if it does not exist.
- From the Overwrite file drop-down list, select always to overwrite the files if they already exist in the target directory in the local file system.
- In the Files table, add one row by clicking the [+] button in order to define the criteria to select the files to be copied.
- In the Filemask column, enter an asterisk (*) within the double quotation marks to make tDBFSGet select all the files stored in the folder you specified in the DBFS directory field.
- Leave the New name column empty, that is to say, keep the default double quotation marks as is, so as to keep the names of the files unchanged after they are downloaded.
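The tDBFSGet settings above boil down to: select files by mask, create the local directory if needed, keep the original names, and overwrite existing copies. A minimal sketch of that selection logic, with hypothetical DBFS paths and a hypothetical local directory standing in for your own values:

```python
import fnmatch
import os

def plan_downloads(dbfs_files, filemask, local_dir, new_names=None):
    """Select DBFS files matching the mask and map them to local paths.

    Mirrors the tDBFSGet Files table: a filemask of "*" selects every
    file, and an empty new name keeps the original file name.
    """
    # Like tDBFSGet, create the local directory on the fly if missing.
    os.makedirs(local_dir, exist_ok=True)
    selected = [f for f in dbfs_files if fnmatch.fnmatch(f, filemask)]
    plan = {}
    for f in selected:
        # An empty new name means: reuse the original base name.
        name = (new_names or {}).get(f, "") or os.path.basename(f)
        # With Overwrite file set to "always", this target is replaced
        # if it already exists.
        plan[f] = os.path.join(local_dir, name)
    return plan

# Hypothetical DBFS paths and local directory, for illustration only.
plan = plan_downloads(
    ["/user/talend/movies.csv", "/user/talend/directors.txt"],
    "*",
    "/tmp/movie_library",
)
```

With the mask "*", both files end up mapped to the local directory under their original names, which is exactly what the Files table configuration above produces.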
- Expand the File delimited node under the Metadata node in the Repository to display the movies schema metadata node you have set up as explained in Preparing the movies metadata.
- Drop this schema metadata node onto the movie tFileInputDelimited component in the workspace of the Job.
Double-click the movie component to open its Component view.
This tFileInputDelimited has automatically reused the movie metadata from the Repository to define the related parameters in its Basic settings view.
- Click the File name/Stream field to open the Edit parameter using repository dialog box and update the file location.
This tFileInputDelimited is reusing the default file location which you have defined for the File delimited metadata. You need to change it to read the movie file from the directory in which this file is downloaded from DBFS.
- Select Change to built-in property and click OK to validate your choice.
The File name/Stream field becomes editable.
- Enter the directory where the movie file downloaded from DBFS is stored.
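Once the built-in path points at the downloaded file, the movie component simply reads a delimited file according to the Repository metadata. A small sketch of that read, using hypothetical sample data and a hypothetical ";" separator (your actual separator and columns come from the movies metadata):

```python
import csv
import io

# Hypothetical sample standing in for the downloaded movies.csv content;
# the real separator and schema are defined by the movies metadata.
sample = "1;Psycho;1960\n2;Vertigo;1958\n"
rows = list(csv.reader(io.StringIO(sample), delimiter=";"))
print(rows[0])  # ['1', 'Psycho', '1960']
```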
Double-click the director component to open its Component view.
- Click the [...] button next to Edit schema to open the schema editor.
- Click the [+] button twice to add two rows and, in the Column column, rename them to ID and Name, respectively.
- Click OK to validate these changes and accept the propagation prompted by the pop-up dialog box.
- In the File name/Stream field, enter the directory where the data about the movie directors is stored.
- In the Field separator field, enter a comma (,) within double quotation marks.
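With the two-column schema (ID, Name) and the comma separator in place, the director component produces rows shaped like the following sketch. The sample content is hypothetical; it only mirrors the structure the directors.txt file is expected to have:

```python
import csv
import io

# Hypothetical sample standing in for directors.txt: two comma-separated
# columns matching the schema defined above, ID and Name.
sample = "1,Alfred Hitchcock\n2,Stanley Kubrick\n"
directors = [
    {"ID": row[0], "Name": row[1]}
    for row in csv.reader(io.StringIO(sample), delimiter=",")
]
print(directors[0])  # {'ID': '1', 'Name': 'Alfred Hitchcock'}
```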
The tFileInputDelimited components are now configured to load the movie data and the director data to the Job.