The DBFS components and the two
tFileInputDelimited components are configured to load data from
DBFS into the Job.
Procedure
-
Double-click tDBFSConnection to open its Component view.
-
In the Endpoint field, enter the URL address of your Azure Databricks workspace. This URL can be found in the Overview blade of your Databricks workspace page on your Azure portal. For example, this URL could look like https://adb-$workspaceId.$random.azuredatabricks.net.
-
Click the [...] button next to the Token field to enter the authentication token generated for your Databricks user account. You can generate or find this token on the User settings page of your Databricks workspace. For further information, see Personal access tokens from the Azure documentation.
-
Double-click tDBFSGet to open its Component view.
-
Select Use an existing connection to use the connection information defined in tDBFSConnection.
-
In the DBFS directory field, enter the path to the directory in
DBFS in which the files about movies and their directors are stored.
-
In the Local directory field,
enter the path, or browse to the folder, in which the files downloaded from DBFS
are to be stored.
This directory is created on the fly if it does
not exist.
-
From the Overwrite file drop-down list, select always to overwrite the files if they already
exist in the target directory in the local file system.
-
In the Files table, add one
row by clicking the [+] button in order to
define the criteria to select the files to be copied.
-
In the Filemask column, enter an asterisk (*) within the double
quotation marks to make tDBFSGet select all
the files stored in the folder you specified in the DBFS directory field.
-
Leave the New name column
empty, that is to say, keep the default double quotation marks as is, so that
the names of the files remain unchanged after being downloaded.
-
Expand the File delimited node under the Metadata node in the Repository to display the movies schema metadata node you have set up as
explained in Preparing the movies metadata.
-
Drop this schema metadata node onto the movie
tFileInputDelimited
component in the workspace of the Job.
-
Double-click the movie
tFileInputDelimited
component to open its Component view.
This tFileInputDelimited has automatically reused the movie
metadata from the Repository to define the related parameters in its
Basic settings
view.
-
Click the File name/Stream
field to open the Edit parameter using
repository dialog box to update the file location.
This tFileInputDelimited
is reusing the default file location which you have defined for the
File delimited metadata. You need to change it to
read the movie file from the directory to which this
file is downloaded from DBFS.
-
Select Change to built-in
property and click OK to validate your choice.
The File name/Stream
field becomes editable.
-
Enter the directory where the movie file downloaded
from DBFS is stored.
-
Double-click the director
tFileInputDelimited
component to open its Component view.
-
Click the [...] button next to
Edit schema to open the schema
editor.
-
Click the [+] button twice to
add two rows and in the Column column,
rename them to ID and Name, respectively.
-
Click OK to validate these
changes and accept the propagation prompted by the pop-up dialog box.
-
In the File name/Stream field, enter the directory where the data
about the movie directors is stored.
-
In the Field separator field, enter a comma (,) within double
quotation marks.
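The connection and download settings above can be pictured outside the Studio with a minimal Python sketch. It is an illustration only and does not claim to show how tDBFSConnection or tDBFSGet work internally. It assumes the Databricks REST API `dbfs/list` endpoint with a bearer token, and the helper names `build_list_request` and `select_files`, the workspace URL, the token, and the file paths are all hypothetical. The request is built but never sent, so the sketch runs without a workspace.

```python
import fnmatch
import posixpath
import urllib.parse
import urllib.request


def build_list_request(endpoint, token, dbfs_dir):
    """Build (but do not send) a request that lists dbfs_dir.

    endpoint plays the role of the Endpoint field, token the role of the
    Token field, and dbfs_dir the role of the DBFS directory field.
    """
    query = urllib.parse.urlencode({"path": dbfs_dir})
    url = endpoint.rstrip("/") + "/api/2.0/dbfs/list?" + query
    return urllib.request.Request(
        url, headers={"Authorization": "Bearer " + token}
    )


def select_files(dbfs_paths, filemask="*"):
    """Keep the paths whose file name matches the mask.

    A mask of "*" keeps every file, like the Filemask column in the steps.
    """
    return [
        path
        for path in dbfs_paths
        if fnmatch.fnmatchcase(posixpath.basename(path), filemask)
    ]


# Hypothetical values, for illustration only.
request = build_list_request(
    "https://adb-example.azuredatabricks.net", "dummy-token", "/movies"
)
candidates = ["/movies/movies.csv", "/movies/directors.txt"]
selected = select_files(candidates, "*")
```

With the asterisk mask, `selected` contains both hypothetical paths; a narrower mask such as `"*.csv"` would keep only the first.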
Results
The tFileInputDelimited components are now configured to load the movie
data and the director data into the Job.
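As a rough analogy for what the director tFileInputDelimited is set up to do, the following Python sketch reads comma-separated rows into the two schema columns defined above, ID and Name. The function name `read_directors` and the sample rows are hypothetical, and the sketch assumes a file without a header row; it is not the code that the Job generates.

```python
import csv
import io


def read_directors(stream, field_separator=","):
    """Read delimited rows into dicts keyed by the schema columns ID and Name."""
    reader = csv.reader(stream, delimiter=field_separator)
    # Skip empty lines; each remaining row is expected to hold two fields.
    return [{"ID": row[0], "Name": row[1]} for row in reader if row]


# Made-up sample content standing in for the downloaded director file.
sample = io.StringIO("1,Jane Doe\n2,John Roe\n")
directors = read_directors(sample)
```

Passing a different `field_separator` mirrors changing the Field separator field in the component.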