Configuring how to read the input data - 8.0

Talend Open Studio for Big Data Getting Started Guide

The DBFS components and the two tFileInputDelimited components are configured to load data from DBFS into the Job.

Before you begin

Procedure

  1. Double-click tDBFSConnection to open its Component view.
  2. In the Endpoint field, enter the URL of your Azure Databricks workspace. You can find this URL in the Overview blade of your Databricks workspace page in the Azure portal. For example, the URL could look like https://adb-$workspaceId.$random.azuredatabricks.net.
  3. Click the [...] button next to the Token field to enter the authentication token generated for your Databricks user account. You can generate or find this token on the User settings page of your Databricks workspace. For further information, see Personal access tokens in the Azure documentation.
  4. Double-click tDBFSGet to open its Component view.
  5. Select Use an existing connection to use the connection information defined in tDBFSConnection.
  6. In the DBFS directory field, enter the path to the directory in DBFS in which the files about movies and their directors are stored.
  7. In the Local directory field, enter the path to, or browse to, the folder in which the files downloaded from DBFS are to be stored.

    This directory is created on the fly if it does not exist.

  8. From the Overwrite file drop-down list, select always to overwrite the files if they already exist in the target directory in the local file system.
  9. In the Files table, add one row by clicking the [+] button in order to define the criteria to select the files to be copied.
  10. In the Filemask column, enter an asterisk (*) within the double quotation marks to make tDBFSGet select all the files stored in the folder you specified in the DBFS directory field.
  11. Leave the New name column empty, that is to say, keep the default double quotation marks as they are, so that the file names remain unchanged after the files are downloaded.
  12. Expand the File delimited node under the Metadata node in the Repository to display the movies schema metadata node you have set up as explained in Preparing the movies metadata.
  13. Drop this schema metadata node onto the movie tFileInputDelimited component in the workspace of the Job.
  14. Double-click the movie tFileInputDelimited component to open its Component view.

    This tFileInputDelimited has automatically reused the movie metadata from the Repository to define the related parameters in its Basic settings view.

  15. Click the File name/Stream field to open the Edit parameter using repository dialog box to update the file location.
    This tFileInputDelimited is reusing the default file location that you defined for the File delimited metadata. You need to change it to read the movie file from the directory to which this file is downloaded from DBFS.
  16. Select Change to built-in property and click OK to validate your choice.
    The File name/Stream field becomes editable.
  17. Enter the directory where the movie file downloaded from DBFS is stored.
  18. Double-click the director tFileInputDelimited component to open its Component view.
  19. Click the [...] button next to Edit schema to open the schema editor.
  20. Click the [+] button twice to add two rows and, in the Column column, rename them to ID and Name, respectively.
  21. Click OK to validate these changes and accept the propagation prompted by the pop-up dialog box.
  22. In the File name/Stream field, enter the directory where the data about the movie directors is stored.
  23. In the Field separator field, enter a comma (,) within double quotation marks.
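
The effect of the Filemask value entered for tDBFSGet can be pictured with a small Python sketch. It assumes that the filemask behaves like a shell-style wildcard pattern, which is an assumption about the component and not taken from its documentation. The file names and the select_files helper are hypothetical and only serve as an illustration.

```python
from fnmatch import fnmatchcase


def select_files(names, filemask):
    """Return the names that match a shell-style filemask, keeping their order.

    Hypothetical helper: it only illustrates wildcard matching and is not
    how tDBFSGet is implemented.
    """
    return [name for name in names if fnmatchcase(name, filemask)]


# Hypothetical content of the DBFS directory; the real file names may differ.
dbfs_listing = ["movies.csv", "directors.txt"]

all_files = select_files(dbfs_listing, "*")      # an asterisk keeps every file
csv_only = select_files(dbfs_listing, "*.csv")   # a narrower mask keeps fewer

print(all_files)
print(csv_only)
```

With an asterisk as the mask, every file in the listing is kept, which corresponds to downloading the whole DBFS directory as in this scenario.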

Results

The tFileInputDelimited components are now configured to load the movie data and the director data into the Job.
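
As a rough illustration of what the director tFileInputDelimited settings describe, the following Python sketch reads comma-separated rows into the two schema columns ID and Name. It is only a sketch under stated assumptions: the sample rows and the read_delimited helper are hypothetical, and the code does not reproduce what Talend Studio generates for the Job.

```python
import csv
import io


def read_delimited(stream, columns, field_separator=","):
    """Parse delimited text and map each row onto the given column names.

    Hypothetical helper that mirrors the Edit schema and Field separator
    settings; empty lines are skipped.
    """
    reader = csv.reader(stream, delimiter=field_separator)
    return [dict(zip(columns, row)) for row in reader if row]


# Hypothetical sample standing in for the downloaded director file.
sample = io.StringIO("1,Jane Doe\n2,John Roe\n")

rows = read_delimited(sample, ["ID", "Name"])
for row in rows:
    print(row["ID"], row["Name"])
```

Changing field_separator to another character, such as a semicolon, corresponds to entering a different value in the Field separator field.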