You orchestrate the Spark Batch components in the Job workspace to
design a data transformation process that runs in the Apache Spark Batch framework.
In the Job, enter the name of the component to be
used and select this component from the list that appears. In this scenario, the
components are two tFileInputDelimited components, a tMap component, two tFileOutputParquet components and a
tAzureFSConfiguration component.
- The tFileInputDelimited components load the movie data and the
director data, respectively, from the DBFS file system of your
Databricks Big Data platform into the data flow of the current Job.
- The tMap component transforms the input data.
- The tFileOutputParquet components write the results to a directory
in your Azure Data Lake Storage system.
- The tAzureFSConfiguration component provides the necessary
information to connect to your Azure Data Lake Storage system.
Double-click the label of one of the two tFileInputDelimited components to
make it editable, then enter movie to change the label of this component.
Do the same to label the other tFileInputDelimited component director.
Right-click the tFileInputDelimited component labelled movie, then from the
contextual menu, select Row > Main and click tMap to connect the component
to tMap. This is the main link through which the movie data is
sent to tMap.
Do the same to connect the director
tFileInputDelimited component to tMap using the Row > Main link. This is the Lookup link through which
the director data is sent to tMap as lookup data.
Do the same to connect the tMap component to one of the tFileOutputParquet
components using the Row > Main link, then
in the pop-up wizard, name this link out1 and click OK to validate this change.
Repeat these operations to connect the tMap component to the other
tFileOutputParquet component using the Row > Main link and name this link reject.
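The links created above can be pictured as a small data flow: a main input (movie), a lookup input (director), and two outputs (out1 and reject). The following plain-Python sketch only illustrates that topology; it is not what the Studio generates. The sample data, the column names (movieID, title, directorID, directorName), the join key, and the rule that unmatched movie rows go to reject are all assumptions made for illustration, since the tMap configuration is not covered in this section.

```python
import csv
import io

# Hypothetical sample data; the real files live in DBFS and their schema is not shown here.
MOVIES = "movieID;title;directorID\n1;Movie A;10\n2;Movie B;99\n"
DIRECTORS = "directorID;directorName\n10;Director X\n"


def read_delimited(text, sep=";"):
    """Rough stand-in for tFileInputDelimited: parse delimited text into dict rows."""
    return list(csv.DictReader(io.StringIO(text), delimiter=sep))


def map_rows(main_rows, lookup_rows, key):
    """Rough stand-in for tMap: join the main flow against the lookup flow.

    Assumption: matched rows go to out1, unmatched main rows go to reject.
    """
    lookup = {row[key]: row for row in lookup_rows}
    out1, reject = [], []
    for row in main_rows:
        match = lookup.get(row[key])
        if match is not None:
            out1.append({**row, **match})  # main row enriched with lookup columns
        else:
            reject.append(row)  # no matching lookup row
    return out1, reject


movies = read_delimited(MOVIES)        # main link
directors = read_delimited(DIRECTORS)  # lookup link
out1, reject = map_rows(movies, directors, key="directorID")
```

In the actual Job, each of the two output flows is written by its own tFileOutputParquet component; the two lists here simply stand in for those flows.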
In the workspace, the whole Job looks like this: