Update the components to finalize a data transformation process that runs in the Spark Streaming framework.
A Kafka cluster is used instead of the DBFS system to provide the streaming movie data to the Job. The director data is still ingested from DBFS in the lookup flow.
Before you begin
The Databricks cluster to be used has been properly configured and is running.
The cluster administrator has granted read/write permissions to the username you use to access the related data and directories in DBFS and the Azure ADLS Gen2 storage system.
- In the Repository, double-click the aggregate_movie_director_spark_streaming Job to open it in the workspace.
The icons indicate that the components that are used in the original Job do not exist in the current Job framework, Spark Batch. They are tHDFSInput and tHDFSOutput in this example.
- Click tHDFSInput to select it, then click OK in the pop-up Warning window to close it.
- Press Delete on your keyboard to remove tHDFSInput.
- In the Job workspace, enter tFileInputDelimited and select this component from the list that appears.
tFileInputDelimited is added to the workspace.
- Do the same to replace tHDFSOutput with tFileOutputDelimited.
- Expand the Hadoop cluster node under the Metadata node in the Repository and then the my_cdh connection node and its child node to display the movies schema metadata node you have set up under the HDFS folder.
- Drop this schema metadata node onto the new tFileInputDelimited component in the workspace of the Job.
- Right-click this tFileInputDelimited component, then from the contextual menu, select Row > Main and click tMap to connect it to tMap.
- Right-click tMap, then from the contextual menu, select Row > out1 and click the new tFileOutputDelimited to connect tMap to this component.
- Double-click the new tFileOutputDelimited component to open its Component view.
- In the Folder field, enter or browse to the directory you need to write the result in. In this scenario, it is /user/ychen/output_data/spark_batch/out, which receives the records that contain the names of the movie directors.
- Select the Merge result to single file check box in order to merge the part- files typically generated by Spark into one single file.
The Merge file path field is displayed.
- In the Merge file path field, enter or browse to the file into which you want the part- files to be merged.
In this scenario, this file is /user/ychen/output_data/spark_batch/out/merged.
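To illustrate what merging the part- files amounts to, the following is a minimal Python sketch, not the component's actual implementation. It assumes the part- files sit in a local directory and that concatenating them in name order is acceptable; the function name `merge_part_files`, the sample records, and the temporary directory are hypothetical.

```python
import shutil
import tempfile
from pathlib import Path


def merge_part_files(folder, merged_path):
    """Concatenate every part-* file in `folder` into `merged_path`, in name order.

    Hypothetical helper for illustration only; it works on a local directory,
    not on HDFS or DBFS.
    """
    folder = Path(folder)
    merged_path = Path(merged_path)
    # Only files whose names start with "part-" are merged; marker files
    # such as _SUCCESS are left out.
    parts = sorted(p for p in folder.glob("part-*") if p.is_file())
    with merged_path.open("wb") as merged:
        for part in parts:
            with part.open("rb") as src:
                shutil.copyfileobj(src, merged)
    return [p.name for p in parts]


# Demo on a throwaway local directory standing in for the output folder.
demo_dir = Path(tempfile.mkdtemp())
(demo_dir / "part-00000").write_text("1;Movie A;Director X\n")
(demo_dir / "part-00001").write_text("2;Movie B;Director Y\n")
(demo_dir / "_SUCCESS").write_text("")  # marker file, ignored by the glob

merged_names = merge_part_files(demo_dir, demo_dir / "merged")
merged_text = (demo_dir / "merged").read_text()
```

As in the scenario, the merged file in this sketch is written inside the output folder itself; because its name does not start with part-, it is not picked up as an input.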
- Double-click the other tFileOutputDelimited component which receives the reject link from tMap to open its Component view.
- In the Folder field, set the directory to /user/ychen/output_data/spark_batch/reject.
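The split between the out1 and reject outputs can be pictured with the following Python sketch. It assumes that tMap performs an inner join between the movie flow and the director lookup flow, and that movies without a matching director go to the reject output; the field names (`movie_id`, `title`, `director_id`, `director_name`), the function `route_movies`, and the sample records are hypothetical and are not taken from the Job.

```python
def route_movies(movies, directors_by_id):
    """Split movie records into matched and rejected lists.

    A record is matched when its director_id exists in the lookup;
    otherwise it is rejected unchanged.
    """
    matched, rejected = [], []
    for movie in movies:
        name = directors_by_id.get(movie["director_id"])
        if name is None:
            rejected.append(movie)
        else:
            # Enrich the movie record with the director's name.
            matched.append({**movie, "director_name": name})
    return matched, rejected


# Hypothetical sample data: the second movie has no matching director.
movies = [
    {"movie_id": 1, "title": "Movie A", "director_id": 10},
    {"movie_id": 2, "title": "Movie B", "director_id": 99},
]
directors_by_id = {10: "Director X"}

out1, reject = route_movies(movies, directors_by_id)
```

In this sketch, `out1` corresponds to the records written to the out directory and `reject` to those written to the reject directory.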
- In the Run view, click the Spark configuration tab to verify that the Hadoop/Spark connection metadata has been properly inherited from the original Job.
You always need to use this Spark Configuration tab to define the connection to a given Hadoop/Spark distribution for the whole Spark Batch Job. This connection is effective on a per-Job basis.
- If you are not sure that the Spark cluster is able to resolve the hostname of the machine where the Job is executed, select the Define the driver hostname or IP address check box and in the field that is displayed, enter the IP address of this machine.
If you leave this check box clear, the Spark cluster looks for the Spark driver at 127.0.0.1, that is to say, on a machine within the cluster itself.
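If you do not know which IP address to enter, the following Python sketch shows one common way to find a candidate on the machine where the Job is executed. It is only an aid under stated assumptions: the function name `guess_local_ip` and the probe address (taken from a range reserved for documentation) are hypothetical choices, and on a machine with several network interfaces the result may not be the address that the cluster can actually reach.

```python
import ipaddress
import socket


def guess_local_ip(probe_address=("192.0.2.1", 9)):
    """Return the local IP address the OS would use for outbound traffic.

    Connecting a UDP socket sends no packet; it only makes the OS pick a
    route and therefore a source address. Falls back to 127.0.0.1 when no
    route is available.
    """
    try:
        with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
            sock.connect(probe_address)
            return sock.getsockname()[0]
    except OSError:
        return "127.0.0.1"


candidate_ip = guess_local_ip()
# A loopback result is not useful for the driver field: the cluster would
# look for the driver on its own machines.
is_loopback = ipaddress.ip_address(candidate_ip).is_loopback
```

If `is_loopback` is true, look up the address manually or ask your network administrator before filling in the field.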
- Press F6 to run the Job.
The Run view is automatically opened in the lower part of the Studio and shows the execution progress of this Job.
Once done, you can check, for example in the web console of your HDFS system, that the output has been written in HDFS.