Update the components to finalize a data transformation process that
runs in the Spark Streaming framework.
A Kafka cluster is used instead of the DBFS system to provide the streaming movie data to
the Job. The director data is still ingested from DBFS in the lookup flow.
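As a rough illustration of the lookup flow described above, the following sketch enriches a main flow of movie records with director names taken from a lookup table, and sets aside the records that find no match (the role played by the reject link later in this procedure). It is plain Python rather than anything generated by the Studio, and the field names and sample values are hypothetical, since the actual schemas are not shown here.

```python
# Hypothetical sample data: field names and values are illustrative
# assumptions, not the schema used by the Job.
movies = [
    {"movie": "Movie A", "director_id": 1},
    {"movie": "Movie B", "director_id": 2},
    {"movie": "Movie C", "director_id": 9},  # no matching director
]
directors = {1: "Director X", 2: "Director Y"}  # lookup flow, keyed by ID


def join_with_lookup(main_rows, lookup):
    """Split main rows into enriched matches and unmatched rejects."""
    matched, rejected = [], []
    for row in main_rows:
        name = lookup.get(row["director_id"])
        if name is None:
            # No director found for this movie: route it to the reject output.
            rejected.append(row)
        else:
            # Director found: add the name to a copy of the movie record.
            matched.append({**row, "director_name": name})
    return matched, rejected


out1, reject = join_with_lookup(movies, directors)
print(out1)
print(reject)
```

Keeping the rejected records in their own output makes it possible to inspect the movies whose director could not be resolved instead of silently dropping them.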
Before you begin
-
The Databricks cluster to be used has been
properly configured and is running.
-
The administrator of the cluster has given
read/write rights and permissions to the username to be used for the access
to the related data and directories in DBFS and the Azure ADLS Gen2 storage
system.
Procedure
-
In the Repository, double-click the aggregate_movie_director_spark_streaming Job to
open it in the workspace.
The icons indicate that the components used in the original Job,
tHDFSInput and tHDFSOutput in this example, do not exist in the current Job
framework, Spark Batch.
-
Click tHDFSInput to select it, then click OK
in the Warning window that pops up to close it.
-
Press Delete on your keyboard to
remove tHDFSInput.
-
In the Job workspace, type tFileInputDelimited and select this component from the list that
appears.
tFileInputDelimited is added to the
workspace.
-
Do the same to replace tHDFSOutput
with tFileOutputDelimited.
-
In the Repository, expand the Hadoop cluster node under the Metadata node, then expand the my_cdh connection node and its child node to
display the movies schema
metadata node you set up under the HDFS folder.
-
Drop this schema metadata node onto the new
tFileInputDelimited component in the workspace of the Job.
-
Right-click this tFileInputDelimited
component, then from the contextual menu, select Row >
Main and click tMap to connect it to
tMap.
-
Right-click tMap, then from the
contextual menu, select Row > out1 and click the new
tFileOutputDelimited to connect tMap to this component.
-
Double-click the new tFileOutputDelimited component to open its Component view.
-
In the Folder field, enter or browse
to the directory to write the result to. In this scenario, it is /user/ychen/output_data/spark_batch/out, which receives the
records that contain the names of the movie directors.
-
Select the Merge result to single file
check box to merge the part- files
typically generated by Spark into a single file.
The Merge file path field is
displayed.
-
In the Merge file path field, enter or
browse to the file into which you want the part-
files to be merged.
In this scenario, this file is /user/ychen/output_data/spark_batch/out/merged.
-
Double-click the other tFileOutputDelimited component which receives the reject link from tMap to
open its Component view.
-
In the Folder field, set the directory to /user/ychen/output_data/spark_batch/reject.
-
In the Run view, click the Spark configuration tab to verify that the Hadoop/Spark connection metadata
has been properly inherited from the original Job.
You always need to use this Spark
configuration tab to define the connection to a given Hadoop/Spark distribution
for the whole Spark Batch Job. This connection is effective on a per-Job
basis.
-
If you are not sure that the Spark cluster is able to resolve the
hostname of the machine where the Job is executed, select the Define the driver hostname or IP address check box and in the field
that is displayed, enter the IP address of this machine.
If you leave this check box clear, the Spark cluster looks for the
Spark driver at 127.0.0.1, that is to say, on the machine within the
cluster itself.
-
Press F6 to run the Job.
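The driver hostname step above can be pictured with a small sketch. In plain Spark, the address at which the driver is reachable is set with the spark.driver.host property. Whether the Define the driver hostname or IP address check box maps to exactly this property is an assumption here, and the helper names local_ip and driver_properties are hypothetical. The sketch looks up an outbound IP address of the local machine and refuses a loopback address, since the cluster cannot reach the driver through it.

```python
import ipaddress
import socket


def local_ip(probe_host="192.0.2.1", probe_port=9):
    """Return the local IP the OS would use for outbound traffic, or None.

    Connecting a UDP socket sends no packet; it only asks the OS to pick
    a route. The probe address is a reserved documentation address.
    """
    try:
        with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
            sock.connect((probe_host, probe_port))
            return sock.getsockname()[0]
    except OSError:
        return None


def driver_properties(driver_ip):
    """Build the Spark property for the driver address, rejecting loopback."""
    address = ipaddress.ip_address(driver_ip)  # raises ValueError if malformed
    if address.is_loopback:
        raise ValueError("a loopback address is not reachable from the cluster")
    return {"spark.driver.host": str(address)}


ip = local_ip()
if ip is not None and not ipaddress.ip_address(ip).is_loopback:
    print(driver_properties(ip))
else:
    print("No routable address found; enter the IP address manually.")
```

On a machine with several network interfaces, the address found this way may not be the one the cluster can reach, so it is worth checking it against the network the cluster is on before entering it.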
Results
The Run view is automatically
opened in the lower part of the Studio and shows the execution progress of this
Job.
Once the Job is done, you can check, for example in the web console of your HDFS
system, that the output has been written to HDFS.
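To picture what the Merge result to single file option in the procedure produces, the following sketch concatenates part- files into one merged file. It is only a stand-in that works on a local temporary directory, not on HDFS, and it does not claim to reproduce how the Studio performs the merge. The function name merge_part_files and the sample records are hypothetical.

```python
import shutil
import tempfile
from pathlib import Path


def merge_part_files(folder, merged_path):
    """Concatenate the part-* files of a folder, in name order, into one file.

    Returns the number of part files merged.
    """
    # List the parts before creating the merged file; other files in the
    # folder (for example a _SUCCESS marker) are ignored.
    parts = sorted(p for p in Path(folder).glob("part-*") if p.is_file())
    with open(merged_path, "wb") as merged:
        for part in parts:
            with open(part, "rb") as source:
                shutil.copyfileobj(source, merged)
    return len(parts)


# Demonstration on a temporary local directory with made-up records.
with tempfile.TemporaryDirectory() as tmp:
    out_dir = Path(tmp) / "out"
    out_dir.mkdir()
    (out_dir / "part-00000").write_text("Movie A;Director X\n")
    (out_dir / "part-00001").write_text("Movie B;Director Y\n")
    (out_dir / "_SUCCESS").write_text("")
    count = merge_part_files(out_dir, out_dir / "merged")
    merged_text = (out_dir / "merged").read_text()

print(count)
print(merged_text, end="")
```

As in the scenario, the merged file sits in the same folder as the part- files; because its name does not start with part-, it is never picked up as an input of the merge.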