Link the components to construct the data flow.
- In the Integration perspective of the Studio, create an empty Spark Batch Job from the Job Designs node in the Repository tree view.
- In the workspace, enter the name of each component to be used and select it from the list that appears. In this scenario, the components are tHDFSConfiguration (labeled emr_hdfs), tS3Configuration, tFixedFlowInput, tAggregateRow, and tFileOutputParquet.
  The tFixedFlowInput component is used to load the sample data into the data flow. In real-world practice, replace tFixedFlowInput with the input component specific to the data format or the source system you are using.
- Connect tFixedFlowInput, tAggregateRow, and tFileOutputParquet using Row > Main links.
- Leave the tHDFSConfiguration and tS3Configuration components unconnected; configuration components are not part of the data flow and provide their connection details to the Job without any link.
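
To make the middle step of this flow concrete, the sketch below simulates in plain Python what a tAggregateRow component typically does: group incoming rows by a key and apply an aggregate function (here, a sum). The column names (`country`, `amount`) and the sample rows are hypothetical stand-ins for whatever tFixedFlowInput would emit; this is an illustration of the group-by pattern, not code generated by the Studio.

```python
from collections import defaultdict

# Hypothetical sample rows, standing in for the data tFixedFlowInput would load.
rows = [
    {"country": "France", "amount": 10.0},
    {"country": "France", "amount": 5.0},
    {"country": "Germany", "amount": 7.5},
]

def aggregate(rows, group_by, sum_col):
    """Group rows on one column and sum another, mimicking a
    tAggregateRow configured with a group-by column and a sum operation."""
    totals = defaultdict(float)
    for row in rows:
        totals[row[group_by]] += row[sum_col]
    # One output row per group, like the aggregated flow sent downstream
    # (in the Job, to tFileOutputParquet).
    return [{group_by: key, sum_col: total} for key, total in sorted(totals.items())]

print(aggregate(rows, "country", "amount"))
```

In the Job itself, this grouping is configured in the Basic settings view of tAggregateRow rather than written by hand; the Studio generates the corresponding Spark code when the Job runs.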