Writing the aggregated data about street incidents to EMR - 7.3

Amazon EMR distribution

Procedure

  1. Double-click the tFileOutputParquet component to open its Component view.

  2. Select the Define a storage configuration component check box and then select the tS3Configuration component you configured in the previous steps.
  3. Click Sync columns to ensure that tFileOutputParquet retrieves the schema from the output side of tAggregateRow.
  4. In the Folder/File field, enter the name of the folder to be used to store the aggregated data in the S3 bucket specified in tS3Configuration. For example, if you enter /sample_user, the folder named sample_user at the root of the bucket is used at runtime to store the output of your Job.
  5. From the Action drop-down list, select Create if the folder does not yet exist in the bucket; if the folder already exists, select Overwrite.
  6. Click Run to open its view, then click the Spark Configuration tab to display the view for configuring the Spark connection.
  7. Select the Use local mode check box to test your Job locally before submitting it to the remote Spark cluster.

    In local mode, the Studio builds the Spark environment on the fly within itself to run the Job. Each processor of the local machine is used as a Spark worker to perform the computations.

  8. In this mode, your local file system is used; therefore, if you have placed configuration components such as tS3Configuration or tHDFSConfiguration in your Job, deactivate them, because they provide connection information to a remote file system.
  9. In the Component view of tFileOutputParquet, change the path in the Folder/File field to a local directory and select the appropriate action from the Action drop-down list, that is, create a new folder or overwrite the existing one.
  10. In the Run tab, click Basic Run, then click Run in this view to execute your Job locally and test its design logic.
  11. When your Job runs successfully, clear the Use local mode check box in the Spark Configuration view of the Run tab. Then, in the design workspace of your Job, reactivate the configuration components and revert the changes you made in tFileOutputParquet for the local test.
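The difference between a local test run (Use local mode selected) and a run on the remote cluster comes down to the Spark master the generated Job targets. The fragment below is a hedged sketch of typical Spark configuration properties, not the exact properties Talend generates:

```
# Local test run: one executor thread per processor of the local machine
spark.master    local[*]

# Remote run on the EMR cluster (YARN is the usual EMR resource manager)
spark.master    yarn
```

Clearing the Use local mode check box, as in step 11, corresponds to switching from the first setting to the second.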
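The Create and Overwrite behavior of the Action drop-down list (steps 5 and 9) can be sketched in plain Python. This is a minimal illustration of the two semantics on a local folder, not Talend's actual implementation; the function name write_output and the plain-text output file are hypothetical stand-ins for the real Parquet write.

```python
import os
import shutil
import tempfile

def write_output(folder, records, action="create"):
    """Hypothetical sketch of tFileOutputParquet's Action setting.

    action="create"    -> fail if the folder already exists
    action="overwrite" -> remove any existing folder first
    """
    if os.path.exists(folder):
        if action == "create":
            raise FileExistsError(f"{folder} already exists; use 'overwrite'")
        shutil.rmtree(folder)
    os.makedirs(folder)
    # Stand-in for the real Parquet write: one text line per record.
    with open(os.path.join(folder, "part-00000"), "w") as f:
        f.writelines(line + "\n" for line in records)

# Usage: the first run creates the folder; the second must overwrite it.
out = os.path.join(tempfile.gettempdir(), "sample_user")
write_output(out, ["incident_a,3"], action="overwrite")
write_output(out, ["incident_a,3", "incident_b,7"], action="overwrite")
```

Selecting Overwrite for repeated runs mirrors the second call above: the previous output folder is replaced rather than causing the Job to fail.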