Procedure
- Add a tHDFSConfiguration component and configure it with the HDFS connection metadata from your repository:
- To read your file, use a tFileInputDelimited component.
  Note: The components that read and write files in Big Data Batch - Spark Jobs are generic and can be used with any storage.
- To specify which file system to use, click the Define a storage configuration component option and select tHDFSConfiguration in the list:
- To save your processed data to Amazon S3, add a tS3Configuration component.
- Provide your Amazon credentials and the bucket name:
- Add a tFileOutputDelimited component to write your data. In the Component view of tFileOutputDelimited, specify storage on Amazon S3:
  Be careful when writing the folder name: it must begin with a slash symbol "/".
- Configure your Job to use the Amazon EMR cluster metadata:
- Run your Job and verify that the new folder has been created on Amazon S3 (a rough plain-Spark equivalent of the read and write steps is sketched after this procedure):
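If you want to see what this Job amounts to outside the Studio, the sketch below is a rough plain-Spark (PySpark) equivalent of the read and write steps above. It is not the code Talend generates: the NameNode URI, delimiter, bucket and folder names, and the credentials are placeholders, and the s3a connector (hadoop-aws) must be available on the cluster.

```python
# Minimal PySpark sketch, assuming placeholder paths and credentials:
# read a delimited file from HDFS and write the result to Amazon S3.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("hdfs_to_s3_sketch")
    # Rough equivalent of tS3Configuration: pass S3 credentials to Hadoop's s3a client.
    .config("spark.hadoop.fs.s3a.access.key", "YOUR_ACCESS_KEY")
    .config("spark.hadoop.fs.s3a.secret.key", "YOUR_SECRET_KEY")
    .getOrCreate()
)

# Rough equivalent of tHDFSConfiguration + tFileInputDelimited:
# read a delimited file from HDFS.
customers = (
    spark.read
    .option("delimiter", ";")      # field separator, assumed
    .option("header", "false")
    .csv("hdfs://namenode:8020/user/talend/customers.csv")  # hypothetical input path
)

# Rough equivalent of tFileOutputDelimited + tS3Configuration: write the data to S3.
# In the Talend component the output folder is entered with a leading "/";
# in a raw s3a URI that slash is simply the separator after the bucket name.
(
    customers.write
    .option("delimiter", ";")
    .mode("overwrite")
    .csv("s3a://my-bucket/output/customers")  # hypothetical bucket and folder
)

spark.stop()
```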
Tip:
If you can’t read from or write to Amazon S3 from a Big Data Batch Job (Spark or MapReduce), you may have to update the policy attached to the EMR_EC2_DefaultRole. This role was created when you launched the Amazon EMR cluster for the first time.
You can do this from Amazon Web Services > Identity & Access Management > Roles.
For more information about role policies, see Using IAM Roles with Amazon EC2 Instances.
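If you prefer to script this instead of using the console, the hedged boto3 sketch below attaches an S3 policy to the role. The AmazonS3FullAccess managed policy is used only as an example; a narrower custom policy scoped to your bucket is usually preferable.

```python
# Hedged sketch using boto3 (assumed installed and configured with credentials
# that are allowed to modify IAM): attach an S3 policy to EMR_EC2_DefaultRole
# so the EMR cluster's EC2 instances can access Amazon S3.
import boto3

iam = boto3.client("iam")

# AmazonS3FullAccess is only an example; a custom policy limited to your bucket
# is usually a better choice.
iam.attach_role_policy(
    RoleName="EMR_EC2_DefaultRole",
    PolicyArn="arn:aws:iam::aws:policy/AmazonS3FullAccess",
)

# List the role's attached policies to verify the change.
response = iam.list_attached_role_policies(RoleName="EMR_EC2_DefaultRole")
for policy in response["AttachedPolicies"]:
    print(policy["PolicyName"], policy["PolicyArn"])
```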