In this example, the data saved in HDFS are processed, and the processed output is
saved in an existing Amazon S3 bucket.
Before you begin
This processing is achieved with a Big Data Batch Job that uses the MapReduce framework.
Procedure
- To read the data from HDFS, use a tHDFSInput component.
Note: Configure it using the HDFS connection metadata from your repository, as mentioned in Amazon EMR - Getting Started.
- Provide the schema and the path of the file saved on HDFS:
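As a quick check outside the Studio, you can confirm that the input file is actually present on HDFS before wiring it into the Job. The sketch below uses the pyarrow HDFS client and assumes the Hadoop client libraries are available locally; the NameNode host, port, and file path are placeholders, not values from this example.

```python
# Minimal sketch: confirm the input file exists on HDFS before designing the Job.
# The host, port, and path below are placeholders for your own cluster values.
from pyarrow import fs

hdfs = fs.HadoopFileSystem("your-emr-master-node", port=8020)

info = hdfs.get_file_info(["/user/hadoop/input/customers.csv"])[0]
if info.type == fs.FileType.NotFound:
    print("Input file not found on HDFS")
else:
    print(f"Found {info.path} ({info.size} bytes)")
```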
- Then, process the data using any processing component available in the Palette, such as tMap, tAggregateRow, or tSortRow.
Note: Without at least one processing component, your Job execution will fail, because no MapReduce tasks will be generated and submitted to the cluster.
- To write your data to an Amazon S3 bucket, use the tS3Output component.
Note: The difference between tS3Put and tS3Output is that tS3Put copies a local file to Amazon S3, whereas tS3Output receives the data processed by its preceding component and writes it to a given Amazon S3 file system.
- To configure the tS3Output component, you need your Amazon credentials and the names of the bucket and folder where the data will be written:
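Before running the Job, you can sanity-check that the bucket you plan to enter in tS3Output is reachable. The sketch below uses boto3 and assumes the same credentials are configured locally (environment variables or ~/.aws/credentials); the bucket name is a placeholder.

```python
# Minimal sketch: confirm the target bucket is reachable with your credentials.
# "my-output-bucket" is a placeholder for your own bucket name.
import boto3
from botocore.exceptions import ClientError

s3 = boto3.client("s3")  # picks up credentials from the environment or ~/.aws

try:
    s3.head_bucket(Bucket="my-output-bucket")
    print("Bucket is reachable with these credentials")
except ClientError as err:
    print(f"Cannot access bucket: {err}")
```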
- Because you are designing a Big Data Batch Job, you must configure the connection to your Amazon EMR cluster before you run the Job. In the Run view of your Job, click the Hadoop Configuration tab and use the cluster connection metadata:
- Run your Job and check the results in your Amazon S3 bucket:
- Open the destination folder to view more details:
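Instead of browsing the S3 console, you can also list the generated output from a short script. The sketch below uses boto3 to list the objects under the destination folder; the bucket name and prefix are placeholders for the values you configured in tS3Output.

```python
# Minimal sketch: list the files the Job wrote to the destination folder.
# Bucket name and prefix are placeholders; MapReduce output is typically
# split into part-* files plus a _SUCCESS marker.
import boto3

s3 = boto3.client("s3")
response = s3.list_objects_v2(Bucket="my-output-bucket", Prefix="processed/")

for obj in response.get("Contents", []):
    print(f'{obj["Key"]}  ({obj["Size"]} bytes)')
```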
Tip: If you cannot read from or write to Amazon S3 from a Big Data Batch Job (Spark or MapReduce), you may need to update the policy attached to the EMR_EC2_DefaultRole role, which was created when you launched the Amazon EMR cluster for the first time. You can do this from Amazon Web Services > Identity & Access Management > Roles.
For more information about role policies, see Using IAM Roles with Amazon EC2 Instances.
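As an illustration of the kind of permissions involved, the sketch below attaches an inline policy granting read and write access to the output bucket to the EMR_EC2_DefaultRole role. The policy name, bucket name, and exact set of actions are assumptions; adjust them to your own security requirements, or edit the role directly in the IAM console as described above.

```python
# Minimal sketch: attach an inline S3 access policy to EMR_EC2_DefaultRole.
# The policy name, bucket name, and action list are illustrative assumptions.
import json
import boto3

iam = boto3.client("iam")

policy = {
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Action": ["s3:GetObject", "s3:PutObject", "s3:ListBucket"],
            "Resource": [
                "arn:aws:s3:::my-output-bucket",
                "arn:aws:s3:::my-output-bucket/*",
            ],
        }
    ],
}

iam.put_role_policy(
    RoleName="EMR_EC2_DefaultRole",
    PolicyName="S3AccessForBigDataBatchJobs",  # illustrative name
    PolicyDocument=json.dumps(policy),
)
```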