Transferring data from HDFS to Amazon S3 - YARN framework

In this example, data stored in HDFS is processed, and the processed output is written to an existing Amazon S3 bucket.

Before you begin

This processing is achieved with a Big Data Batch Job that uses the MapReduce framework.
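
Optionally, you can first confirm that your AWS credentials are valid and that the destination bucket already exists, since the Job writes to an existing bucket. The sketch below shows one way to do this with the boto3 Python library; it is not part of the Talend Job, and the bucket name my-demo-bucket is a placeholder.

    # Pre-check sketch: assumes boto3 is installed and AWS credentials are
    # available (for example via environment variables or ~/.aws/credentials).
    # "my-demo-bucket" is a placeholder for your existing destination bucket.
    import boto3
    from botocore.exceptions import ClientError

    BUCKET = "my-demo-bucket"

    s3 = boto3.client("s3")
    try:
        s3.head_bucket(Bucket=BUCKET)  # fails if the bucket is missing or not accessible
        print(f"Bucket '{BUCKET}' is reachable with the current credentials.")
    except ClientError as err:
        print(f"Cannot access bucket '{BUCKET}': {err}")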

Procedure

  1. To read the data from HDFS, use a tHDFSInput component.
    Note: Configure it using the HDFS connection metadata from your repository, as mentioned in Amazon EMR - Getting Started.
  2. Provide the schema and the path of the file saved on HDFS.
  3. Then, the data can be processed using any processing component available in the Palette, such as tMap, tAggregateRow, or tSortRow.
    Note: Without at least one processing component, the Job execution fails, because no MapReduce tasks are generated and submitted to the cluster.
  4. To write your data to an Amazon S3 bucket, use the tS3Output component.
    Note: The difference between tS3Put and tS3Output is that tS3Put copies a local file to Amazon S3, whereas tS3Output receives the data processed by its preceding component and writes it to a given Amazon S3 file system.
  5. To configure the tS3Output component, you need your Amazon credentials and the bucket and folder names where the data will be written.

  6. Because you are designing a Big Data Batch Job, you must configure the connection to your Amazon EMR cluster before you run the Job.
    In the Run view of your Job, click the Hadoop Configuration tab and use the cluster connection metadata.
  7. Run your Job and check the results in your Amazon S3 bucket. You can also list the output objects programmatically; see the listing sketch at the end of this section.
  8. Open the destination folder to view more details.
    Tip:

    If you can’t read from or write to Amazon S3 from a Big Data Batch Job (Spark or MapReduce), you may have to update the policy attached to EMR_EC2_DefaultRole, the role that was created when you launched the Amazon EMR cluster for the first time.

    You can do this from Amazon Web Services > Identity & Access Management > Roles.

    For more information about role policies, see Using IAM Roles with Amazon EC2 Instances.
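
    As a complement to this Tip, the sketch below shows how the same check and fix could be scripted with the boto3 Python library, assuming it is installed and the caller has IAM administration rights. The console path above remains the documented approach; AmazonS3FullAccess is only an example of an AWS-managed policy that grants S3 access, and a narrower policy may be preferable.

        # Sketch: inspect the policies attached to EMR_EC2_DefaultRole and, if
        # S3 access is missing, attach an S3 policy. Assumes boto3 is installed
        # and the caller has IAM permissions; AmazonS3FullAccess is broad.
        import boto3

        iam = boto3.client("iam")
        ROLE = "EMR_EC2_DefaultRole"

        # List the policies currently attached to the role.
        for policy in iam.list_attached_role_policies(RoleName=ROLE)["AttachedPolicies"]:
            print(policy["PolicyName"], policy["PolicyArn"])

        # Example only: attach the AWS-managed S3 policy to grant S3 access.
        iam.attach_role_policy(
            RoleName=ROLE,
            PolicyArn="arn:aws:iam::aws:policy/AmazonS3FullAccess",
        )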
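
To check the results of step 7 without opening the Amazon S3 console, you can also list the objects written by the Job, for example with the boto3 Python library. This is only a sketch; the bucket and folder names are placeholders for the values configured in the tS3Output component.

    # Sketch: list the output objects written by the Job. Assumes boto3 is
    # installed; "my-demo-bucket" and "output/" are placeholders for the
    # bucket and folder configured in tS3Output.
    import boto3

    BUCKET = "my-demo-bucket"
    PREFIX = "output/"

    s3 = boto3.client("s3")
    response = s3.list_objects_v2(Bucket=BUCKET, Prefix=PREFIX)
    for obj in response.get("Contents", []):
        print(obj["Key"], obj["Size"])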