Transferring data from HDFS to Amazon S3 - Spark framework

Amazon S3

author
Talend Documentation Team
EnrichVersion
6.5
EnrichProdName
Talend Open Studio for MDM
Talend Real-Time Big Data Platform
Talend Data Integration
Talend ESB
Talend Big Data Platform
Talend Data Fabric
Talend Data Management Platform
Talend Open Studio for Big Data
Talend Open Studio for ESB
Talend Data Services Platform
Talend Big Data
Talend MDM Platform
Talend Open Studio for Data Integration
EnrichPlatform
Talend Studio

Procedure

  1. Add a tHDFSConfiguration component and configure it with the HDFS connection metadata from your repository.
  2. To read your file, use a tFileInputDelimited component.
    Note: The components used to read and write files in Big Data Batch - Spark Jobs are generic and can be used with any storage.
  3. To specify which file system to use, select the Define a storage configuration component option and select tHDFSConfiguration in the list.
  4. To save your processed data to Amazon S3, add a tS3Configuration component.
  5. Provide your Amazon credentials and the bucket name.
  6. Add a tFileOutputDelimited component to write your data. In the Component view of tFileOutputDelimited, specify Amazon S3 as the storage.

    Be careful when entering the folder name: it must be preceded by a slash ("/").

  7. Configure your Job to use the Amazon EMR cluster metadata.
  8. Run your Job and verify that the new folder has been created on Amazon S3.
    Tip:

    If you can’t read from or write to Amazon S3 from a Big Data Batch Job (Spark or MapReduce), you may need to update the policy attached to the EMR_EC2_DefaultRole role. This role was created when you first launched the Amazon EMR cluster.

    You can update the policy from Amazon Web Services > Identity & Access Management > Roles.

    For more information about role policies, see Using IAM Roles with Amazon EC2 Instances.
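
As an illustration of the tip in step 8, the following is a minimal sketch of the kind of policy document that could be attached to the EMR_EC2_DefaultRole so that the Job can read from and write to one bucket. It is not taken from the Talend or AWS documentation for this scenario: the function name build_s3_access_policy and the bucket name my-talend-bucket are hypothetical, and the list of actions is an assumption that may need to be widened or narrowed for your environment. The delete permission is included on the assumption that Spark output is first written to temporary objects that are later removed.

```python
import json


def build_s3_access_policy(bucket_name):
    """Return an IAM policy document (as a dict) for read/write access to one bucket.

    Hypothetical helper for illustration only; adapt the actions to your needs.
    """
    bucket_arn = f"arn:aws:s3:::{bucket_name}"
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                # Listing applies to the bucket itself.
                "Effect": "Allow",
                "Action": ["s3:ListBucket"],
                "Resource": [bucket_arn],
            },
            {
                # Object operations apply to the keys inside the bucket.
                "Effect": "Allow",
                "Action": ["s3:GetObject", "s3:PutObject", "s3:DeleteObject"],
                "Resource": [f"{bucket_arn}/*"],
            },
        ],
    }


if __name__ == "__main__":
    # "my-talend-bucket" is a placeholder; replace it with your own bucket name.
    print(json.dumps(build_s3_access_policy("my-talend-bucket"), indent=2))
```

The printed JSON can be pasted into the policy editor for the role. Restricting the resources to a single bucket, rather than granting access to all of Amazon S3, keeps the role's permissions as narrow as the Job requires.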