Transferring data from HDFS to Amazon S3 - Spark framework - 7.0

Spark Batch

author
Talend Documentation Team
EnrichVersion
7.0
EnrichProdName
Talend Big Data
Talend Big Data Platform
Talend Data Fabric
Talend Real-Time Big Data Platform
task
Design and Development > Designing Jobs > Job Frameworks > Spark Batch
EnrichPlatform
Talend Studio

Procedure

  1. Add a tHDFSConfiguration component and configure it with the HDFS connection metadata from your repository:
  2. To read your file, use a tFileInputDelimited component.
    Note: The components that read and write files in Big Data Batch - Spark Jobs are generic and can be used with any storage.
  3. To specify which file system to use, click the Define a storage configuration component option and select tHDFSConfiguration in the list:
  4. To save your processed data to Amazon S3, add a tS3Configuration component.
  5. Provide your Amazon credentials and the bucket name:
  6. Add a tFileOutputDelimited component to write your data. In the Component view of tFileOutputDelimited, specify storage on Amazon S3:

    Be careful when entering the folder name: it must start with a slash symbol "/".

  7. Configure your Job to use the Amazon EMR cluster metadata:
  8. Run your Job and verify that the new folder has been created on Amazon S3 (a hand-written Spark equivalent of this Job is sketched at the end of this section):
    Tip:

    If you can’t read from or write to Amazon S3 from a Big Data Batch Job (Spark or MapReduce), you may have to update the policy attached to the EMR_EC2_DefaultRole. This role was created when you launched the Amazon EMR cluster for the first time:

    You can do this from the Amazon Web Services console, under Identity & Access Management > Roles.

    For more information about role policies, see Using IAM Roles with Amazon EC2 Instances.
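A quick way to confirm whether the EMR_EC2_DefaultRole actually grants access to your bucket is to list the bucket from a node of the cluster using the instance profile credentials. The sketch below is not part of the generated Job; it uses the AWS SDK for Java from Scala, and the bucket name is a placeholder.

    import com.amazonaws.services.s3.AmazonS3ClientBuilder
    import scala.collection.JavaConverters._

    object CheckBucketAccess {
      def main(args: Array[String]): Unit = {
        // Picks up the instance profile credentials (EMR_EC2_DefaultRole) through the
        // default credentials provider chain, so an AccessDenied error here points to
        // a missing S3 permission in the role's policy.
        val s3 = AmazonS3ClientBuilder.defaultClient()
        val bucket = "my-bucket" // placeholder: your target bucket
        s3.listObjectsV2(bucket).getObjectSummaries.asScala.foreach(o => println(o.getKey))
      }
    }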
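Putting the procedure together: under the hood, the Job reads the delimited file from HDFS and writes it back out to Amazon S3 through Spark. The following hand-written sketch is only a rough approximation of what Talend Studio generates from the components above; the HDFS URI, delimiter, bucket name, output folder, and credential environment variables are all placeholders.

    import org.apache.spark.sql.SparkSession

    object HdfsToS3Transfer {
      def main(args: Array[String]): Unit = {
        val spark = SparkSession.builder()
          .appName("HdfsToS3Transfer")
          .getOrCreate()

        // Roughly what tS3Configuration does: hand the credentials to the s3a connector.
        val hadoopConf = spark.sparkContext.hadoopConfiguration
        hadoopConf.set("fs.s3a.access.key", sys.env("AWS_ACCESS_KEY_ID"))
        hadoopConf.set("fs.s3a.secret.key", sys.env("AWS_SECRET_ACCESS_KEY"))

        // Roughly tFileInputDelimited reading through tHDFSConfiguration.
        val data = spark.read
          .option("delimiter", ";")
          .csv("hdfs://namenode:8020/user/talend/input_data")

        // Roughly tFileOutputDelimited writing through tS3Configuration.
        // Note the leading "/" before the folder name, as in the procedure above.
        data.write
          .option("delimiter", ";")
          .csv("s3a://my-bucket/output_folder")

        spark.stop()
      }
    }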