Scenario: Working with Amazon S3 on an Amazon EMR cluster

EnrichVersion
Cloud
6.4
EnrichProdName
Talend MDM Platform
Talend Real-Time Big Data Platform
Talend Open Studio for MDM
Talend Open Studio for Data Integration
Talend ESB
Talend Open Studio for ESB
Talend Open Studio for Big Data
Talend Data Services Platform
Talend Big Data
Talend Data Fabric
Talend Data Management Platform
Talend Data Integration
Talend Big Data Platform
task
Data Quality and Preparation > Third-party systems > Amazon services (Integration) > Amazon EMR components
Design and Development > Designing Jobs > Hadoop distributions > Amazon EMR
Design and Development > Third-party systems > Amazon services (Integration) > Amazon S3 components
Data Quality and Preparation > Third-party systems > Amazon services (Integration) > Amazon S3 components
Design and Development > Third-party systems > Amazon services (Integration) > Amazon EMR components
Data Governance > Third-party systems > Amazon services (Integration) > Amazon S3 components
Data Governance > Third-party systems > Amazon services (Integration) > Amazon EMR components
EnrichPlatform
Talend Studio

Scenario: Working with Amazon S3 on an Amazon EMR cluster

This article shows how to work with Amazon S3 on an Amazon EMR cluster.

This example uses Talend Real-Time Big Data Platform 6.1. In addition, it uses the following Amazon Web Services:

  • Amazon EC2
  • Amazon EMR
  • Amazon S3

Creating an Amazon S3 bucket

The following instructions describe how to create an Amazon S3 bucket from a Standard Job in the Talend Studio.

Before you begin

As with any Amazon service, you need your Amazon credentials. You can enter them in the Access Key and Secret Key fields, or use context variables, as described in the Amazon EMR - Getting Started article.
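
If you want to see what such a connection corresponds to outside the Studio, here is a minimal sketch using the AWS SDK for Java (1.x). It is not the tS3Connection implementation; the class name, region, and credential values are illustrative only.

    import com.amazonaws.auth.AWSStaticCredentialsProvider;
    import com.amazonaws.auth.BasicAWSCredentials;
    import com.amazonaws.regions.Regions;
    import com.amazonaws.services.s3.AmazonS3;
    import com.amazonaws.services.s3.AmazonS3ClientBuilder;

    public class S3ConnectionSketch {
        // Builds an S3 client from an access key and secret key, roughly what
        // tS3Connection configures from its Access Key and Secret Key fields.
        public static AmazonS3 connect(String accessKey, String secretKey) {
            return AmazonS3ClientBuilder.standard()
                    .withRegion(Regions.US_EAST_1) // placeholder region
                    .withCredentials(new AWSStaticCredentialsProvider(
                            new BasicAWSCredentials(accessKey, secretKey)))
                    .build();
        }
    }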

Procedure

  1. In a Standard Job, add a tS3Connection component and open the Component view.
  2. You may choose any resource location; however, we recommend not using the EU (Frankfurt) region:
  3. To check whether a bucket already exists, use the tS3BucketExist component. Add it to the Job and connect it to the tS3Connection component with an OnComponentOk trigger.

    This component can reuse the connection information from the tS3Connection component:

  4. This component reports the existence of the S3 bucket as a Boolean global variable named tS3BucketExist_1.BUCKET_EXIST:
  5. To create the bucket if it doesn't already exist, add a tS3BucketCreate component and connect it to tS3BucketExist with a Run If trigger.

    Set the condition on the Run If trigger as follows:

    This condition means that the tS3BucketCreate component will execute only if the bucket does not exist (the equivalent Java expression is sketched after this procedure).

  6. Open the Component view of the tS3BucketCreate component and provide the bucket name:
  7. Run your Job.
  8. When the Job has completed, navigate to S3 from the Amazon Web Services home page:

    You will reach the Amazon S3 home page, where you can see the list of your buckets. You should find your new bucket created from the Talend Studio:

    In a similar way, you can delete an existing bucket using the tS3BucketDelete component.

    Warning:

    Make sure that your bucket name is unique. Otherwise, you may inadvertently work on someone else's bucket.
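
For reference, a Run If condition in the Studio is an ordinary Java boolean expression. Assuming the component instance is named tS3BucketExist_1 as shown above (the numeric suffix depends on your Job), the condition from step 5 can be written as follows:

    // Fire tS3BucketCreate only when the bucket does not already exist.
    !((Boolean)globalMap.get("tS3BucketExist_1.BUCKET_EXIST"))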

Putting files in an Amazon S3 Bucket

Before you begin

Using a Standard Job, you can easily copy files to and from an Amazon S3 bucket.

Procedure

  1. Add a tS3Put component and connect it to Amazon S3, either by reusing the existing connection or by providing your Amazon credentials.
  2. Provide the Bucket name and the Key. The key is the name of the object created in the S3 bucket.
  3. In the File field, enter the path of the local file to be copied (a sketch with example values follows this procedure):
  4. Run your Job. Verify that the file has been successfully written in your bucket by navigating to your Amazon S3 web page.
  5. Click your bucket to visualize its content:
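
As a rough illustration of what this part of the Job does, here is a minimal AWS SDK for Java (1.x) sketch of the same upload. It is not the tS3Put implementation; the bucket name, key, and file path are hypothetical.

    import java.io.File;
    import com.amazonaws.services.s3.AmazonS3;

    public class S3PutSketch {
        // Uploads a local file to the bucket under the given key,
        // the same operation the tS3Put component performs.
        public static void put(AmazonS3 s3) {
            s3.putObject(
                    "my-talend-demo-bucket",               // Bucket (hypothetical)
                    "input/customers.csv",                 // Key: object name in the bucket (hypothetical)
                    new File("C:/talend/customers.csv"));  // File: local path to copy (hypothetical)
        }
    }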

Getting files from an Amazon S3 bucket

Before you begin

The reverse operation is achieved with the tS3Get component.

Procedure

  1. Add a tS3Get component and configure it to copy a file from an S3 bucket to your local file system (a sketch with example values follows this procedure):
  2. Run the Job.

    Verify that the file has been successfully imported by browsing to your local file system:
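
In the same spirit, here is a minimal AWS SDK for Java (1.x) sketch of the download performed by tS3Get; again, the bucket name, key, and local path are hypothetical.

    import java.io.File;
    import com.amazonaws.services.s3.AmazonS3;
    import com.amazonaws.services.s3.model.GetObjectRequest;

    public class S3GetSketch {
        // Downloads an object from the bucket to a local file,
        // the same operation the tS3Get component performs.
        public static void get(AmazonS3 s3) {
            s3.getObject(
                    new GetObjectRequest("my-talend-demo-bucket", "input/customers.csv"),
                    new File("C:/talend/downloaded-customers.csv"));
        }
    }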