Adding a dataset from Amazon S3 - 2.1

Talend Data Preparation User Guide

author: Talend Documentation Team
EnrichVersion: 6.4, 2.1
EnrichProdName: Talend Big Data, Talend Big Data Platform, Talend Data Fabric, Talend Data Integration, Talend Data Management Platform, Talend Data Services Platform, Talend ESB, Talend MDM Platform, Talend Real-Time Big Data Platform
task: Data Quality and Preparation > Cleansing data
EnrichPlatform: Talend Data Preparation

Talend Data Preparation can connect to various data sources to create new datasets.

In this example, you want to prepare customer data that is stored on Amazon S3. You will enter your Amazon S3 connection information directly in the Talend Data Preparation interface and create a new dataset from this data.

Procedure

  1. In the Datasets view of the Talend Data Preparation homepage, click the white arrow next to the Add Dataset button.
  2. Select From Amazon S3.

    The Add an Amazon S3 dataset form opens.

  3. In the Dataset name field, enter the name you want to give your dataset, for example Amazon S3 dataset.
  4. Select the Specify AWS credentials check box.

    For the sake of this example, you will select the check box, but Amazon recommends specifying your credentials using one of the methods listed on the Using the Default Credential Provider Chain page. With one of these methods, you do not have to enter your AWS credentials manually each time, and you can leave the check box unselected.

    The Amazon ECS container credentials method from this page is not supported for Talend Data Preparation.

    If you use one of these methods, the credentials must be set up on the Components Catalog server, as well as on the Spark Job Server if you are using Talend Data Preparation with Big Data.

  5. Enter your Amazon S3 access key and secret key in the corresponding fields.

  6. Click Test connection.

    If the connection is successful, the second part of the form is displayed, where you can select the object to import. If the connection is not successful, an error message is displayed, detailing why the connection failed.

  7. From the Region and Bucket drop-down lists, select the location of your data in Amazon S3.

    You can specify a custom value for the Region field.

  8. In the Object field, enter the path to the dataset to import from your bucket.
  9. Select the format, record delimiter and field delimiter of your data in the corresponding drop-down lists.
  10. Click the Add dataset button at the end of the form.
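
As an illustration of the alternative mentioned in step 4, one of the methods in the Default Credential Provider Chain is a shared credentials file, conventionally located at `~/.aws/credentials`. The Python sketch below writes such a file. The function name `write_shared_credentials` and the placeholder key values are hypothetical and are not part of Talend Data Preparation. The sketch assumes that you chose the shared credentials file method and that the file is created for the user account running the server concerned.

```python
import configparser
import os
from pathlib import Path


def write_shared_credentials(path, access_key_id, secret_access_key, profile="default"):
    """Write (or update) one profile in an AWS shared credentials file.

    Hypothetical helper for illustration only; the keys passed in are
    supplied by the caller and never hard-coded here.
    """
    path = Path(path)
    # interpolation=None keeps a '%' in a key from being treated specially.
    parser = configparser.ConfigParser(interpolation=None)
    if path.exists():
        parser.read(path)  # keep profiles that are already defined
    parser[profile] = {
        "aws_access_key_id": access_key_id,
        "aws_secret_access_key": secret_access_key,
    }
    path.parent.mkdir(parents=True, exist_ok=True)
    with path.open("w") as handle:
        parser.write(handle)
    # Restrict the file to its owner where the operating system supports it.
    os.chmod(path, 0o600)
    return path
```

For example, `write_shared_credentials(Path.home() / ".aws" / "credentials", "YOUR_ACCESS_KEY_ID", "YOUR_SECRET_ACCESS_KEY")` would create or update the `default` profile, after which the Specify AWS credentials check box could be left unselected.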

Results

When the import is complete, the data extracted from Amazon S3 opens directly in the grid, and you can start working on your preparation as you would with any other dataset.

The data remains stored in Amazon S3; Talend Data Preparation only retrieves a sample on demand.

The dataset is added to the list in the Datasets view of the homepage.
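
To make the idea of a sample and the role of the record and field delimiters more concrete, the following Python sketch reads only the first few rows from a delimited text stream. It does not reproduce how Talend Data Preparation actually samples data, which this page does not describe. The names `iter_records` and `read_sample` are hypothetical, and the in-memory stream stands in for an object stored in Amazon S3.

```python
import csv
import io
from itertools import islice


def iter_records(stream, record_delimiter="\n", chunk_size=4096):
    """Yield records from a text stream one at a time, split on record_delimiter."""
    buffer = ""
    while True:
        chunk = stream.read(chunk_size)
        if not chunk:
            break
        buffer += chunk
        # Everything before the last delimiter is complete; keep the rest buffered.
        *complete, buffer = buffer.split(record_delimiter)
        yield from complete
    if buffer:
        yield buffer  # final record without a trailing delimiter


def read_sample(stream, field_delimiter=";", record_delimiter="\n", max_rows=5):
    """Return at most max_rows parsed rows, stopping early instead of reading everything."""
    records = iter_records(stream, record_delimiter)
    reader = csv.reader(records, delimiter=field_delimiter)
    return list(islice(reader, max_rows))


if __name__ == "__main__":
    # Made-up customer data used only to exercise the sketch.
    data = io.StringIO("id;name\n1;Ada\n2;Grace\n3;Alan\n")
    for row in read_sample(data, max_rows=2):
        print(row)
```

Because the records are produced lazily, the stream is consumed one chunk at a time and reading stops once `max_rows` rows have been parsed. This simple splitting does not handle quoted fields that contain the record delimiter.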