Loading/unloading data from/to Amazon S3 - 6.1

Talend Components Reference Guide

This scenario describes a Job that generates a delimited file and uploads it to Amazon S3, loads the data from that file on S3 into Redshift and displays it on the console, unloads the data from Redshift to files on S3 (one or more files per slice of the Redshift cluster), and finally lists and retrieves the unloaded files from S3.

Prerequisites:

The following context variables have been created and saved in the Repository tree view. For more information about context variables, see Talend Studio User Guide.

  • redshift_host: the connection endpoint URL of the Redshift cluster.

  • redshift_port: the listening port number of the database server.

  • redshift_database: the name of the database.

  • redshift_username: the username for the database authentication.

  • redshift_password: the password for the database authentication.

  • redshift_schema: the name of the schema.

  • s3_accesskey: the access key for accessing Amazon S3.

  • s3_secretkey: the secret key for accessing Amazon S3.

  • s3_bucket: the name of the Amazon S3 bucket.

Note that all context values above are for demonstration purposes only.
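
For illustration only, the context values could look similar to the following; all values here are hypothetical placeholders that you should replace with the details of your own Redshift cluster and S3 bucket.

    redshift_host     = examplecluster.abc123.us-west-2.redshift.amazonaws.com
    redshift_port     = 5439
    redshift_database = dev
    redshift_schema   = public
    redshift_username = awsuser
    redshift_password = **********
    s3_accesskey      = <your-access-key>
    s3_secretkey      = <your-secret-key>
    s3_bucket         = <your-bucket-name>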

Adding and linking components

  1. Create a new Job and apply all context variables listed above to the new Job.

  2. Add the following components by typing their names in the design workspace or dropping them from the Palette: a tRowGenerator component, a tRedshiftOutputBulk component, a tRedshiftBulkExec component, a tRedshiftInput component, a tLogRow component, a tRedshiftUnload component, a tS3List component, and a tS3Get component.

  3. Link tRowGenerator to tRedshiftOutputBulk using a Row > Main connection.

  4. Do the same to link tRedshiftInput to tLogRow.

  5. Link tS3List to tS3Get using a Row > Iterate connection.

  6. Link tRowGenerator to tRedshiftBulkExec using a Trigger > On Subjob Ok connection.

  7. Do the same to link tRedshiftBulkExec to tRedshiftInput, tRedshiftInput to tRedshiftUnload, and tRedshiftUnload to tS3List.
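
Once linked, the Job layout should look roughly like the following, with each subjob chained to the next through an OnSubjobOk connection:

    tRowGenerator --(Main)--> tRedshiftOutputBulk
        |  (OnSubjobOk)
    tRedshiftBulkExec
        |  (OnSubjobOk)
    tRedshiftInput --(Main)--> tLogRow
        |  (OnSubjobOk)
    tRedshiftUnload
        |  (OnSubjobOk)
    tS3List --(Iterate)--> tS3Get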

Configuring the components

Preparing a file and uploading the file to S3

  1. Double-click tRowGenerator to open its RowGenerator Editor.

  2. Click the [+] button to add two columns: ID of Integer type and Name of String type.

  3. Click the cell in the Functions column and select a function from the list for each column. In this example, select Numeric.sequence to generate sequence numbers for the ID column and TalendDataGenerator.getFirstName to generate random first names for the Name column (see the sketch after this list).

  4. In the Number of Rows for RowGenerator field, enter the number of data rows to generate. In this example, it is 20.

  5. Click OK to close the schema editor and accept the propagation prompted by the pop-up dialog box.

  6. Double-click tRedshiftOutputBulk to open its Basic settings view on the Component tab.

  7. In the Data file path at local field, specify the local path for the file to be generated. In this example, it is E:/Redshift/redshift_bulk.txt.

  8. In the Access Key field, press Ctrl + Space and from the list select context.s3_accesskey to fill in this field.

    Do the same to fill the Secret Key field with context.s3_secretkey and the Bucket field with context.s3_bucket.

  9. In the Key field, enter the name under which the generated file will be stored on Amazon S3. In this example, it is person_load.
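
For reference, the two routines selected in the tRowGenerator editor behave roughly as in the following Java sketch; the parameter values passed to Numeric.sequence are the defaults and are shown for illustration only.

    // Generates 1, 2, 3, ... for the ID column (sequence name, start value, step)
    int id = Numeric.sequence("s1", 1, 1);
    // Picks a first name from a built-in list for the Name column
    String name = TalendDataGenerator.getFirstName();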

Loading data from the file on S3 to Redshift

  1. Double-click tRedshiftBulkExec to open its Basic settings view on the Component tab.

  2. In the Host field, press Ctrl + Space and from the list select context.redshift_host to fill in this field.

    Do the same to fill:

    • the Port field with context.redshift_port,

    • the Database field with context.redshift_database,

    • the Schema field with context.redshift_schema,

    • the Username field with context.redshift_username,

    • the Password field with context.redshift_password,

    • the Access Key field with context.s3_accesskey,

    • the Secret Key field with context.s3_secretkey, and

    • the Bucket field with context.s3_bucket.

  3. In the Table Name field, enter the name of the table to be written. In this example, it is person.

  4. From the Action on table list, select Drop table if exists and create.

  5. In the Key field, enter the name of the file on Amazon S3 to be loaded. In this example, it is person_load.

  6. Click the [...] button next to Edit schema and in the pop-up window define the schema by adding two columns: ID of Integer type and Name of String type.
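
Behind the scenes, tRedshiftBulkExec loads the data by running a Redshift COPY command against the file uploaded to S3. Conceptually, the statement resembles the sketch below; the exact options (delimiter, region, and so on) depend on the component settings, and the credentials are placeholders.

    COPY person
    FROM 's3://<your-bucket>/person_load'
    CREDENTIALS 'aws_access_key_id=<access-key>;aws_secret_access_key=<secret-key>';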

Retrieving data from the table on Redshift

  1. Double-click tRedshiftInput to open its Basic settings view on the Component tab.

  2. Fill the Host, Port, Database, Schema, Username, and Password fields with their corresponding context variables.

  3. In the Table Name field, enter the name of the table to be read. In this example, it is person.

  4. Click the [...] button next to Edit schema and in the pop-up window define the schema by adding two columns: ID of Integer type and Name of String type.

  5. In the Query field, enter the following SQL statement, based on which the data will be retrieved (see the note after this list).

    "SELECT * FROM " + context.redshift_schema + ".person ORDER BY \"ID\""

  6. Double-click tLogRow to open its Basic settings view on the Component tab.

  7. In the Mode area, select Table (print values in cells of a table) for a better display of the result.
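
Note that, assuming the redshift_schema context variable were set to public (a hypothetical value), the query expression in step 5 would resolve at runtime to:

    SELECT * FROM public.person ORDER BY "ID"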

Unloading data from Redshift to file(s) on S3

  1. Double-click tRedshiftUnload to open its Basic settings view on the Component tab.

  2. Fill the Host, Port, Database, Schema, Username, and Password fields with their corresponding context variables.

    Also fill the Access Key, Secret Key, and Bucket fields with their corresponding context variables.

  3. In the Table Name field, enter the name of the table from which the data will be read. In this example, it is person.

  4. Click the [...] button next to Edit schema and in the pop-up window define the schema by adding two columns: ID of Integer type and Name of String type.

  5. In the Query field, enter the following SQL statement based on which the result will be unloaded.

    "SELECT * FROM person"
  6. In the Key prefix field, enter the name prefix for the unload files. In this example, it is person_unload_.
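
Under the hood, tRedshiftUnload issues a Redshift UNLOAD command, which writes the query result to S3 in parallel, producing one or more files per slice named after the key prefix. Conceptually, the statement resembles the sketch below; the credentials are placeholders.

    UNLOAD ('SELECT * FROM person')
    TO 's3://<your-bucket>/person_unload_'
    CREDENTIALS 'aws_access_key_id=<access-key>;aws_secret_access_key=<secret-key>';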

Retrieving files unloaded to Amazon S3

  1. Double-click tS3List to open its Basic settings view on the Component tab.

  2. Fill the Access Key and Secret Key fields with their corresponding context variables.

  3. From the Region list, select the AWS region where the unload files are created. In this example, it is US Standard.

  4. Clear the List all buckets objects check box, and click the [+] button under the table displayed to add one row.

    Fill in the Bucket name column with the name of the bucket in which the unload files are created. In this example, it is the context variable context.s3_bucket.

    Fill in the Key prefix column with the name prefix for the unload files. In this example, it is person_unload_.

  5. Double-click tS3Get to open its Basic settings view on the Component tab.

  6. Fill the Access Key field and Secret Key field with their corresponding context variables.

  7. From the Region list, select the AWS region where the unload files are created. In this example, it is US Standard.

  8. In the Bucket field, enter the name of the bucket in which the unload files are created. In this example, it is the context variable context.s3_bucket.

    In the Key field, press Ctrl + Space and from the list select the global variable ((String)globalMap.get("tS3List_1_CURRENT_KEY")), which holds the name of the unload file being processed in the current iteration.

  9. In the File field, enter the local path where the unload files are saved. In this example, it is "E:/Redshift/" + ((String)globalMap.get("tS3List_1_CURRENT_KEY")).
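
For clarity, the Iterate connection makes tS3List expose the key of the S3 object currently being processed through the global variable used above. For each unload file listed, the tS3Get expressions therefore resolve roughly as in this Java sketch (the example key matches the files produced by this Job):

    // Key of the current S3 object, e.g. "person_unload_0000_part_00"
    String key = (String) globalMap.get("tS3List_1_CURRENT_KEY");
    // Local path the file is saved to, e.g. "E:/Redshift/person_unload_0000_part_00"
    String localFile = "E:/Redshift/" + key;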

Saving and executing the Job

  1. Press Ctrl + S to save the Job.

  2. Execute the Job by pressing F6 or clicking Run on the Run tab.

    As a result, the generated data is written to the local file redshift_bulk.txt and the file is uploaded to S3 under the new name person_load. The data is then loaded from the file on S3 into the table person in Redshift and displayed on the console. After that, the data is unloaded from the table person in Redshift to two files, person_unload_0000_part_00 and person_unload_0001_part_00, on S3 (one per slice of the Redshift cluster), and finally the unloaded files on S3 are listed and retrieved into the local folder.