Using Amazon Redshift with Talend Integration Cloud
Amazon Redshift is a hosted data warehouse product that is part of the larger Amazon Web Services cloud computing platform. It is built on technology from the massively parallel processing (MPP) data warehouse ParAccel (later acquired by Actian). Redshift differs from Amazon's other hosted database offering, Amazon RDS, in that it is designed for analytics workloads on large-scale datasets: it stores data in a column-oriented fashion and distributes query execution across many nodes using massively parallel processing. The result is a fully managed, petabyte-scale data warehouse service in the cloud.
In this demo example, we will set up a Redshift cluster and execute queries to create, populate, and search a database using Talend Integration Cloud. In the last section we will explore how to develop custom actions against Redshift and make them available in Talend Integration Cloud.
Amazon Redshift Configuration
Log in to your AWS account, select Redshift in the Database section, and click 'Launch Cluster' to create a Redshift cluster.
Customize the cluster settings; for this demo we keep the standard port and accept all default settings.
After creating the Redshift cluster, wait until it becomes available. Once setup and server initialization succeed, the console displays the connection string for the database.
Install 'Aginity' Workbench or any other database tool and connect to AWS Redshift. This confirms that the server is ready and lets you browse existing tables and data. If you run into connection problems, adjust the Security Group settings to allow your source IP address.
After connecting successfully, explore the database. In the next section we will create a database and populate it.
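The connection can also be verified programmatically. Redshift speaks the PostgreSQL wire protocol, so any PostgreSQL driver (psycopg2, for example) can connect to it. The sketch below only assembles a libpq-style connection string; the endpoint, database name, and credentials are placeholders, not real values.

```python
# Minimal sketch: build a connection string for a Redshift cluster.
# The endpoint, database, and credentials are placeholders -- substitute
# the values shown on your cluster's detail page in the AWS console.

def redshift_dsn(host: str, dbname: str, user: str, password: str,
                 port: int = 5439) -> str:
    """Assemble a libpq-style DSN; 5439 is Redshift's default port."""
    return (f"host={host} port={port} dbname={dbname} "
            f"user={user} password={password}")

dsn = redshift_dsn(
    host="examplecluster.abc123xyz.us-east-1.redshift.amazonaws.com",
    dbname="dev",
    user="awsuser",
    password="example-password",
)
print(dsn)

# Against a live cluster you would then connect with a PostgreSQL driver:
#   import psycopg2
#   conn = psycopg2.connect(dsn)
```

If the connection hangs instead of failing, that is usually the Security Group issue mentioned above rather than a credentials problem.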
Sample Job for Talend Integration Cloud
Step 1: Import Redshift Video Rental Example Actions
- Log in to Talend Integration Cloud.
- In the Manage section, click on Actions/Jobs and select Import Action from Exchange.
- Search for the keyword Redshift.
- Add the actions redshift_videoRental_createDatabase_source, redshift_videoRental_populateDatabase_process_step, redshift_videoRental_searchVideo_process_step, and redshift_videoRental_finish_target.
Step 2: Create a Redshift Connection
- Under the Connection section, create a new Redshift connection and configure it with the values from the Amazon Redshift cluster created above.
Step 3: Create a new Flow
- Create a new Flow and add the newly imported actions to the flow as shown below.
- The order of the actions in the flow is: redshift_videoRental_createDatabase_source --> redshift_videoRental_populateDatabase_process_step --> redshift_videoRental_searchVideo_process_step --> redshift_videoRental_finish_target.
- Select the target component and click on auto mapping. Check that all fields are mapped correctly.
- Click on the redshift_videoRental_searchVideo_process_step Action Search Component and type 'Pulp Fiction' in the Title search field as shown below.
- Click on Test to run the flow.
- This flow will create a database in Amazon Redshift and populate it with sample data. It will then search the database for 'Pulp Fiction' and send the result to the target action.
- Click on Preview Data on the redshift_videoRental_finish_target action to see the results of the search.
Step 4: Verify Results
- Use a database client like 'Aginity' to verify the results of executing the flow. You will see the structure of the database that was created and can review the tables and the data they contain.
- The redshift_videoRental_createDatabase_source action creates this database structure.
- The redshift_videoRental_populateDatabase_process_step action populates the table with data.
- Search for the film 'Pulp Fiction' by executing a SELECT query as shown below.
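The verification query can be approximated in plain SQL. The table and column names below ('film', 'title') are assumptions about the video-rental demo schema; check the actual tables in your database client before running it.

```python
# Hypothetical search query against the video-rental demo schema.
# "film" and "title" are assumed names -- verify them in your database client.
query = "SELECT * FROM film WHERE title = %s;"
params = ("Pulp Fiction",)

# Render the final statement for illustration only; with a real driver the
# parameters should be passed separately, never interpolated by hand.
print(query % tuple(repr(p) for p in params))

# With a psycopg2 connection `conn` you would execute it safely as:
#   with conn.cursor() as cur:
#       cur.execute(query, params)
#       rows = cur.fetchall()
```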
Step 5: Clean up
- Shut down your Redshift Cluster after you are finished using it to avoid incurring additional charges. Note that shutting down the cluster will wipe out your database and data.
You can build your own actions to manage your cluster using the Talend tAmazonRedshiftManage component. This component allows you to start and shut down your Amazon Redshift cluster. You can build your own actions and execute them as part of an Execution Plan in Talend Integration Cloud. The Execution Plan will then start your Redshift cluster, run the flow(s) to populate it with data and query the data, and shut down the cluster at the end.
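Outside Talend, the same lifecycle can be driven with the AWS SDK. The sketch below is a rough boto3 equivalent of what such an action does; the cluster and snapshot identifiers are placeholders, the calls need valid AWS credentials, and running them affects a real (billable) cluster.

```python
# Rough boto3 sketch of cluster start/shutdown (placeholder identifiers).
# boto3 is imported lazily so the sketch can be read without it installed.

def shutdown_cluster(cluster_id: str, snapshot_id: str) -> None:
    """Delete the cluster, keeping a final snapshot so data can be restored."""
    import boto3
    redshift = boto3.client("redshift")
    redshift.delete_cluster(
        ClusterIdentifier=cluster_id,
        SkipFinalClusterSnapshot=False,
        FinalClusterSnapshotIdentifier=snapshot_id,
    )

def start_cluster(cluster_id: str, snapshot_id: str) -> None:
    """Recreate the cluster from the snapshot taken at shutdown."""
    import boto3
    redshift = boto3.client("redshift")
    redshift.restore_from_cluster_snapshot(
        ClusterIdentifier=cluster_id,
        SnapshotIdentifier=snapshot_id,
    )
```

Deleting with a final snapshot and restoring later matches the tutorial's cost advice: you pay for snapshot storage instead of a running cluster.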
Custom Action using Studio
The example in the previous section demonstrates how to use Talend Integration Cloud actions to connect to, load, and query data in an Amazon Redshift cluster. The example actions were all built using the Talend Studio that comes with the Talend Integration Cloud platform, and they ship with a predefined database structure and dataset. For your own usage, you can import these actions into your Talend Studio and customize them to your requirements, or you can build your own action from scratch.
Step 1: Import the Talend Integration Cloud Demo Project
- Start the Studio
- Import a demo project
- Select the Integration Cloud Demos
Step 2: Configure your Talend Integration Cloud Preferences
- Go to your Talend Studio Preferences and configure the Account Username and Password for Talend Integration Cloud
- Click on Test Connection to make sure that your studio can connect to the cloud service
Step 3: Review existing Redshift Actions
- Browse the Repository and review the example actions available in the demo project for Redshift.
- Open an example action and review how it has been designed.
Step 4: Design and Test your Actions
- You can duplicate the example actions and modify them to fit your requirements.
- Use the tActionInput and tActionOutput components to pass data in and out of your action. Adding a tActionInput means the action can be either a Step or a Target action; adding a tActionOutput means it can be either a Source or a Step action. If a job contains both, it can only be a Step action.
- Configure the Schema of the tActionInput and tActionOutput components.
- You can use all the constructs of the Talend Studio, such as context parameters, to make your query dynamic as shown below. The context parameter will appear in the cloud UI as a parameter.
- Test your action in the Studio
Step 5: Publish the action
- You can publish your action to the cloud when it is ready. Right click on the job and click on 'Publish to Cloud'.
Step 6: Design your Flow
- Once your action is published, you can log into Talend Integration Cloud and build a flow with your newly published action.
Talend Integration Cloud provides great flexibility to build re-usable actions that can leverage the hundreds of connectors available in the Talend Studio. It allows businesses to create their own re-usable actions to perform ETL with no infrastructure cost.
Redshift Components in Talend Studio
Talend Studio has 12 components for managing and working with Amazon Redshift. A Talend Integration Cloud developer will leverage these components to create re-usable Actions for Amazon Redshift in Talend Integration Cloud.
tAmazonRedshiftManage manages the work of creating a new Amazon Redshift cluster, creating a snapshot of an Amazon Redshift cluster, and deleting an existing cluster or snapshot.
The tRedshiftOutputBulk and tRedshiftBulkExec components can be used together in a two-step process to load data into Amazon Redshift from a delimited/CSV file on Amazon S3. In the first step, a delimited/CSV file is generated. In the second step, this file is referenced by the COPY statement used to feed Amazon Redshift. These two steps are fused together in the tRedshiftOutputBulkExec component. The advantage of using two separate steps is that the data can be transformed before it is loaded into Amazon Redshift.
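The two-step pattern can be sketched outside Talend as: write a delimited file, then issue Redshift's COPY command pointing at its S3 location. The table name, bucket, and IAM role ARN below are placeholders.

```python
# Step 1: generate the delimited/CSV file (tRedshiftOutputBulk would then
# upload it to S3; here it is built in memory for illustration).
import csv
import io

buf = io.StringIO()
writer = csv.writer(buf)
writer.writerow([1, "Pulp Fiction", 1994])

# Step 2: the COPY statement tRedshiftBulkExec would run against the cluster.
# Table name, bucket, and IAM role are placeholders.
copy_sql = (
    "COPY film FROM 's3://example-bucket/film.csv' "
    "IAM_ROLE 'arn:aws:iam::123456789012:role/example-redshift-role' "
    "CSV;"
)
print(copy_sql)
```

Splitting the steps is what makes the transformation window possible: anything can be done to the rows between writing the file and issuing the COPY.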
tRedshiftClose closes an active connection to the database, typically once the transaction has been committed.
tRedshiftCommit, using a unique connection, commits a global transaction in one go instead of committing every row or every batch, and thus provides a gain in performance. Note that if you need to commit each statement as an individual transaction, you have to use the Auto Commit function available in the connection component.
tRedshiftConnection opens a connection to the database for a current transaction.
tRedshiftInput executes a DB query whose field order must strictly correspond to the schema definition, then passes the field list on to the next component through a Main row link.
tRedshiftOutput executes the action defined on the table and/or on the data of a table, according to the input flow from the previous component.
tRedshiftOutputBulk receives data from the preceding component, generates a single delimited/CSV file, and then uploads the file to Amazon S3. It is the first half of the two-step load process described above, typically followed by tRedshiftBulkExec.
tRedshiftOutputBulkExec receives data from the preceding component, generates a single delimited/CSV file, uploads the file to Amazon S3, and finally loads the data from Amazon S3 into Redshift, fusing the two steps described above into a single component.
tRedshiftRollback rolls back the transaction in the connected DB, avoiding an involuntary partial commit.
tRedshiftRow is the generic query component for this database. It executes the stated SQL query on the specified database. The Row suffix means the component implements a flow in the job design, although it does not provide output.
tRedshiftUnload runs a specified query in Amazon Redshift and then unloads the result of the query to one or more files on Amazon S3.
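Under the hood this corresponds to Redshift's UNLOAD statement. A sketch with placeholder table, bucket, and IAM role names:

```python
# Placeholder UNLOAD statement of the kind this component issues: Redshift
# runs the inner SELECT and writes the result set to files on S3.
# Single quotes inside UNLOAD's quoted query text must be doubled.
inner_query = "SELECT * FROM film WHERE title = ''Pulp Fiction''"

unload_sql = (
    f"UNLOAD ('{inner_query}') "
    "TO 's3://example-bucket/search_results_' "
    "IAM_ROLE 'arn:aws:iam::123456789012:role/example-redshift-role' "
    "DELIMITER ',' ALLOWOVERWRITE;"
)
print(unload_sql)
```

The `TO` value is a prefix, not a single filename: Redshift writes one output file per slice unless told otherwise, which is why the component speaks of "one or more files".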