AWS - Integrating Talend Data Integration with S3 and Lambda
Amazon S3 (Simple Storage Service) is the object storage service of Amazon Web Services. It is widely used by customers, and Talend provides out-of-the-box connectivity with S3. AWS Lambda is another service, which lets you run code without provisioning or managing servers. This model is known as serverless computing.
In this article, we will demonstrate how to integrate Talend Data Integration with AWS S3 and AWS Lambda. We will build an event-driven architecture where an end-user drops a file in S3, and S3 notifies a Lambda function which triggers the execution of a Talend Job to process the S3 file.
Architecture
- A CSV file is uploaded into an S3 bucket.
- S3 sends a notification by invoking a Lambda function.
- The Lambda function invokes the execution of a Talend Job through Talend Administration Center HTTP API (MetaServlet API).
- Talend Administration Center launches the Talend Job on a Talend JobServer.
- The Talend Job downloads the CSV file from S3, converts it to XML, then uploads the result back to S3.
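The third step, where the Lambda function asks Talend Administration Center to run the Job, can be sketched as follows. The MetaServlet API takes a JSON command that is Base64-encoded and appended to the metaServlet URL as its query string. This sketch only builds that URL and does not send it. The host name, credentials, and task ID are placeholders, and the runTask parameter names (authUser, authPass, taskId, mode) are assumptions to verify against the MetaServlet help of your Talend Administration Center version.

```java
import java.nio.charset.StandardCharsets;
import java.util.Base64;

// Sketch only: builds a MetaServlet URL for the "runTask" action.
// Parameter names are assumed, not taken from the article.
final class MetaServletRequest {

    /** Builds the JSON command. Values are not JSON-escaped, so keep them simple. */
    public static String buildRunTaskJson(String user, String password, int taskId) {
        return "{\"actionName\":\"runTask\""
                + ",\"authUser\":\"" + user + "\""
                + ",\"authPass\":\"" + password + "\""
                + ",\"taskId\":" + taskId
                + ",\"mode\":\"asynchronous\"}";
    }

    /** Appends the Base64-encoded JSON command to the metaServlet endpoint. */
    public static String buildUrl(String tacBaseUrl, String json) {
        String encoded = Base64.getEncoder()
                .encodeToString(json.getBytes(StandardCharsets.UTF_8));
        return tacBaseUrl + "/metaServlet?" + encoded;
    }

    public static void main(String[] args) {
        // Hypothetical host, user, password, and task ID.
        String json = buildRunTaskJson("admin@example.com", "secret", 1);
        System.out.println(
                buildUrl("http://tac.example.com:8080/org.talend.administrator", json));
    }
}
```

Inside the Lambda function, the resulting URL would then be requested with an HTTP GET. Asynchronous mode is assumed here so that the function can return before the Job finishes.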
Assumptions
- Amazon Web Services (AWS):
- You should be familiar with the AWS platform, since this article does not go into detail on the administration and management of AWS services. Refer to Amazon Web Services (AWS) - Getting Started for an overview of the AWS functionality that Talend supports.
- You should also have full access to the main AWS services described in the Prerequisites section below.
- Talend
You should be familiar with Installation and Management of Talend Data Integration.
- Eclipse
You should be familiar with Eclipse and Java, since we will use the AWS Toolkit for Eclipse to develop the Lambda function.
Environment
This demonstration is based on AWS Cloud Platform and Talend Data Integration.
Prerequisites
- A valid AWS account with full access to the following services:
- Amazon Elastic Compute Cloud - https://aws.amazon.com/ec2
- Amazon Simple Storage Service - https://aws.amazon.com/s3
- AWS Lambda - https://aws.amazon.com/lambda
- Amazon CloudWatch - https://aws.amazon.com/cloudwatch
- Valid AWS access keys to programmatically access AWS services:
Read the documentation at http://docs.aws.amazon.com/en_en/general/latest/gr/managing-aws-access-keys.html to learn how to create, manage, and use AWS access keys.
- AWS Toolkit for Eclipse
Follow the online documentation at http://docs.aws.amazon.com/toolkit-for-eclipse/v1/user-guide/setup-install.html to install the AWS toolkit on your laptop. This will be used to develop the Lambda Java function.
- Talend Data Integration (Commercial Edition) - https://www.talend.com/products/data-integration
- Talend Studio
- Talend Administration Center
- Talend JobServer
Choose an AWS Region
Procedure
Create a bucket
Procedure
Deploy Talend Administration Center on AWS EC2
Procedure
Deploy Talend JobServer on AWS EC2
As described in the architecture, a Talend JobServer will be used to execute the Talend Job which downloads CSV files from S3 to convert them into XML files.
Now let’s launch an EC2 instance and deploy Talend JobServer on AWS.
Procedure
Import the Job in Talend Studio
A sample Job has been provided to test the whole architecture.
This is a very simple Job which performs the following:
- Connect to S3 using provided access key credentials
- Create temporary files
- Download the S3 CSV file from folder input to a local temporary file
- Read the temporary CSV file then convert it to a temporary local XML file
- Upload the temporary XML file back to S3 into folder output
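To make the conversion step concrete, here is a minimal plain-Java sketch of turning CSV lines into XML. It is not how the sample Job is implemented, since the Job relies on Talend components. The layout is assumed: the first line is a header whose column names are valid XML element names, fields are comma-separated without quoting, and the rows/row element names are invented for illustration.

```java
import java.util.Arrays;
import java.util.List;

// Illustration only: the CSV layout and the XML element names are assumptions.
final class CsvToXmlSketch {

    /** Escapes the characters that are not allowed in XML text content. */
    public static String escape(String s) {
        return s.replace("&", "&amp;").replace("<", "&lt;").replace(">", "&gt;");
    }

    /** Converts CSV lines (header first) into a simple XML document. */
    public static String convert(List<String> csvLines) {
        String[] header = csvLines.get(0).split(",", -1);
        StringBuilder xml = new StringBuilder("<rows>\n");
        for (String line : csvLines.subList(1, csvLines.size())) {
            String[] values = line.split(",", -1);
            xml.append("  <row>");
            // One child element per column, named after the header field.
            for (int i = 0; i < header.length && i < values.length; i++) {
                String name = header[i].trim();
                xml.append('<').append(name).append('>')
                   .append(escape(values[i].trim()))
                   .append("</").append(name).append('>');
            }
            xml.append("</row>\n");
        }
        return xml.append("</rows>\n").toString();
    }

    public static void main(String[] args) {
        System.out.print(convert(Arrays.asList("id,name", "1,Ann & Bob")));
    }
}
```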
Let's test the Job in the local Studio before deploying it to the cloud.
Procedure
Build and deploy the Job
In the previous step, you ran the Job in your local Studio to test it. The goal, however, is to deploy this Job to the Talend environment hosted on AWS, so that it runs automatically whenever a file is uploaded to the input folder on S3.
Let’s build and deploy the Job.
Procedure
Create a Java Lambda function
To create a Lambda function, we will use the AWS Toolkit for Eclipse.
Download and set up the AWS Toolkit for Eclipse on your laptop using the instructions at https://aws.amazon.com/en/eclipse/.
Once configured, use Eclipse to create a new Lambda Java project:
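Inside such a project, the handler first has to find out which file was uploaded. The helper below is a minimal sketch of that step, assuming the handler receives the S3 notification deserialized as a Map<String, Object>. The class and method names are hypothetical, and the "input/" key prefix is an assumption about how the input folder appears in object keys. S3 URL-encodes object keys in notifications, so the key is decoded before use.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLDecoder;
import java.util.List;
import java.util.Map;

// Hypothetical helper: reads the bucket and key from an S3 event notification.
final class S3EventParser {

    /** Returns {bucketName, objectKey} for the first record of the event. */
    @SuppressWarnings("unchecked")
    public static String[] firstObject(Map<String, Object> event) {
        List<Map<String, Object>> records =
                (List<Map<String, Object>>) event.get("Records");
        Map<String, Object> s3 = (Map<String, Object>) records.get(0).get("s3");
        Map<String, Object> bucket = (Map<String, Object>) s3.get("bucket");
        Map<String, Object> object = (Map<String, Object>) s3.get("object");
        try {
            // Keys arrive URL-encoded, for example a space is sent as '+'.
            String key = URLDecoder.decode((String) object.get("key"), "UTF-8");
            return new String[] { (String) bucket.get("name"), key };
        } catch (UnsupportedEncodingException e) {
            throw new IllegalStateException("UTF-8 is always supported", e);
        }
    }

    /** True for files under the assumed "input/" prefix, false for folder markers. */
    public static boolean isInputFile(String key) {
        return key.startsWith("input/") && !key.endsWith("/");
    }
}
```

Checking the prefix guards against re-triggering the Job when it writes its result to the output folder, in case the bucket notification is not already restricted to the input folder.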