AWS - Integrating Talend Data Integration with S3 and Lambda

author
Amadou Merico
EnrichVersion
6.4
6.3
EnrichProdName
Talend Real-Time Big Data Platform
Talend Data Integration
Talend Data Fabric
Talend Big Data
Talend Big Data Platform
Talend ESB
Talend Data Services Platform
Talend Data Management Platform
Talend MDM Platform
task
Design and Development
Installation and Upgrade
Data Quality and Preparation
Administration and Monitoring
Deployment
Data Governance
EnrichPlatform
Talend Administration Center
Talend Runtime
Talend JobServer
Talend Studio

AWS Simple Storage Service (S3) is the widely used storage service of Amazon Web Services, and Talend provides out-of-the-box connectivity with it. AWS Lambda is another service, which lets you run code without provisioning or managing servers. This is called serverless computing.

In this article, we will demonstrate how to integrate Talend Data Integration with AWS S3 and AWS Lambda. We will build an event-driven architecture in which an end user drops a file in S3, S3 notifies a Lambda function, and the Lambda function triggers the execution of a Talend Job that processes the S3 file.

Architecture
  1. A CSV file is uploaded into an S3 bucket.
  2. S3 sends a notification by invoking a Lambda function.
  3. The Lambda function invokes the execution of a Talend Job through the Talend Administration Center HTTP API (MetaServlet API).
  4. Talend Administration Center launches the Talend Job on a Talend Job Server.
  5. The Talend Job downloads the CSV file from S3, processes it, then uploads the result back to S3.
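
Step 2 can be illustrated with a minimal sketch of how a Lambda function might read the S3 notification it receives. This article builds the function in Java with the AWS Toolkit for Eclipse; the Python below is only an illustration of the event shape, and the function names and the filter on input/ and .csv are assumptions made for this example.

```python
from urllib.parse import unquote_plus


def extract_s3_objects(event):
    """Return (bucket, key) pairs from an S3 event notification payload."""
    pairs = []
    for record in event.get("Records", []):
        s3_info = record["s3"]
        bucket = s3_info["bucket"]["name"]
        # Object keys arrive URL-encoded (spaces become '+'), so decode them.
        key = unquote_plus(s3_info["object"]["key"])
        pairs.append((bucket, key))
    return pairs


def lambda_handler(event, context):
    """Hypothetical entry point: keep only CSV files dropped under input/."""
    uploads = [
        (bucket, key)
        for bucket, key in extract_s3_objects(event)
        if key.startswith("input/") and key.lower().endswith(".csv")
    ]
    # A real function would now call the Talend Administration Center
    # MetaServlet API for each upload (step 3); that call is omitted here.
    return {"files_to_process": uploads}
```

Filtering on the input/ prefix matters in this architecture because the Job writes its result back to the same bucket; without it, the output file could trigger the function again.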
Assumptions
  1. Amazon Web Services (AWS):
    • You should be familiar with the AWS platform, since this article does not go into detail about the administration and management of AWS services. You can refer to Amazon Web Services (AWS) - Getting Started to read about the AWS functionality that Talend supports.
    • You should also have full access to the main AWS services described in the Prerequisites section below.
  2. Talend:
    • You should be familiar with the installation and management of Talend Data Integration.
  3. Eclipse:
    • You should be familiar with Eclipse and Java, since we will be using the AWS Toolkit for Eclipse to develop a Lambda function.
Environment

This demonstration is based on AWS Cloud Platform and Talend Data Integration 6.3.

Prerequisites
  1. A valid AWS account with full access to the following services:
  2. Valid AWS Access Keys to programmatically access AWS services
  3. AWS Toolkit for Eclipse
  4. Talend Data Integration (Commercial Edition) - https://fr.talend.com/products/data-integration
    • Talend Studio
    • Talend Administration Center
    • Talend Job Server
Setup

Choose an AWS Region

Connect to AWS console.

Choose a region where AWS Lambda is available, since Lambda is not offered in all regions at this time.

Other considerations when choosing an AWS region include latency and compliance requirements.

We chose the Ireland region for this demonstration.

Create a bucket

If you are not familiar with S3 buckets, it may help to read the documentation before going ahead: http://docs.aws.amazon.com/en_en/AmazonS3/latest/dev/UsingBucket.html

We will be using the new S3 console for this demonstration.

Follow the steps below:

  1. Connect to S3 console
  2. Click Create bucket
  3. Name and Region:
    • Bucket name = talend-lambda-demos
    • Region = EU (Ireland)
  4. Click Next
  5. Set properties: keep default values.
  6. Click Next
  7. Set permissions:
    1. Click Manage public permissions
    2. Add Read/write permissions on Objects for Any Authenticated AWS user
  8. Click Next
  9. Review
  10. Click Create bucket.

The bucket is created!

Now, let's create two folders inside the bucket:

  • input folder: files will be dropped into this folder.
  • output folder: Talend Job will write results into this folder.

In order to do so:

  • Click Create folder, type input as the name, then click Save.
  • Click Create folder again, type output as the name, then click Save.

Congratulations! You have successfully created your bucket containing two folders in the Ireland region.
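
The same bucket and folders could also be created programmatically. The following is a minimal sketch, assuming the boto3 SDK (not used elsewhere in this article) and a client passed in by the caller; the function name is hypothetical. The region code eu-west-1 corresponds to EU (Ireland). Permissions are not set here and would still need to be configured as in step 7.

```python
def create_demo_bucket(s3_client, bucket="talend-lambda-demos", region="eu-west-1"):
    """Create the bucket plus empty 'input/' and 'output/' folder markers."""
    s3_client.create_bucket(
        Bucket=bucket,
        CreateBucketConfiguration={"LocationConstraint": region},
    )
    folders = ("input/", "output/")
    for folder in folders:
        # S3 has no real folders; a zero-byte object whose key ends in '/'
        # is what the console displays as a folder.
        s3_client.put_object(Bucket=bucket, Key=folder, Body=b"")
    return [f"s3://{bucket}/{folder}" for folder in folders]
```

With boto3 installed and credentials configured, the call would look like create_demo_bucket(boto3.client("s3", region_name="eu-west-1")). Bucket names are globally unique, so you will likely need a name other than talend-lambda-demos.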

Deploy Talend Administration Center on AWS EC2

Let’s deploy Talend Administration Center, the main component of this architecture, on AWS. If you are not familiar with Talend Administration Center, please read the documentation first.

Launch an EC2 instance
  1. Connect to EC2 console in Ireland region
  2. Click Launch Instance
  3. Choose AMI: Amazon Linux AMI 2016.09.1 64 bit is a good fit for this demo.
  4. Select t2.medium, which is also a good fit for this demo.
  5. Next: Configure Instance Details:
    • Choose your default VPC
    • Enable Auto-assign Public IP
    • Keep other settings with default values

  6. Next: Add Storage.
    • Size = 16GiB
  7. Next: Add Tags.
    • Name = Talend Administration Center
  8. Next: Configure Security Group.
    • Add an SSH rule with values:
      1. Type = ssh
      2. Protocol = TCP
      3. Port range = 22
      4. Source = Custom 0.0.0.0/0
    • Add a custom TCP rule with values:
      1. Type = Custom TCP Rule
      2. Protocol = TCP
      3. Port range = 8080
      4. Source = Custom 0.0.0.0/0
  9. Review and Launch
    • Ignore the security group warning for now, although you should avoid opening ports to 0.0.0.0/0 in real environments.

      Take note of the security group ID, as it will be needed when creating the security group of the Job Server.

  10. Launch the instance.
Install Talend Administration Center

Follow the Installing and configuring Talend Administration Center section of the Installation Guide to install and configure the Talend Administration Center application on the EC2 instance you just launched.

Once you have installed Talend Administration Center on the EC2 instance, verify that you can connect to Talend Administration Center Web interface using your favorite web browser.

The Talend Administration Center URL to use should look like this:

http://<PUBLIC_IP_OF_YOUR_EC2_INSTANCE>:8080/org.talend.administrator or http://<PUBLIC_IP_OF_YOUR_EC2_INSTANCE>:8080/tac

where <PUBLIC_IP_OF_YOUR_EC2_INSTANCE> is the public IP of the EC2 instance hosting Talend Administration Center. This information is available in the instance details on the AWS EC2 console.

We will keep on using the default credentials admin@company.com/admin .
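
The URL and credentials above are also what the Lambda function needs in order to call the MetaServlet API mentioned in the architecture. As a minimal sketch, assuming the MetaServlet convention of appending a Base64-encoded JSON action to the metaServlet path: the helper name, the IP address, the task id, and the context value are placeholders, and the field names of the runTask action should be verified against the MetaServlet help output of your Talend Administration Center version.

```python
import base64
import json


def build_metaservlet_url(tac_base_url, action):
    """Encode a MetaServlet action as Base64 JSON appended to the servlet URL."""
    payload = json.dumps(action, sort_keys=True).encode("utf-8")
    encoded = base64.b64encode(payload).decode("ascii")
    return f"{tac_base_url.rstrip('/')}/metaServlet?{encoded}"


# Hypothetical values: the public IP, task id and context are placeholders.
run_task = {
    "actionName": "runTask",
    "authUser": "admin@company.com",
    "authPass": "admin",
    "taskId": 1,
    "mode": "asynchronous",
    "context": {"s3_file": "input/customers.csv"},
}
url = build_metaservlet_url(
    "http://203.0.113.10:8080/org.talend.administrator", run_task
)
```

An HTTP GET on the resulting URL is what would ask Talend Administration Center to run the task. Because the credentials are only Base64-encoded, not encrypted, the default admin password should be changed outside of a demo.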

Congratulations! You have successfully installed Talend Administration Center on AWS EC2.

Deploy Talend Job Server on AWS EC2

As described in the architecture, a Job server will be used to execute the Talend Job which downloads CSV files from S3 to convert them into XML files.

Launch an EC2 instance
  1. Go to EC2 console in Ireland region
  2. Click Launch Instance
  3. Choose AMI: Amazon Linux AMI 2016.09.1 64 bit is a good fit for this demo. Select this AMI.
  4. Select t2.medium, which is also a good fit for this demo.
  5. Next: Configure Instance Details.
    • Choose your default VPC
    • Enable Auto-assign Public IP
    • Keep other settings with default values

      Note that you must define an IAM role with access to the S3 bucket if you want to use the "Inherit credentials from AWS role" option of the S3 Talend components. This option delegates the access to the role inheritance and thus you will not need a Secret Access Key. In this article, we are using an access key in our S3 components to keep it simple.

  6. Next: Add Storage.
    • Size = 16GiB

      Note that since the S3 files are downloaded from S3 to the execution server, you should size the disk appropriately so that it can hold your S3 file input and the output file created by your job(s). After uploading the output file to S3, we can design our DI job(s) to delete all local files to clean up after the operation.

  7. Next: Add Tags.
    • Name = Talend Job Server
  8. Next: Configure Security Group. Create a new security group with the following rules:
    1. Add an SSH rule with values:
      1. Type = ssh
      2. Protocol = TCP
      3. Port range = 22
      4. Source = Custom 0.0.0.0/0
    2. Add a custom TCP rule with values:
      1. Type = Custom TCP Rule
      2. Protocol = TCP
      3. Port range = 8000
      4. Source = Custom sg-20646946 (replace with your own TAC security group ID)
    3. Add a custom TCP rule with values:
      1. Type = Custom TCP Rule
      2. Protocol = TCP
      3. Port range = 8001
      4. Source = Custom sg-20646946 (replace with your own TAC security group ID)
    4. Add a custom TCP rule with values:
      1. Type = Custom TCP Rule
      2. Protocol = TCP
      3. Port range = 8888
      4. Source = Custom sg-20646946 (replace with your own TAC security group ID)
  9. Review and Launch.
    • Ignore the security group warning for now. Best practice is to avoid using 0.0.0.0/0 and instead restrict the port to your corporate IP addresses.
  10. Launch the instance.
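
The Job Server security group rules from step 8 can also be expressed as data. The sketch below assumes the boto3 SDK's IpPermissions structure; the helper name is hypothetical. SSH is opened to a CIDR range, while ports 8000, 8001, and 8888 accept traffic only from the Talend Administration Center security group.

```python
def job_server_ingress_rules(tac_security_group_id, ssh_cidr="0.0.0.0/0"):
    """Build the IpPermissions list for the Job Server security group."""
    rules = [{
        "IpProtocol": "tcp",
        "FromPort": 22,
        "ToPort": 22,
        # For real environments, pass your corporate CIDR instead of 0.0.0.0/0.
        "IpRanges": [{"CidrIp": ssh_cidr}],
    }]
    # Ports from step 8, reachable only from the TAC security group.
    for port in (8000, 8001, 8888):
        rules.append({
            "IpProtocol": "tcp",
            "FromPort": port,
            "ToPort": port,
            "UserIdGroupPairs": [{"GroupId": tac_security_group_id}],
        })
    return rules
```

With boto3, the list would be passed as ec2.authorize_security_group_ingress(GroupId=<your Job Server security group ID>, IpPermissions=job_server_ingress_rules("sg-20646946")), replacing sg-20646946 with your own TAC security group ID.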
Install Talend Job Server

Follow section 2, Installing your Talend product using Talend Installer (recommended), of the Installation Guide to install and configure your Talend Job Server. Installing the Job Server is straightforward when using Talend Installer.

Declare the Job Server as an execution server in Talend Administration Center

It’s time to declare the Job Server as an execution server in Talend Administration Center.

To do so, follow these steps:

  1. Connect to TAC web interface with a web browser
  2. Navigate to Conductor > Servers .
  3. Click Add > Add server
  4. Use the settings as below to declare your job server
    • Label = job server
    • Host = <use the private ip of the ec2 server hosting your job server>

      Note

      The private IP address can be found in EC2 console in the instance details.

      Using private IP address instead of Public IP is sufficient for Talend Administration Center EC2 to reach the Job Server host since both EC2 hosts are located in the same default VPC.

    • Keep all other fields with default values
  5. Click Save. The server should now appear in the list of servers.

Congratulations! You have successfully installed Talend Job Server on AWS EC2 and declared it as an execution server in Talend Administration Center.

In this example, we are using an always-on execution server. This architecture can also evolve to use an EC2 Server definition in Talend Administration Center; refer to the documentation on how to add an EC2 execution server to Talend Administration Center. Talend Administration Center can then start the execution server EC2 instance before executing the Job and shut it down when the Job has finished. You can add an EC2 Server definition per task so that an EC2 instance is started for each Job and then shut down. This provides a scalable architecture that adheres to AWS principles, that is, use resources only when you need to compute data: when there is no file in S3 to process, no execution server is running.

Import the Job in Talend Studio

A very simple Job is provided in the Related Files section below. This Job will be used to test the whole architecture.

  1. Download the file customers_csv_to_xml.zip from the Related Files section below.
  2. Launch Talend Studio 6.3.
  3. Create a project named Demos in the Studio.
  4. Import the Job into Talend Studio using the zip file customers_csv_to_xml_0.1.zip.
    • See section 5.2.1, How to import items, in the documentation for the steps to import items into Talend Studio.
  5. Open the Job customers_csv_to_xml once it is imported.

This is a very simple job which performs the following:

  • Connect to S3 using provided access key credentials
  • Create temporary files
  • Download the S3 CSV file from folder input to a local temporary file
  • Read the temporary CSV file then convert it to a temporary local XML file
  • Upload the temporary XML file back to S3 into the output folder.
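
To make the flow above concrete outside of Talend Studio, here is a minimal sketch of the same download, convert, and upload sequence. It is not the Talend Job itself: it assumes a boto3-style S3 client passed in by the caller, works in memory rather than with temporary files, and uses hypothetical element names (customers, customer). Column headers are used as XML tag names, so they are assumed to be valid XML names.

```python
import csv
import io
import xml.etree.ElementTree as ET


def csv_to_xml(csv_text, root_tag="customers", row_tag="customer"):
    """Convert CSV text with a header row into an XML string."""
    root = ET.Element(root_tag)
    for row in csv.DictReader(io.StringIO(csv_text)):
        item = ET.SubElement(root, row_tag)
        for column, value in row.items():
            # Each CSV column becomes a child element of the row element.
            ET.SubElement(item, column).text = value
    return ET.tostring(root, encoding="unicode")


def convert_s3_file(s3_client, bucket, key):
    """Download a CSV object, convert it, and upload the XML under output/."""
    body = s3_client.get_object(Bucket=bucket, Key=key)["Body"].read()
    xml_text = csv_to_xml(body.decode("utf-8"))
    # input/customers.csv -> output/customers.xml
    base_name = key.rsplit("/", 1)[-1].rsplit(".", 1)[0]
    out_key = f"output/{base_name}.xml"
    s3_client.put_object(Bucket=bucket, Key=out_key, Body=xml_text.encode("utf-8"))
    return out_key
```

For a header row of id,name and a single data row 1,Ada, csv_to_xml returns <customers><customer><id>1</id><name>Ada</name></customer></customers>.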

Let's test the job in the local studio before deploying it in the cloud.

  1. A CSV file named customers.csv is provided in the Attachments section below. Download it locally, then upload it to S3 in the input folder of the talend-lambda-demos bucket.
  2. In Talend Studio, set the Job context parameters:
    1. context.s3_file = input/customers.csv
    2. context.aws_access_key_id = <use your aws access key id>
    3. context.aws_secret_access_key = <use your aws secret access key>

    Read the documentation at http://docs.aws.amazon.com/en_en/general/latest/gr/managing-aws-access-keys.html to learn how to create, manage, and download AWS access keys.

  3. Execute the Job:
    1. The CSV file is downloaded, transformed into an XML file, then uploaded to the S3 output folder.
    2. Check the output folder in S3. It should now contain a new XML file.

Congratulations! You have successfully tested the Job with Talend Studio. The next step is to build the Job, then deploy it with Talend Administration Center on AWS.

Build and deploy the Job

In the previous step, you ran the Job in your local Studio to test it. The goal, however, is to deploy this Job on the Talend environment hosted on AWS, so that it is executed automatically whenever a file is uploaded to the input folder on S3.

Let’s build and deploy the job.

  1. Build with Studio. See section 5.2.2, How to build Jobs, in the documentation if you are not familiar with this procedure.
    • In the Studio, right-click the Job customers_csv_to_xml_file, then select Build Job.
    • Choose a destination (folder and filename) on your laptop.
    • Choose Standalone Job as the build type.
    • Keep all other values as default.
    • Click Finish.
  2. Connect to Talend Administration Center.
  3. Click Projects .
  4. Create a new project:
    • Label = Demos
    • Project type = Data Quality
    • Storage = None
  5. Go to Settings > Project Authorizations. Then grant read/write permissions on the Demos project to admin@company.com, the user you are connected with.
  6. Click on Conductor > Job Conductor
  7. Click on Add > Normal Task