AWS - Integrating Talend Data Integration with S3 and Lambda - 7.3

AWS S3 (Simple Storage Service) is the widely used object storage service of Amazon Web Services, and Talend provides out-of-the-box connectivity with it. AWS Lambda is another AWS service, which lets you run code without provisioning or managing servers. This approach is known as serverless computing.

In this article, we will demonstrate how to integrate Talend Data Integration with AWS S3 and AWS Lambda. We will build an event-driven architecture where an end-user drops a file in S3, and S3 notifies a Lambda function which triggers the execution of a Talend Job to process the S3 file.

Architecture

  1. A CSV file is uploaded into an S3 bucket.
  2. S3 sends a notification by invoking a Lambda function.
  3. The Lambda function invokes the execution of a Talend Job through Talend Administration Center HTTP API (MetaServlet API).
  4. Talend Administration Center launches the Talend Job on a Talend JobServer.
  5. The Talend Job downloads the CSV file from S3, converts it to XML, then uploads the result back to S3.
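
Step 3 relies on the MetaServlet API, which takes a JSON request that is Base64-encoded and appended to the metaServlet URL. The following is a minimal sketch of how such a request could be built in Java, not the Lambda code used in this article. The class and method names are hypothetical, the task id and Talend Administration Center address are placeholders, and passing the S3 key through a context entry named s3_file is an assumption based on the context parameter of the sample Job.

```java
import java.nio.charset.StandardCharsets;
import java.util.Base64;

/** Sketch (hypothetical helper): builds a MetaServlet URL that asks TAC to run a task. */
final class MetaServletRequest {

    /**
     * Builds the JSON payload for a runTask action.
     * Assumption: the S3 object key is handed to the Job as the s3_file context value.
     */
    static String runTaskJson(String user, String password, int taskId, String s3File) {
        return "{\"actionName\":\"runTask\","
             + "\"authUser\":\"" + escape(user) + "\","
             + "\"authPass\":\"" + escape(password) + "\","
             + "\"taskId\":" + taskId + ","
             + "\"mode\":\"asynchronous\","
             + "\"context\":{\"s3_file\":\"" + escape(s3File) + "\"}}";
    }

    /** Base64-encodes the JSON request and appends it after '?' on the metaServlet path. */
    static String buildUrl(String tacBaseUrl, String json) {
        String encoded = Base64.getEncoder()
                .encodeToString(json.getBytes(StandardCharsets.UTF_8));
        return tacBaseUrl + "/metaServlet?" + encoded;
    }

    /** Escapes backslashes and double quotes so the value is safe inside a JSON string. */
    private static String escape(String s) {
        return s.replace("\\", "\\\\").replace("\"", "\\\"");
    }
}
```

For example, MetaServletRequest.buildUrl("http://<PUBLIC_IP_OF_YOUR_EC2_INSTANCE>:8080/tac", MetaServletRequest.runTaskJson("admin@company.com", "admin", 1, "input/customers.csv")) yields a URL that the Lambda function could call with a plain HTTP GET. The asynchronous mode is chosen here so that the caller does not wait for the Job to finish; whether that suits your case depends on how you want to handle failures.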

Assumptions

  1. Amazon Web Services (AWS):
    • You should be familiar with the AWS platform, since this article does not go into the details of administering and managing AWS services. Refer to Amazon Web Services (AWS) - Getting Started for an overview of the AWS functionality that Talend supports.
    • You should also have full access to the main AWS services described in the Prerequisites section below.
  2. Talend

    You should be familiar with Installation and Management of Talend Data Integration.

  3. Eclipse

    You should be familiar with Eclipse and Java, since we will use the AWS Toolkit for Eclipse to develop a Lambda function.

Environment

This demonstration is based on AWS Cloud Platform and Talend Data Integration.

Prerequisites

  1. A valid AWS account with full access to the following services:
  2. Valid AWS access keys to programmatically access AWS services:

    Read the documentation at http://docs.aws.amazon.com/en_en/general/latest/gr/managing-aws-access-keys.html to learn how to create, manage, and use AWS access keys.

  3. AWS Toolkit for Eclipse

    Follow the online documentation at http://docs.aws.amazon.com/toolkit-for-eclipse/v1/user-guide/setup-install.html to install the AWS toolkit on your laptop. This will be used to develop the Lambda Java function.

  4. Talend Data Integration (Commercial Edition) - https://www.talend.com/products/data-integration
    • Talend Studio
    • Talend Administration Center
    • Talend JobServer

Choose an AWS Region

Procedure

  1. Connect to AWS console.
  2. Choose a region where AWS Lambda is available, since Lambda is not offered in all regions at this time. In this demonstration, we choose Ireland.

    Other considerations when choosing an AWS region include latency and compliance requirements.

Create a bucket

Procedure

  1. Connect to AWS S3 console.
  2. Click Create bucket.
    • Bucket name = talend-lambda-demos
    • Region = EU (Ireland)
  3. Click Next.
  4. Set properties: keep default values.
  5. Click Next.
  6. Set permissions:
    1. Click Manage public permissions.
    2. Add Read/Write permissions on Objects for Any Authenticated AWS user.
  7. Click Next and review.
  8. Click Create bucket.

    The bucket is created. Now, let's create two folders inside the bucket:

    • the input folder: files will be dropped into this folder.
    • the output folder: Talend Job will write results into this folder.
  9. Click Create folder, type input as the name then click Save.

    Click Create folder again, type output as the name then click Save.

    You have successfully created your bucket containing two folders in the Ireland region.
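
Keep in mind that S3 has no real directories: the input and output folders are key prefixes, and the console represents an empty folder as an object whose key ends with a slash. Because the Job writes its result into the same bucket, the code that reacts to S3 notifications should only act on files under input. The following is a minimal sketch of such a check, assuming CSV input files; the class and method names are hypothetical.

```java
import java.util.Locale;

/** Sketch (hypothetical helper): decides whether an S3 object key should trigger the Job. */
final class InputKeyFilter {

    /** Prefix that corresponds to the "input" folder created in the bucket. */
    static final String INPUT_PREFIX = "input/";

    /**
     * Returns true only for CSV objects under the input/ prefix.
     * Folder placeholders (keys ending with '/') and anything under output/ are ignored,
     * so results written by the Job never trigger another run.
     */
    static boolean shouldProcess(String key) {
        if (key == null || !key.startsWith(INPUT_PREFIX)) {
            return false;
        }
        if (key.endsWith("/")) {
            return false; // folder placeholder, not a data file
        }
        return key.toLowerCase(Locale.ROOT).endsWith(".csv");
    }
}
```

With this check, input/customers.csv is accepted, while input/ and output/customers.xml are rejected.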

Deploy Talend Administration Center on AWS EC2

Talend Administration Center is the main component of this architecture. Let’s launch an EC2 instance and deploy Talend Administration Center on it.

Procedure

  1. Connect to EC2 console in Ireland region.
  2. Click Launch instance.
  3. Choose an AMI. Amazon Linux AMI 2018.03.0 (HVM) 64 bit is a good fit for this demo.
  4. Select the t2.medium instance type, which is a good fit for this demo.
  5. Configure Instance Details.
    • Choose your default VPC
    • Enable Auto-assign Public IP
    • Keep other settings with default values
  6. Add Storage, Size = 16 GiB.
  7. Add Tags, Key = Name and Value = Talend Administration Center.
  8. Configure Security Group.
    • Add an SSH rule with values:
      • Type = SSH
      • Protocol = TCP
      • Port Range = 22
      • Source = Custom 0.0.0.0/0
    • Add a custom TCP rule with values:
      • Type = Custom TCP Rule
      • Protocol = TCP
      • Port Range = 8080
      • Source = Custom 0.0.0.0/0
  9. Review and Launch.
    • Ignore the security group warning for this demo. In real environments, you should avoid opening ports to 0.0.0.0/0.

      Take note of the security group ID, as you will need it when creating the security group of the Talend JobServer.

  10. Launch the instance.
  11. Follow the section Installing and configuring Talend Administration Center in Talend Installation Guide to install and configure Talend Administration Center on the EC2 instance you just launched.

    Once you have installed Talend Administration Center on the EC2 instance, verify that you can connect to Talend Administration Center Web interface using your favorite web browser.

    The Talend Administration Center URL should look like this: http://<PUBLIC_IP_OF_YOUR_EC2_INSTANCE>:8080/tac, where <PUBLIC_IP_OF_YOUR_EC2_INSTANCE> is the public IP of the EC2 instance hosting Talend Administration Center. This information is available in the instance details on the AWS EC2 console.

    We will use the credentials admin@company.com/admin.

Deploy Talend JobServer on AWS EC2

As described in the architecture, a Talend JobServer will be used to execute the Talend Job which downloads CSV files from S3 to convert them into XML files.

Now let’s launch an EC2 instance and deploy Talend JobServer on AWS.

Procedure

  1. Connect to EC2 console in Ireland region.
  2. Click Launch instance.
  3. Choose an AMI. Amazon Linux AMI 2018.03.0 (HVM) 64 bit is a good fit for this demo.
  4. Select the t2.medium instance type, which is a good fit for this demo.
  5. Configure Instance Details.
    • Choose your default VPC.
    • Enable Auto-assign Public IP.
    • Keep other settings with default values.

    Note that you must define an IAM role with access to the S3 bucket if you want to use the Inherit credentials from AWS role option of the S3 Talend components. This option delegates the access to the role inheritance and thus you will not need a Secret Access Key. In this article, we are using an access key in our S3 components to keep it simple.

  6. Add Storage, Size = 16 GiB.

    Note that since the S3 files are downloaded from S3 to the execution server, you should size the disk appropriately so that it can hold your S3 file input and the output file created by your Job(s). After uploading the output file to S3, we can design our DI Job(s) to delete all local files to clean up after the operation.

  7. Add Tags, Key = Name and Value = Talend JobServer.
  8. Configure Security Group. Create a new security group with the following rules:
    • Add an SSH rule with values:
      • Type = SSH
      • Protocol = TCP
      • Port Range = 22
      • Source = Custom 0.0.0.0/0
    • Add a custom TCP rule with values:
      • Type = Custom TCP Rule
      • Protocol = TCP
      • Port Range = 8000
      • Source = Custom <your TAC security group id>
    • Add a custom TCP rule with values:
      • Type = Custom TCP Rule
      • Protocol = TCP
      • Port Range = 8001
      • Source = Custom <your TAC security group id>
    • Add a custom TCP rule with values:
      • Type = Custom TCP Rule
      • Protocol = TCP
      • Port Range = 8888
      • Source = Custom <your TAC security group id>
  9. Review and Launch.

    Ignore the security group warning for this demo.

    As a best practice, avoid using 0.0.0.0/0 in real environments. Instead, restrict the ports to your corporate IP addresses.

  10. Launch the instance.
  11. Follow the section Installing and configuring your Talend JobServer in Talend Installation Guide to install and configure your Talend JobServer.
  12. Declare the Talend JobServer as an execution server in Talend Administration Center via the following steps.
    1. Connect to Talend Administration Center web interface with a web browser.
    2. Navigate to Conductor > Servers.
    3. Click Add > Add server.
    4. Use the settings as below to declare your Job server.
      • Label = job server
      • Host = <the private ip of the ec2 server hosting your job server>

        The private IP address can be found in EC2 console in the instance details.

        The private IP address is sufficient for the Talend Administration Center instance to reach the Talend JobServer host, since both EC2 instances are located in the same default VPC.

      • Keep all other fields with default values
    5. Click Save. The server should now appear in the list of servers.

    You have successfully installed Talend JobServer on AWS EC2 and declared it as an execution server in Talend Administration Center.
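
If the server does not show up as available in Talend Administration Center, a first thing to check is whether ports 8000, 8001 and 8888 of the JobServer are reachable from the Talend Administration Center host, as allowed by the security group rules above. The following is a minimal sketch of a TCP probe that could be run on that host; the class name is hypothetical and the probe only confirms that a connection can be opened, not that the JobServer itself is healthy.

```java
import java.io.IOException;
import java.net.InetSocketAddress;
import java.net.Socket;

/** Sketch (hypothetical helper): checks whether a TCP port accepts connections. */
final class PortProbe {

    /**
     * Tries to open a TCP connection to host:port within the given timeout.
     * Returns false on refusal, timeout, or any other I/O error.
     */
    static boolean isOpen(String host, int port, int timeoutMillis) {
        try (Socket socket = new Socket()) {
            socket.connect(new InetSocketAddress(host, port), timeoutMillis);
            return true;
        } catch (IOException e) {
            return false;
        }
    }
}
```

For example, calling PortProbe.isOpen("<private ip of your job server>", 8000, 2000) from the Talend Administration Center instance should return true once the security group and the JobServer are set up; repeat the call for 8001 and 8888.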

    In this example, we are using an always-on execution server. This architecture can also evolve to use an EC2 Server definition in Talend Administration Center. Refer to the documentation on how to add an EC2 execution server to Talend Administration Center. With such a definition, Talend Administration Center starts the execution server EC2 instance before executing the Job, then shuts the instance down when the Job has finished executing. You can add an EC2 Server definition per task so that an EC2 instance is started for each Job and then shut down. This provides a scalable architecture which adheres to AWS principles, that is, only use resources when you need to compute data. When there is no file in S3 to process, there is no execution server running.

Import the Job in Talend Studio

A sample Job has been provided to test the whole architecture.

This is a very simple Job which performs the following:

  • Connect to S3 using provided access key credentials
  • Create temporary files
  • Download the S3 CSV file from folder input to a local temporary file
  • Read the temporary CSV file then convert it to a temporary local XML file
  • Upload the temporary XML file back to S3 into folder output
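
To make the middle steps concrete, here is a minimal Java sketch of the "create temporary files, convert CSV to XML, clean up" part. It is not the code generated by the Talend Job. The class name, the customers and customer element names, and the assumption that the CSV has a header row with no quoted fields are all illustrative; the S3 download and upload are left out.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.List;

/** Sketch (hypothetical): CSV to XML conversion through temporary local files. */
final class CsvToXml {

    /** Converts CSV lines (first line is the header) to XML. Assumes no quoted fields. */
    static String convert(List<String> csvLines, String rootName, String rowName) {
        String[] header = csvLines.get(0).split(",", -1);
        StringBuilder xml = new StringBuilder("<" + rootName + ">\n");
        for (String line : csvLines.subList(1, csvLines.size())) {
            if (line.isEmpty()) continue; // skip blank lines
            String[] cells = line.split(",", -1);
            xml.append("  <").append(rowName).append(">\n");
            for (int i = 0; i < header.length && i < cells.length; i++) {
                String tag = header[i].trim(); // assumed to be a valid XML element name
                xml.append("    <").append(tag).append('>')
                   .append(escape(cells[i].trim()))
                   .append("</").append(tag).append(">\n");
            }
            xml.append("  </").append(rowName).append(">\n");
        }
        return xml.append("</").append(rootName).append(">\n").toString();
    }

    /** Round-trips through temp files like the Job does, then deletes them. */
    static String convertViaTempFiles(List<String> csvLines) {
        Path csv = null;
        Path xml = null;
        try {
            csv = Files.createTempFile("customers", ".csv");
            xml = Files.createTempFile("customers", ".xml");
            Files.write(csv, csvLines, StandardCharsets.UTF_8);
            String result = convert(Files.readAllLines(csv, StandardCharsets.UTF_8),
                    "customers", "customer");
            Files.write(xml, result.getBytes(StandardCharsets.UTF_8));
            return new String(Files.readAllBytes(xml), StandardCharsets.UTF_8);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        } finally {
            deleteQuietly(csv); // free local disk space on the execution server
            deleteQuietly(xml);
        }
    }

    private static String escape(String s) {
        return s.replace("&", "&amp;").replace("<", "&lt;").replace(">", "&gt;");
    }

    private static void deleteQuietly(Path p) {
        if (p == null) return;
        try { Files.deleteIfExists(p); } catch (IOException ignored) { }
    }
}
```

Deleting the temporary files in a finally block keeps the local disk from filling up even when a conversion fails part way through.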

Let's test the Job in the local studio before deploying it in the Cloud.

Procedure

  1. Download the file customers_csv_to_xml.zip from the Downloads tab in the left panel of this page.
  2. Launch Talend Studio.
  3. Create a project with the name Demos in the Studio.
  4. Import the Job in Talend Studio using the zip file customers_csv_to_xml.zip.
  5. Open the Job customers_csv_to_xml once imported.
  6. Download the CSV file customers.csv from the Downloads tab in the left panel of this page and upload it to S3 in the folder input of the talend-lambda-demos bucket.
  7. In Talend Studio, set the Job context parameters:
    • context.s3_file = input/customers.csv
    • context.aws_access_key_id = <your aws access key id>
    • context.aws_secret_access_key = <your aws secret access key>

    Read the documentation at http://docs.aws.amazon.com/en_en/general/latest/gr/managing-aws-access-keys.html to learn how to create, manage, and download AWS access keys.

  8. Execute the Job.
    • The CSV file is downloaded, transformed into an XML file, then uploaded to the S3 output folder.
    • Check the output folder in S3. It should contain a new XML file.

    You have successfully tested the Job with Talend Studio. The next step is to build the Job then deploy it with Talend Administration Center on AWS.

Build and deploy the Job

In the previous step, you ran the Job in your local Studio to test it. The goal, however, is to deploy this Job on the Talend environment hosted on AWS, so that it is executed automatically whenever a file is uploaded to the input folder on S3.

Let’s build and deploy the Job.

Procedure

  1. Build the Job in Studio via the following steps:
    1. In the Studio, right-click the Job customers_csv_to_xml then select Build Job.
    2. Specify the path for the archive file in the To archive file field.
    3. Select Standalone Job from the Select the build type drop-down list.
    4. Click Finish.
  2. Connect to Talend Administration Center.
  3. In the menu tree view, click Projects.
  4. Create a new project Demos.
  5. Go to Settings > Project authorizations. Then add read/write permissions on project Demos for the user you are connected with, admin@company.com.
  6. In the menu tree view, click Conductor > Job Conductor.
  7. Click on Add > Normal Task.
  8. In the Label field, enter customers_csv_to_xml.
  9. Click on the Import zip icon and upload the Job zip file. This should automatically fill other fields except the execution server.
    Note: If you have trouble uploading the zip file, the cause may be the size of the file combined with the latency between your location and the AWS cloud, which can lead to an upload timeout on the Talend Administration Center side. A possible workaround is to launch a Windows EC2 instance in the same region and VPC as Talend Administration Center, copy the zip file to that instance, then use its web browser to upload the zip file.
  10. Select job server as execution server.
  11. Review your settings then click Save.
  12. You can test the Job again with these steps:
    1. In Job Conductor, set the context values aws_secret_access_key, aws_access_key_id and s3_file (full path with folder, e.g. input/customers.csv).
    2. Deploy the task.
    3. Upload a CSV file to S3 at the location given by the parameter s3_file (e.g. input/customers.csv).
    4. Run the task.
    5. Check the result.
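
The same task can also be started over HTTP through the MetaServlet API, which is what the Lambda function will do. The following is a minimal sketch of the HTTP side only, assuming you already have a complete MetaServlet URL (base address, /metaServlet, and the Base64-encoded JSON request). The class name is hypothetical, and the success check assumes the JSON response carries a numeric returnCode field equal to 0 on success; verify this against the responses of your own Talend Administration Center.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.net.HttpURLConnection;
import java.net.URL;
import java.nio.charset.StandardCharsets;
import java.util.regex.Pattern;

/** Sketch (hypothetical helper): calls a MetaServlet URL and inspects the response. */
final class MetaServletClient {

    /** Assumption: a successful response contains "returnCode":0. */
    private static final Pattern RETURN_CODE_OK =
            Pattern.compile("\"returnCode\"\\s*:\\s*0\\b");

    /** Sends an HTTP GET to the given URL and returns the response body as text. */
    static String call(String metaServletUrl, int timeoutMillis) {
        HttpURLConnection conn = null;
        try {
            conn = (HttpURLConnection) new URL(metaServletUrl).openConnection();
            conn.setRequestMethod("GET");
            conn.setConnectTimeout(timeoutMillis);
            conn.setReadTimeout(timeoutMillis);
            // Error responses expose their body through the error stream.
            InputStream in = conn.getResponseCode() < 400
                    ? conn.getInputStream() : conn.getErrorStream();
            if (in == null) return "";
            ByteArrayOutputStream out = new ByteArrayOutputStream();
            byte[] buffer = new byte[4096];
            for (int n; (n = in.read(buffer)) != -1; ) {
                out.write(buffer, 0, n);
            }
            return new String(out.toByteArray(), StandardCharsets.UTF_8);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        } finally {
            if (conn != null) conn.disconnect();
        }
    }

    /** Returns true when the response body reports returnCode 0. */
    static boolean isSuccess(String responseBody) {
        return responseBody != null && RETURN_CODE_OK.matcher(responseBody).find();
    }
}
```

Setting both a connect and a read timeout matters here because a Lambda function has a limited execution time, so a call that hangs on an unreachable Talend Administration Center should fail quickly.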

Create a Java Lambda function

To create a Lambda function, we will use the AWS Toolkit for Eclipse.

Download and set up the AWS Toolkit for Eclipse on your laptop using the instructions at https://aws.amazon.com/en/eclipse/.

Once configured, use Eclipse to create a new Lambda Java project:

Procedure

  1. Click File > New > Project.
  2. In the New Project Wizard, click AWS > AWS Lambda Java Project.
  3. Click Next.
  4. Set the values as below:
    • Project Name = TalendLambdaProject
    • Package Name = com.talend.lambda.handlers
    • Class Name = S3LambdaFunctionHandler
    • Output type = Object
  5. Click Finish.

    The README guide is displayed. Take some time to read it and get familiar with the steps to create, develop and deploy a Lambda function.
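
Whatever form the handler takes, it needs the bucket name and object key from the S3 notification, which nests them under Records, then s3, then bucket.name and object.key. The following is a minimal sketch that reads them from the event represented as a plain Map, so that it runs without any AWS library. This is an assumption made for illustration; the class generated by the toolkit may use typed event classes instead, and the class and method names below are hypothetical. Note that object keys arrive URL-encoded in S3 notifications, for example with spaces replaced by plus signs.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLDecoder;
import java.util.List;
import java.util.Map;

/** Sketch (hypothetical helper): reads bucket and key from an S3 notification event. */
final class S3EventReader {

    /**
     * Returns {bucketName, objectKey} for the first record of the event.
     * The event is expected to follow the S3 notification layout:
     * Records[0].s3.bucket.name and Records[0].s3.object.key.
     */
    @SuppressWarnings("unchecked")
    static String[] firstBucketAndKey(Map<String, Object> event) {
        List<Map<String, Object>> records =
                (List<Map<String, Object>>) event.get("Records");
        if (records == null || records.isEmpty()) {
            throw new IllegalArgumentException("No Records in S3 event");
        }
        Map<String, Object> s3 = (Map<String, Object>) records.get(0).get("s3");
        Map<String, Object> bucket = (Map<String, Object>) s3.get("bucket");
        Map<String, Object> object = (Map<String, Object>) s3.get("object");
        return new String[] {
            (String) bucket.get("name"),
            decodeKey((String) object.get("key"))
        };
    }

    /** Decodes the URL-encoded object key found in S3 notifications. */
    static String decodeKey(String encodedKey) {
        try {
            return URLDecoder.decode(encodedKey, "UTF-8");
        } catch (UnsupportedEncodingException e) {
            throw new IllegalStateException("UTF-8 is always supported", e);
        }
    }
}
```

The decoded key, for example input/customers.csv, is the value the handler would pass on to the Talend Job as its s3_file context parameter.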