Amazon Web Services (AWS) - Getting Started

Versions: 6.4, 6.3, 6.2, 6.1, 6.0, 5.6
Products: Talend Big Data, Talend Data Fabric, Talend Data Management Platform, Talend MDM Platform, Talend Real-Time Big Data Platform, Talend Data Integration, Talend Big Data Platform, Talend ESB, Talend Data Services Platform
Tasks: Installation and Upgrade, Deployment, Administration and Monitoring, Design and Development
Platforms: Talend CommandLine, Talend Runtime, Talend Studio, Talend Artifact Repository, Talend ESB, Talend Administration Center, Talend JobServer


Overview

Amazon Web Services (AWS) provides on-demand computing resources and services in the cloud, with pay-as-you-go pricing, to individuals, companies and governments. The AWS documentation is very extensive and is a good way to get started with the basic services that AWS provides.

In this article, we will outline those services where you can leverage Talend for your data integration requirements. We will describe one reference architecture for deploying Talend within the AWS Cloud Computing platform and highlight some best practices to adopt.

AWS Global Infrastructure - Regions and Availability Zones

The AWS Global Infrastructure is built around Regions and Availability Zones (AZ). At the time of writing (June 2016), AWS operates 12 geographic Regions around the globe:

Americas
  • Northern Virginia
  • Oregon
  • Northern California
  • São Paulo
  • GovCloud

Europe/Middle East/Africa
  • Ireland
  • Frankfurt

Asia Pacific
  • Singapore
  • Tokyo
  • Sydney
  • Seoul
  • Beijing

Note: AWS has no Region in Africa yet. The closest are the European Regions.

AWS has several data centers geographically dispersed across each Region. An Availability Zone consists of one or more of those discrete data centers, each with redundant power, networking and connectivity, housed in separate facilities. Availability Zones make it possible to operate production applications and databases that are more highly available, fault tolerant and scalable than would be possible from a single data center. The diagram below represents how Regions and Availability Zones relate to each other.

AWS operates 33 Availability Zones within the 12 geographic Regions mentioned above.

Networking

AWS Cloud provides 3 Networking services:

Amazon VPC

Amazon Virtual Private Cloud (VPC) lets you define private computing resources that are logically isolated in a virtual network that you own. You can easily control access to the virtual private network and the traffic in and out of it. It is secure and provides an easy way to segregate environments. For example, a Talend customer looking to set up multiple environments will use a different VPC for each environment.

AWS Direct Connect

AWS Direct Connect makes it easy to establish a dedicated network connection from an on-premises network to AWS. Its benefits include reduced bandwidth costs and more consistent network performance than internet-based connections. This is generally transparent for Talend usage; the company's IT group sets it up.

Amazon Route 53

Amazon Route 53 is a Domain Name System (DNS) web service. It routes user requests to services running within the Cloud and allows the use of a fully qualified domain name instead of an IP address.

Although these Networking services are transparent to Talend, it is important to understand how they work because they govern connectivity. The security, port, firewall, DNS name, and other networking configuration will allow or restrict access to services running within EC2 instances and the AWS Cloud. Hence, it needs to be configured properly for Talend Jobs and Services to be able to connect to resources such as S3 buckets, EMR clusters, and databases.
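
Because connectivity depends on this configuration, a quick reachability check run from the EC2 instance that executes the Jobs can help separate networking problems from Job design problems. The following is a minimal Python sketch using only the standard library. The host names and ports in ENDPOINTS are hypothetical placeholders, not values from a Talend installation.

```python
import socket


def can_connect(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and DNS resolution failures.
        return False


# Hypothetical endpoints; replace with your own host names and ports.
ENDPOINTS = [
    ("tac.example.internal", 8080),
    ("database.example.internal", 3306),
]


def report(endpoints):
    """Map each 'host:port' string to the result of the reachability check."""
    return {f"{host}:{port}": can_connect(host, port) for host, port in endpoints}
```

Calling report(ENDPOINTS) on the instance returns one True or False per endpoint. A False result usually points to a port, firewall, or DNS setting rather than to the Job itself.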

Compute

AWS Cloud provides the following 3 main Compute services:

Amazon EC2

Amazon Elastic Compute Cloud (EC2) is a web service that lets you commission server instances within minutes. You can instantiate Windows and Linux servers with configurable RAM, CPU and disk space. Server instances can be started in various modes to optimise cost against the computing power needed. A Talend customer will leverage the EC2 service to instantiate servers for deploying the Talend platform.

Talend Usage:

  • Use Linux or Windows servers to install Talend Administration Center, Talend JobServer, Talend Runtime, MDM Server, and all other server components. Generally, we use 1 or more EC2 instances within separate VPCs to set up Talend for the DEV, TEST, UAT and Production environments.
  • Talend Administration Center can dynamically start, stop and terminate an EC2 server running a JobServer for Job execution.

    For more information about how to execute data integration Jobs on a server based on Amazon EC2, see the Talend Administration Center User Guide.

Amazon EC2 Container Service

Amazon EC2 Container Service (ECS) is a container management service that supports Docker containers. The Docker containers are started on an EC2 instance within the customer's own VPC, thereby providing a high level of isolation for the customer's applications. It is possible to deploy Talend within Docker containers that run in ECS.

Talend Usage:

  • Use Docker containers to deploy Talend server components such as Talend Administration Center, Job Server, and Runtime.

Elastic Beanstalk

AWS Elastic Beanstalk is a service for deploying web applications and services developed in languages such as Java, .NET, PHP, Node.js and Python, or packaged as Docker containers, on familiar servers such as Apache, Nginx, Passenger and IIS. It only requires the developer to upload the application binaries. The service handles deployment, capacity provisioning, load balancing, auto-scaling and application health monitoring.

Talend Usage:

  • One potential use of Elastic Beanstalk (still to be tested) is to deploy Microservices as Java applications. Microservices functionality is available in Talend ESB from version 6.2. AWS can auto-scale the number of Beanstalk instances based on the volume of messages on one or more SQS queues that have been configured as part of the auto-scaling. This is all done in the AWS configuration. The Microservices need to be stateless and configured so that they still behave correctly when multiple instances are started in Beanstalk.

Databases

AWS Cloud provides several categories of Database services, including:

Amazon RDS

Amazon Relational Database Service (RDS) is an easy-to-operate, scalable relational database service in the cloud. Amazon RDS makes it easy to deploy highly available database instances based on Amazon Aurora, Oracle, Microsoft SQL Server, PostgreSQL, MySQL and MariaDB. Customers can deploy Master and Slave topologies for their database needs and let RDS handle failover and high availability. Multiple instances can be deployed in different Availability Zones for a Multi-AZ configuration.

Talend Usage:

  • Use for the Talend Administration Center (TAC) database. The Talend Administration Center database contains metadata such as users, projects and tasks. It is generally small, less than 1 GB in size even with hundreds of Jobs. The RDS service is recommended when clustering the TAC scheduler.
  • Potential use for the ActiveMQ service when setting up a master/slave topology for ActiveMQ. Note that it is preferable to use EFS with KahaDB for ActiveMQ.
  • Talend provides components for connectivity in Jobs and Services.

Amazon Redshift

Amazon Redshift is a columnar storage technology that has been optimized for petabyte-scale data warehouse projects. The architecture of the service allows customers to automate the most common tasks of provisioning, configuring, and monitoring a cloud data warehouse. The data warehouse content can be configured to automatically and incrementally back up into Amazon S3.

Talend Usage:

  • Talend provides components for connectivity in Jobs and Services.

Amazon DynamoDB

Amazon DynamoDB is a NoSQL database that supports both document and key-value storage models.

Talend Usage:

  • Talend provides components for connectivity in Jobs and Services

Storage

AWS Cloud provides the following Storage services:

Amazon S3

Amazon Simple Storage Service (S3) provides customers with an object storage service. It is the storage infrastructure that AWS uses for various other services. Amazon S3 provides a range of storage classes designed for different use cases including:

  • General-purpose storage of frequently accessed data
  • S3 Standard - Infrequent Access for long-lived data that is less frequently accessed
  • Amazon Glacier for long-term archive

The storage class used for a given set of objects, and the transitions between classes, are controlled by configurable lifecycle policies on the data.
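
As an illustration of such a policy, the sketch below builds a lifecycle configuration in the dictionary shape accepted by the S3 API, for example through the boto3 call put_bucket_lifecycle_configuration. The function name, the prefix, and the 30-day and 365-day thresholds are assumptions chosen for the example, not recommendations from Talend or AWS.

```python
def build_lifecycle_configuration(prefix, ia_after_days=30, glacier_after_days=365):
    """Build an S3 lifecycle configuration that tiers objects under `prefix`.

    Objects move to S3 Standard - Infrequent Access first, then to Glacier.
    The thresholds are illustrative assumptions.
    """
    if not 0 < ia_after_days < glacier_after_days:
        raise ValueError("Infrequent Access must come before the Glacier transition")
    return {
        "Rules": [
            {
                "ID": f"tiering-{prefix.strip('/') or 'all'}",
                "Filter": {"Prefix": prefix},
                "Status": "Enabled",
                "Transitions": [
                    {"Days": ia_after_days, "StorageClass": "STANDARD_IA"},
                    {"Days": glacier_after_days, "StorageClass": "GLACIER"},
                ],
            }
        ]
    }
```

With boto3 installed and credentials configured, the result could be passed as the LifecycleConfiguration argument of put_bucket_lifecycle_configuration for a bucket of your choice.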

Talend Usage:

  • Talend provides components for connectivity in Jobs and Services.

Amazon Glacier

As mentioned above, Amazon Glacier is built on top of S3 and leverages the same infrastructure. Amazon Glacier is a secure, durable and extremely low-cost storage service for long-term backup and data archiving. Because of the nature of the service, a retrieval request takes several hours to process. It is like a replacement for tape. Customers should not use Glacier for frequently accessed data. The typical use case is to retrieve data from Glacier once every few years, for example for disaster recovery.

Talend Usage:

  • Talend provides S3 components to load data into S3. Since Amazon Glacier files and folders can be controlled using S3 policies, the same components can write to Glacier.

Amazon EBS

Amazon Elastic Block Store (EBS) provides persistent block level storage volumes for use with Amazon EC2 instances in the cloud. Amazon EBS provides raw block IO access and is suitable to be attached to 1 server instance. If we need storage to be attached to multiple instances, then we need Amazon Elastic File System (EFS) which exposes the NFSv4 protocol.

Talend Usage:

  • Use as a disk drive when the EBS volume is attached to the EC2 instance.

Amazon EFS

Amazon Elastic File System (EFS) is a simple, scalable file storage system, exposing the NFSv4 protocol, that can be used by multiple EC2 instances at the same time. With Amazon EFS, the storage capacity is elastic, growing and shrinking automatically as files are added to or removed from the file system. We use Amazon EFS for the Job Archives folder when multiple Talend Administration Centers (TAC) are clustered at the scheduler level.

Talend Usage:

  • Use as a shared disk drive when the EFS file system is attached to multiple EC2 instances. It is especially useful when clustering multiple Talend Administration Centers that have to share the content of the Job Archives folder.

Analytics

AWS Cloud provides the following Analytics services:

Amazon EMR

Amazon Elastic MapReduce (EMR) is a managed Hadoop framework provided as a service. It runs across multiple EC2 instances and can run distributed frameworks such as Apache Spark.

Talend Usage:

  • Talend provides connectivity and components to manage and leverage EMR as a Hadoop platform.

    For more information about how to use Talend with EMR, see Amazon EMR - Getting Started.

Amazon Kinesis

Amazon Kinesis is a platform for streaming data on AWS. It is essentially a SaaS technology that behaves very similarly to Apache Kafka: a high-throughput, distributed, publish-subscribe message queue. Generally, applications connecting to Amazon Kinesis leverage Spark Streaming.

Talend Usage:

  • Talend provides Kinesis components for Spark Streaming use cases.

Amazon Elasticsearch Service

Amazon Elasticsearch Service is a managed service for the Elasticsearch engine. Elasticsearch is a popular open-source search and analytics engine that is frequently used with Logstash and Kibana for log analytics and real-time monitoring.

Talend Usage:

  • Talend provides Elasticsearch components for use in Big Data use cases.
  • Talend can use Elasticsearch as part of its infrastructure for storing logs from Talend Administration Center, Job Server, Runtime, etc.

Security

AWS Cloud provides the following Security services:

AWS IAM

AWS Identity and Access Management (IAM) enables you to control access to the AWS services and resources for your users and processes.

IAM provides the functionality to create users, groups, roles, permissions, and policies. It is a central place to configure access.
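
To make the notion of an IAM policy concrete, the following Python sketch builds a policy document that grants read and write access to a single S3 bucket, the kind of permission an S3 connection would rely on. The function name and the bucket name are hypothetical, and the chosen actions are an assumption about what a typical load Job needs.

```python
import json


def s3_access_policy(bucket: str) -> dict:
    """Return an IAM policy document allowing list, read and write on one bucket."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                # Listing applies to the bucket itself.
                "Effect": "Allow",
                "Action": ["s3:ListBucket"],
                "Resource": f"arn:aws:s3:::{bucket}",
            },
            {
                # Reading and writing apply to the objects inside the bucket.
                "Effect": "Allow",
                "Action": ["s3:GetObject", "s3:PutObject"],
                "Resource": f"arn:aws:s3:::{bucket}/*",
            },
        ],
    }


def as_json(policy: dict) -> str:
    """Serialize the policy so it can be pasted into the IAM console."""
    return json.dumps(policy, indent=2)
```

For example, as_json(s3_access_policy("my-example-bucket")) produces a document that can be attached to a role in IAM.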

Talend Usage:

  • Talend components for S3 can inherit the roles and permissions defined in IAM for accessing S3 resources. These are defined in IAM and applied when the component connects to S3; Talend has no direct link or connectivity to IAM. The inheritance is part of the security setup of the EC2 instance.

AWS Directory Service

AWS Directory Service provides Microsoft Active Directory (AD) in the cloud, or connects to an on-premises Microsoft Active Directory, to manage your AWS resources.

Talend Usage:

  • Talend Administration Center can use LDAP to delegate user authentication. Hence, we can potentially configure Talend Administration Center to use AWS Directory Service (although this still needs to be tested).

Talend Architecture in AWS

AWS Cloud provides the following 3 benefits to customers:

  • Ease of deployment
  • Scale on demand
  • Minimize cost (i.e. only pay for the computing and storage resources you use)

As we deploy Talend in AWS, it is imperative to architect the platform correctly so as to maximize the benefits mentioned above. The following architecture diagram shows how we can deploy Talend in Amazon Cloud:

Environments

Talend recommends deploying a Development, Test/UAT and a Production environment in AWS. In the above architecture diagram, we show such a setup with 3 environments. The Test/UAT environment should be a replication of the Production environment. For organisations with many parallel projects and development teams, it may be necessary to have multiple Test and UAT environments to minimize dependency between teams when running UAT testing.

The environments should be completely segregated from one another, except for access to the Nexus snapshots and releases repositories. In certain scenarios, it is possible to have 1 releases repository for the Test/UAT environment and 1 releases repository for the Production environment. In this case, we can control access to Nexus from the Test/UAT and Production environments through firewall rules.
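
When each environment lives in its own VPC, one practical precaution is to give the VPCs address ranges that do not overlap, so that routing stays unambiguous if the environments must reach a shared Nexus. The sketch below checks this with the Python standard library. The environment names and CIDR blocks are hypothetical values chosen for the example.

```python
import ipaddress
from itertools import combinations

# Hypothetical CIDR blocks, one VPC per environment.
ENVIRONMENT_CIDRS = {
    "dev": "10.10.0.0/16",
    "test-uat": "10.20.0.0/16",
    "prod": "10.30.0.0/16",
}


def overlapping_environments(cidrs):
    """Return the pairs of environments whose address ranges overlap."""
    networks = {env: ipaddress.ip_network(cidr) for env, cidr in cidrs.items()}
    return [
        (first, second)
        for first, second in combinations(sorted(networks), 2)
        if networks[first].overlaps(networks[second])
    ]
```

An empty list means the plan is free of overlaps; any pair returned identifies two environments whose ranges need to be revisited.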

Region

The architecture above is to be deployed within 1 AWS Region. You can replicate the same architecture in different Regions. However, you should be careful NOT to cluster Talend Administration Center across AWS Regions because of network latency and other network-related issues that may arise.

Availability Zone

It is a best practice from AWS to deploy the platform in 2 availability zones at a minimum for Production environments. The more availability zones you use, the better availability you will have for the production environment.

In the above architecture, we deploy Talend Administration Center in EC2 instances in 2 availability zones. In Production, depending on the critical nature of the platform, we can have 1 Talend Administration Center running or both Talend Administration Centers running. We can also configure an AutoScaling group in AWS so that a second Talend Administration Center is started if the first one is not available anymore.

We also have Job Servers in 2 availability zones, for greater platform availability and ability to execute more jobs.
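
The idea of spreading Job Servers over Availability Zones can be sketched as a simple round-robin placement. This is only an illustration of the principle, not a Talend or AWS feature; the server and zone names passed to the function are hypothetical.

```python
from itertools import cycle


def assign_to_zones(servers, zones):
    """Distribute servers over availability zones in round-robin order."""
    if not zones:
        raise ValueError("at least one availability zone is required")
    placement = {zone: [] for zone in zones}
    for server, zone in zip(servers, cycle(zones)):
        placement[zone].append(server)
    return placement
```

With three Job Servers and two zones, the first zone receives two servers and the second receives one, so losing either zone still leaves at least one Job Server running.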

AutoScaling

AutoScaling is optional, but can be used to spin up another Talend Administration Center instance when one or more of the current instances are no longer available. To configure this effectively, we need to properly configure the Talend Administration Center server instance and create an AMI from it. The Talend Administration Center should be set up so that each instance that is started joins the cluster and shares the Job Conductor and other settings.
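
As a rough sketch of what such a group could look like, the function below assembles the parameters for the boto3 Auto Scaling call create_auto_scaling_group. It assumes a launch template has already been created from the Talend Administration Center AMI; the function name, group name, template ID, subnet IDs, sizes and grace period are all hypothetical.

```python
def tac_auto_scaling_params(group_name, launch_template_id, subnet_ids,
                            min_size=1, max_size=2):
    """Build keyword arguments for the Auto Scaling create_auto_scaling_group call.

    subnet_ids should contain one subnet per availability zone so that a
    replacement instance can start in a different zone.
    """
    if not subnet_ids:
        raise ValueError("at least one subnet is required")
    return {
        "AutoScalingGroupName": group_name,
        "LaunchTemplate": {
            "LaunchTemplateId": launch_template_id,
            "Version": "$Latest",
        },
        "MinSize": min_size,
        "MaxSize": max_size,
        "DesiredCapacity": min_size,
        # The API expects the subnets as one comma-separated string.
        "VPCZoneIdentifier": ",".join(subnet_ids),
        "HealthCheckType": "EC2",
        "HealthCheckGracePeriod": 300,  # seconds; an assumed start-up allowance
    }
```

With boto3 available, the dictionary could be passed as keyword arguments to the create_auto_scaling_group method of an "autoscaling" client.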

Databases

The Talend Administration Center (TAC) database should be hosted in RDS. RDS provides high availability for the databases and makes the switch from the master database to the slave database transparent when the master is not available. Amazon handles the routing of traffic to the right instance. Hosting the Talend Administration Center database in RDS enables us to deploy additional Talend Administration Centers and cluster them around the same database for the admin metadata.

Talend Administration Center

The Talend Administration Center (TAC) runs in an EC2 instance. The Talend Administration Center manages the scheduling of tasks and therefore should be an always-on service. We have a Talend Administration Center in each environment so that we can better control access to task configuration and scheduling. Also, the Talend Administration Center in the Development environment is used mostly to manage shared team development, while in Test and Production it is used to schedule Jobs.

It is recommended to have 2 Talend Administration Centers deployed in UAT and Production to improve availability and resilience of the platform in case of failure of 1 Talend Administration Center. Having 2 Talend Administration Centers also helps to limit downtime during migration activities since we can power 1 Talend Administration Center down and upgrade while the other instance keeps running.

Job Server

The Job Server EC2 Instance does not need to be running all the time. The only time we need the Job Server EC2 instance is when we have jobs to run and process data. Talend provides features to automatically start, stop and terminate job servers depending on the scheduling of jobs.

For more information about how to execute data integration Jobs on a server based on Amazon EC2, see the Talend Administration Center User Guide.
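
For readers curious about the underlying EC2 operations, the sketch below shows the general pattern of starting an instance when work is pending and stopping it otherwise. It is not how Talend Administration Center implements the feature. The function name and the jobs_pending input are hypothetical; the ec2 argument is assumed to be a boto3 EC2 client, whose start_instances and stop_instances methods take an InstanceIds list.

```python
def ensure_job_server_state(ec2, instance_id, jobs_pending):
    """Start the Job Server instance if Jobs are pending, otherwise stop it.

    `ec2` is expected to behave like a boto3 EC2 client.
    Returns the action that was requested.
    """
    if jobs_pending > 0:
        ec2.start_instances(InstanceIds=[instance_id])
        return "started"
    ec2.stop_instances(InstanceIds=[instance_id])
    return "stopped"
```

A stopped instance no longer incurs compute charges, which is what makes this pattern attractive for Job Servers that only run scheduled workloads.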

Talend Studio

In the above architecture, we expect developers to run Talend Studio on their local workstations. Hence, their Studio connects to the Talend Administration Center in the Development environment and to Git/SVN. It is also possible to instantiate EC2 servers for running Talend Studio if the connectivity between the developers' workstations and the AWS network is slow.

Browser/Operators

The Ops and support teams use their browsers to access Talend Administration Center, Jenkins, Nexus, etc.

Git

The Git instance can run on 1 or more EC2 instances. If a Git installation already exists, we will leverage it. If not, we will set up Git on one EC2 instance. Git access is needed for the Development environment only.

Jenkins

Jenkins will be running on 1 or more EC2 instances. If possible, we will leverage an existing Jenkins to drive the Continuous Integration and Continuous Deployment.

Nexus

Nexus will be running on 1 EC2 instance. It will have multiple repositories as below:

  • snapshots repository for Development environment
  • releases-qa repository for Test/UAT environment
  • releases-prod for Production environment
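
The mapping between environments and repositories above can be captured in a small lookup, sketched here in Python. The environment keys, the base URL, and the /content/repositories/ path layout are assumptions for illustration; check the URL scheme of your own Nexus installation.

```python
# Repository used by each environment, as listed above.
NEXUS_REPOSITORIES = {
    "development": "snapshots",
    "test-uat": "releases-qa",
    "production": "releases-prod",
}


def repository_url(base_url, environment):
    """Return the repository URL an environment should use.

    The path layout is an assumption and may differ between Nexus versions.
    """
    repository = NEXUS_REPOSITORIES[environment]
    return f"{base_url.rstrip('/')}/content/repositories/{repository}/"
```

Keeping the mapping in one place makes it harder for a Production deployment to pull artifacts from the snapshots repository by mistake.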

Amazon Redshift

Amazon Redshift is tied to a Region and Availability Zone. It is possible to configure Redshift for a Multi-AZ (multi-availability zone) setup. Hence, it is shown in the architecture above as an example. If your projects do not require Redshift, you can ignore it.