Amazon Web Services (AWS) - Getting Started
Overview
Amazon Web Services (AWS) provides on-demand computing resources and services in the cloud, with pay-as-you-go pricing, to individuals, companies and governments. The AWS documentation is very extensive and is a good way to get started with the basic services that AWS provides.
In this article, we will outline those services where you can leverage Talend for your data integration requirements. We will describe one reference architecture for deploying Talend within the AWS Cloud Computing platform and highlight some best practices to adopt.
AWS Global Infrastructure - Regions and Availability Zones
The AWS Global Infrastructure is built around Regions and Availability Zones (AZs). At the time of writing (June 2016), AWS operates 12 geographic Regions spread across the Americas, Europe/Middle East/Africa, and Asia Pacific.
AWS operates 33 Availability Zones across these 12 Regions.
Networking
AWS Cloud provides the following 3 Networking services:
Name | Description |
---|---|
Amazon VPC | Amazon Virtual Private Cloud (VPC) lets you define private computing resources that are logically isolated in a virtual network that you own. You can easily control access to the virtual private network and the traffic in and out of it. It is secure and offers an easy way to segregate environments. For example, a Talend customer looking to set up multiple environments will use a separate VPC for each environment. |
AWS Direct Connect | AWS Direct Connect makes it easy to establish a dedicated network connection from an on-premises network to AWS. Its benefits include reduced bandwidth costs and more consistent network performance than internet-based connections. This is generally transparent for Talend usage; the company's IT group will set it up. |
Amazon Route 53 | Amazon Route 53 is a Domain Name System (DNS) web service. It connects user requests to services running within the Cloud and allows the use of a fully qualified domain name instead of an IP address. |
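To illustrate the one-VPC-per-environment idea from the Amazon VPC description, here is a minimal sketch that carves non-overlapping address ranges out of a parent range. The parent range `10.0.0.0/14` and the environment names are assumptions chosen for illustration; the sketch only plans the ranges and does not call any AWS API.

```python
import ipaddress


def plan_vpc_cidrs(parent_cidr, environments, new_prefix=16):
    """Assign one non-overlapping CIDR block to each environment."""
    parent = ipaddress.ip_network(parent_cidr)
    blocks = parent.subnets(new_prefix=new_prefix)
    plan = {}
    for env in environments:
        try:
            plan[env] = next(blocks)
        except StopIteration:
            raise ValueError("parent range is too small for all environments") from None
    return plan


# Hypothetical parent range and environment names.
plan = plan_vpc_cidrs("10.0.0.0/14", ["dev", "test", "prod"])
for env, block in plan.items():
    print(env, block)
```

Keeping the ranges disjoint from the start makes it easier to connect or audit the environments later without renumbering.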
While these Networking services are transparent to Talend, it is important to understand how they work since they determine connectivity. The security, port, firewall, DNS name, and other networking configuration will allow or restrict access to services running within EC2 instances and the AWS Cloud. Hence, it needs to be configured properly for Talend Jobs and Services to be able to connect to resources such as S3 buckets, EMR clusters, databases, etc.
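Before troubleshooting a Talend Job, it can help to confirm that the host running it can reach the target resource at the network level. The sketch below is a generic TCP reachability check using only the Python standard library; it is not a Talend feature, and the function names are hypothetical.

```python
import socket


def can_connect(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and DNS resolution failures.
        return False


def check_endpoints(endpoints, timeout=3.0):
    """Map each endpoint name to its reachability.

    `endpoints` is a dict of name -> (host, port).
    """
    return {
        name: can_connect(host, port, timeout)
        for name, (host, port) in endpoints.items()
    }
```

For example, `check_endpoints({"warehouse": ("my-cluster.example.com", 5439)})` would test a hypothetical Redshift endpoint on its default port. A `False` result points to security group, firewall, routing, or DNS configuration rather than to the Job itself.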
Compute
AWS Cloud provides the following 3 main Compute services:
Name | Description |
---|---|
Amazon EC2 | Amazon Elastic Compute Cloud (EC2) is a web service that provides the ability to commission server instances within minutes. You can instantiate Windows and Linux servers with configurable RAM, CPU and disk space. Server instances can be started in various modes to optimise cost versus the computing power needed. A Talend customer will leverage the EC2 service to instantiate servers for deploying the Talend platform. |
Amazon EC2 Container Service | Amazon EC2 Container Service (ECS) is a container management service that supports Docker containers. The Docker containers are started on EC2 instances within the customer's own VPC, thereby providing a high level of isolation for your applications. It is possible to deploy Talend within Docker containers that run within ECS. |
Elastic Beanstalk | AWS Elastic Beanstalk is a service for deploying web applications and services developed in languages such as Java, .NET, PHP, Node.js and Python, or packaged as Docker containers, on familiar servers such as Apache, Nginx, Passenger and IIS. It only requires the developer to upload the application binaries. The service handles deployment, capacity provisioning, load balancing, auto-scaling and application health monitoring. |
Databases
AWS Cloud provides the following Database services:
Name | Description |
---|---|
Amazon RDS | Amazon Relational Database Service (RDS) is an easy-to-operate, scalable relational database service in the cloud. Amazon RDS makes it easy to deploy highly available database instances based on Amazon Aurora, Oracle, Microsoft SQL Server, PostgreSQL, MySQL and MariaDB. Customers can deploy Master and Slave topologies for their database needs and let RDS handle failover and high availability. Multiple instances can be deployed in different Availability Zones for a Multi-AZ configuration. |
Amazon Redshift | Amazon Redshift is a columnar storage technology that has been optimized for petabyte-scale data warehouse projects. The architecture of the service allows customers to automate the most common tasks of provisioning, configuring, and monitoring a cloud data warehouse. The data warehouse content can be configured to back up automatically and incrementally to Amazon S3. |
Amazon DynamoDB | Amazon DynamoDB is a NoSQL database that supports both document and key-value storage models. |
Storage
AWS Cloud provides the following Storage services:
Name | Description |
---|---|
Amazon S3 | Amazon Simple Storage Service (S3) provides customers with an object storage service. It is the storage infrastructure that AWS uses for various other services. Amazon S3 provides a range of storage classes designed for different use cases, including frequently accessed data, infrequently accessed data, and long-term archiving with Amazon Glacier. Transitions between these classes are controlled by configurable policies on the data. |
Amazon Glacier | As mentioned above, Amazon Glacier is built on top of S3 and leverages the same infrastructure. Amazon Glacier is a secure, durable and extremely low-cost storage service for long-term backup and data archiving. Due to the nature of the service, a retrieval request takes several hours to be processed. It is like a replacement for tape. Customers should not use Glacier for frequently accessed data. The typical use case is retrieving data once every few years, for example during disaster recovery. |
Amazon EBS | Amazon Elastic Block Store (EBS) provides persistent block-level storage volumes for use with Amazon EC2 instances in the cloud. Amazon EBS provides raw block I/O access, and each volume is attached to a single server instance. If we need storage attached to multiple instances, we need Amazon Elastic File System (EFS), which exposes the NFSv4 protocol. |
Amazon EFS | Amazon Elastic File System (EFS) is a simple, scalable file storage system, exposing the NFSv4 protocol, that multiple EC2 instances can use at the same time. With Amazon EFS, the storage capacity is elastic, growing and shrinking automatically as we add or remove files. We use Amazon EFS for the Job Archives folder when multiple Talend Administration Center (TAC) instances are clustered at the scheduler level. |
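The policies that move S3 data between storage classes can be expressed as a lifecycle configuration. The sketch below builds such a configuration as a plain Python dictionary, in the shape we understand the S3 lifecycle API to accept; it does not call AWS. The rule ID, the `archive/` prefix and the day counts are assumptions for illustration, and the exact constraints on transition days should be checked against the S3 documentation.

```python
import json


def lifecycle_rule(rule_id, prefix, ia_after_days, glacier_after_days):
    """Build one lifecycle rule: Standard -> Standard-IA -> Glacier."""
    if not 0 < ia_after_days < glacier_after_days:
        raise ValueError("transition days must be positive and increasing")
    return {
        "ID": rule_id,
        "Filter": {"Prefix": prefix},
        "Status": "Enabled",
        "Transitions": [
            {"Days": ia_after_days, "StorageClass": "STANDARD_IA"},
            {"Days": glacier_after_days, "StorageClass": "GLACIER"},
        ],
    }


# Hypothetical rule: archive old extracts under the "archive/" prefix.
configuration = {"Rules": [lifecycle_rule("archive-old-extracts", "archive/", 30, 365)]}
print(json.dumps(configuration, indent=2))
```

Data that Talend Jobs read frequently should stay outside the prefix covered by such a rule, since Glacier retrievals take hours.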
Analytics
AWS Cloud provides the following Analytics services:
Name | Description |
---|---|
Amazon EMR | Amazon Elastic MapReduce (EMR) is a managed Hadoop framework provided as a service. It runs across multiple EC2 instances and can run distributed frameworks such as Apache Spark. |
Amazon Kinesis | Amazon Kinesis is a platform for streaming data on AWS. It is essentially a SaaS offering that behaves much like Apache Kafka: a high-throughput, distributed, publish-subscribe message queue. Applications connecting to Amazon Kinesis will generally leverage Spark Streaming. |
Amazon Elasticsearch Service | Amazon Elasticsearch Service is a managed service for the Elasticsearch engine. Elasticsearch is a popular open-source analytics engine that is frequently used with Logstash and Kibana for log analytics and real-time monitoring. |
Security
AWS Cloud provides the following Security services:
Name | Description |
---|---|
AWS IAM | AWS Identity and Access Management (IAM) enables you to control access to AWS services and resources for your users and processes. IAM provides the functionality to create users, groups, roles, permissions, and policies. It is a central place to configure access. |
AWS Directory Service | AWS Directory Service provides Microsoft Active Directory (AD) in the cloud, or connects to an on-premises Microsoft Active Directory, to manage your AWS resources. |
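As an illustration of an IAM policy, the sketch below builds a read-only policy document for a single S3 bucket, the kind of permission a Talend Job reading from S3 would need. The bucket name is hypothetical, and the sketch only produces the JSON document; attaching it to a user, group or role is done separately in IAM.

```python
import json


def s3_read_only_policy(bucket):
    """Return an IAM policy document granting read access to one bucket."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                # Listing applies to the bucket itself.
                "Effect": "Allow",
                "Action": ["s3:ListBucket"],
                "Resource": f"arn:aws:s3:::{bucket}",
            },
            {
                # Reading applies to the objects inside the bucket.
                "Effect": "Allow",
                "Action": ["s3:GetObject"],
                "Resource": f"arn:aws:s3:::{bucket}/*",
            },
        ],
    }


# Hypothetical bucket name.
policy = s3_read_only_policy("talend-landing-zone")
print(json.dumps(policy, indent=2))
```

Granting only the actions a Job needs, on only the buckets it uses, keeps each environment's access narrowly scoped.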
Talend Architecture in AWS
AWS Cloud provides the following 3 benefits to customers:
- Ease of deployment
- Scale on demand
- Minimize cost (i.e. only pay for the computing and storage resources you use)
As we deploy Talend in AWS, it is imperative to architect the platform correctly so as to maximize the benefits mentioned above. The following architecture diagram shows how we can deploy Talend in Amazon Cloud:
Environments
Talend recommends deploying a Development, a Test/UAT and a Production environment in AWS. In the above architecture diagram, we show such a setup with 3 environments. The Test/UAT environment should be a replica of the Production environment. For organisations with many parallel projects and development teams, it may be necessary to have multiple Test and UAT environments to minimize dependencies between teams when running UAT.
The environments should be completely segregated from one another, except for access to the Nexus snapshots and releases repositories. In certain scenarios, it is possible to have 1 releases repository for the Test/UAT environment and 1 releases repository for the Production environment. In this case, we can control access to Nexus from the Test/UAT and Production environments through firewall rules.
Region
The architecture above is to be deployed within 1 AWS Region. You can replicate the same architecture in different Regions. However, you should be careful NOT to cluster Talend Administration Center across AWS Regions, because of network latency and other network-related issues that may arise.
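A simple guard against accidentally clustering across Regions is to check the Availability Zone of every planned cluster node. The sketch below assumes the common AZ naming convention of a Region code followed by a single letter (for example `eu-west-1a`); the function names and the zone values are hypothetical.

```python
def region_of(availability_zone):
    """Derive the Region code from an AZ name, e.g. 'eu-west-1a' -> 'eu-west-1'."""
    az = availability_zone.strip()
    if len(az) < 2 or not az[-1].isalpha() or not az[-2].isdigit():
        raise ValueError(f"unexpected Availability Zone name: {availability_zone!r}")
    return az[:-1]


def single_region(node_zones):
    """Return the shared Region of all nodes, or raise if they span several."""
    regions = {region_of(az) for az in node_zones}
    if len(regions) != 1:
        raise ValueError(f"cluster nodes span multiple Regions: {sorted(regions)}")
    return regions.pop()


# Hypothetical placement of two clustered Talend Administration Center nodes.
print(single_region(["eu-west-1a", "eu-west-1b"]))
```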
Availability Zone
It is a best practice from AWS to deploy the platform in 2 availability zones at a minimum for Production environments. The more availability zones you use, the better availability you will have for the production environment.
In the above architecture, we deploy Talend Administration Center in EC2 instances in 2 availability zones. In Production, depending on the critical nature of the platform, we can have 1 Talend Administration Center running or both Talend Administration Centers running. We can also configure an AutoScaling group in AWS so that a second Talend Administration Center is started if the first one is not available anymore.
We also have Job Servers in 2 availability zones, for greater platform availability and ability to execute more jobs.
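The placement of Job Servers across Availability Zones can be planned with a simple round-robin assignment. This is a minimal sketch; the server and zone names are hypothetical, and it only computes a placement plan without creating any instance.

```python
from itertools import cycle


def spread_across_zones(servers, zones):
    """Assign servers to Availability Zones in round-robin order."""
    if len(zones) < 2:
        raise ValueError("use at least two Availability Zones")
    placement = {zone: [] for zone in zones}
    for server, zone in zip(servers, cycle(zones)):
        placement[zone].append(server)
    return placement


# Hypothetical Job Server names and zones.
placement = spread_across_zones(
    ["jobserver-1", "jobserver-2", "jobserver-3"],
    ["us-east-1a", "us-east-1b"],
)
print(placement)
```

With this layout, losing one zone still leaves Job Servers available in the other.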
AutoScaling
AutoScaling is optional, but can be used to spin up another Talend Administration Center instance when one or more of the current instances become unavailable. To configure this effectively, we need to properly configure the Talend Administration Center server instance and create an AMI from it. The Talend Administration Center should be set up so that each instance that is started joins the cluster and shares the Job Conductor and other settings.
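The settings for such an AutoScaling group can be sketched as a small, validated parameter set. The key names below follow the request parameters of the AutoScaling `CreateAutoScalingGroup` API as we understand them and should be verified against the current API reference; the group name, launch configuration name and zones are hypothetical, and the sketch does not call AWS.

```python
def tac_autoscaling_params(name, launch_config, zones, min_size=1, max_size=2, desired=1):
    """Build the parameters for a Talend Administration Center AutoScaling group."""
    if not 0 <= min_size <= desired <= max_size:
        raise ValueError("sizes must satisfy min_size <= desired <= max_size")
    if len(zones) < 2:
        raise ValueError("use at least two Availability Zones")
    return {
        "AutoScalingGroupName": name,
        # The launch configuration references the AMI created from the
        # configured Talend Administration Center instance.
        "LaunchConfigurationName": launch_config,
        "MinSize": min_size,
        "MaxSize": max_size,
        "DesiredCapacity": desired,
        "AvailabilityZones": list(zones),
    }


# Hypothetical names: keep one instance running, allow a second one.
params = tac_autoscaling_params(
    "tac-prod", "tac-prod-launch-config", ["eu-west-1a", "eu-west-1b"]
)
print(params)
```

With a minimum size of 1, the group replaces a failed instance; raising the desired capacity to 2 corresponds to running both Talend Administration Centers.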
Databases
The Talend Administration Center (TAC) database should be hosted in RDS. RDS provides high availability for the database and makes the switch from the master database to the slave database transparent when the master is unavailable. Amazon handles the routing of traffic to the right instance. Hosting the Talend Administration Center database in RDS also enables us to deploy additional Talend Administration Centers and cluster them on the same database for the admin metadata.
Talend Administration Center
The Talend Administration Center (TAC) will be running in an EC2 instance. The Talend Administration Center manages the scheduling of tasks and therefore should be an always-on service. We have a Talend Administration Center in each environment so that we can better control access to task configuration and scheduling. Also, the Talend Administration Center in the Development environment is used mostly to manage shared team development, while in Test and Production it is used to schedule jobs.
It is recommended to have 2 Talend Administration Centers deployed in UAT and Production to improve availability and resilience of the platform in case of failure of 1 Talend Administration Center. Having 2 Talend Administration Centers also helps to limit downtime during migration activities since we can power 1 Talend Administration Center down and upgrade while the other instance keeps running.
Job Server
The Job Server EC2 Instance does not need to be running all the time. The only time we need the Job Server EC2 instance is when we have jobs to run and process data. Talend provides features to automatically start, stop and terminate job servers depending on the scheduling of jobs.
For more information about how to execute data integration Jobs on a server based on Amazon EC2, see the Talend Administration Center User Guide.
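To make the cost argument concrete, the sketch below decides whether a Job Server instance needs to be running at a given moment, based on a list of scheduled runs. It is not the Talend feature itself, only an illustration of the idea; the function name, the 10-minute warm-up and the schedule are assumptions.

```python
from datetime import datetime, timedelta


def should_be_running(now, scheduled_runs, warmup=timedelta(minutes=10)):
    """Return True if any run is in progress or starts within the warm-up window.

    `scheduled_runs` is an iterable of (start, expected_duration) pairs.
    """
    for start, duration in scheduled_runs:
        if start - warmup <= now <= start + duration:
            return True
    return False


# Hypothetical nightly run: starts at 02:00 and is expected to take one hour.
runs = [(datetime(2016, 6, 1, 2, 0), timedelta(hours=1))]
print(should_be_running(datetime(2016, 6, 1, 1, 55), runs))
print(should_be_running(datetime(2016, 6, 1, 12, 0), runs))
```

Outside these windows the instance can be stopped, so compute is only paid for while jobs actually run.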
Talend Studio
In the above architecture, we expect developers to run Talend Studio on their local workstations. Hence, their Studio will connect to the Talend Administration Center in the Development environment and to Git/SVN. It is also possible to instantiate EC2 servers for running Talend Studio if the connectivity between the developers' workstations and the AWS network is slow.
Browser/Operators
The operations and support teams will use their browsers to access Talend Administration Center, Jenkins, Nexus, etc.
Git
The Git instance can run on 1 or more EC2 instances. If possible, we will leverage an existing Git installation. If not, we will set up Git on one EC2 instance. Git access is needed for the Development environment only.
Jenkins
Jenkins will be running on 1 or more EC2 instances. If possible, we will leverage an existing Jenkins installation to drive Continuous Integration and Continuous Deployment.
Nexus
Nexus will be running on 1 EC2 instance. It will have multiple repositories as below:
- snapshots repository for Development environment
- releases-qa repository for Test/UAT environment
- releases-prod repository for Production environment
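The repository layout above can be captured in a small lookup that a deployment script might use to pick the target repository. This is a minimal sketch: the function name is hypothetical, and it assumes the Maven convention that snapshot versions end in `-SNAPSHOT` and that snapshot repositories only hold snapshot versions.

```python
REPOSITORY_BY_ENVIRONMENT = {
    "development": "snapshots",
    "test/uat": "releases-qa",
    "production": "releases-prod",
}


def target_repository(environment, version):
    """Pick the Nexus repository for an artifact version in an environment."""
    try:
        repository = REPOSITORY_BY_ENVIRONMENT[environment.lower()]
    except KeyError:
        raise ValueError(f"unknown environment: {environment!r}") from None
    is_snapshot = version.endswith("-SNAPSHOT")
    if is_snapshot != (repository == "snapshots"):
        raise ValueError(
            f"version {version!r} does not belong in repository {repository!r}"
        )
    return repository


print(target_repository("Development", "1.4.0-SNAPSHOT"))
print(target_repository("Production", "1.4.0"))
```

Rejecting snapshot versions outside Development keeps unreleased builds from reaching the Test/UAT and Production environments.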
Amazon Redshift
An Amazon Redshift cluster is tied to a Region and an Availability Zone. It is possible to configure Redshift for a Multi-AZ (multi-Availability Zone) setup, which is why it is shown in the architecture above as an example. If your projects do not require Redshift, you can ignore it.