Big Data - Getting Started With Your First Job

author: Irshad Burtally
EnrichVersion: 6.4, 6.3, 6.2, 6.1, 6.0
EnrichProdName: Talend Big Data, Talend Data Fabric, Talend Real-Time Big Data Platform, Talend Big Data Platform
task: Design and Development, Design and Development > Designing Jobs > Hadoop distributions
EnrichPlatform: Talend Studio

Big Data - Getting Started With Your First Job

This article provides guidance on how to get started with your first Big Data Job.

The Big Data ecosystem is thriving and continuously evolving. It is large, complex, and sometimes redundant because of the many standards, engines, frameworks, and distributions available in this landscape.

Talend provides developers with tools that increase their productivity while abstracting the complexity of integrating with these Big Data platforms.

Big Data Reference Architecture

The first step in a successful implementation is to get the architecture right. The diagram below shows a high-level Big Data reference architecture with Talend. The diagram only depicts the most important components and connections.
Note: Some connections are not shown in the diagram for the sake of clarity.
You should take the following best practices into consideration when planning your architecture:
  • Plan and implement DEV, TEST, and PROD environments.
  • The environments may share a single cluster or use separate clusters.
  • Use the same big data distribution and version across all environments.
  • Configure the same security access and protocols across all environments. Use different user credentials for executing Big Data jobs in each environment.
  • All servers should be Linux-based, except for the developer workstation.
  • The Talend JobServer agent should be installed on the edge nodes to keep your configuration simple. Talend JobServer can also be installed on nodes that are not part of the cluster, at the cost of a more complex configuration.
  • Use Continuous Integration, and plan for a CI server such as Jenkins to build your Jobs.
  • All Talend Administration Center (TAC) instances will connect to and access the Nexus Snapshots and/or Releases repositories. It is preferable to have a single Nexus instance with multiple repositories defined.
  • Only the Talend Administration Center (TAC) in the DEV environment should access Git/Subversion.
  • Implement separation of concerns, i.e. set up one or more Execution Servers with the JobServer agent for non-Big Data ETL Jobs. Big Data Jobs should be executed on the edge nodes.
  • Developers should configure remote (distant) run to execute their Big Data Jobs on the edge nodes.
  • Edge nodes should be configured with users that are allowed to access the data. The tasks defined in Talend Administration Center should use the Run As feature to specify the user that runs the Big Data Job.
  • Pay attention to the Java version installed on the various servers. Talend requires Java 8 to run. However, Talend Studio can generate Java 7 compatible code for clusters running on Java 7 (see the version-check sketch after this list).
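
As a quick illustration of the last point, the minimal Java sketch below prints the JVM version of the current server and fails when the runtime is older than Java 8. The class name and exit behavior are illustrative assumptions, not part of any Talend product; it simply shows one way to verify the requirement on each server.

    // JavaVersionCheck.java - illustrative sketch, not part of Talend.
    // Prints the JVM version and exits with a non-zero status when the
    // runtime is older than Java 8, which Talend requires.
    public class JavaVersionCheck {

        public static void main(String[] args) {
            String version = System.getProperty("java.version"); // e.g. "1.8.0_181" or "11.0.2"
            System.out.println("Detected Java version: " + version);

            if (!isAtLeastJava8(version)) {
                System.err.println("Java 8 or later is required to run Talend Jobs.");
                System.exit(1);
            }
        }

        // Handles both the legacy "1.x" scheme (Java 8 and earlier) and the
        // "9", "10", ... scheme introduced with Java 9.
        private static boolean isAtLeastJava8(String version) {
            String[] parts = version.split("\\.");
            int major = Integer.parseInt(parts[0]);
            if (major == 1 && parts.length > 1) {
                major = Integer.parseInt(parts[1]); // "1.8.0_181" -> 8
            }
            return major >= 8;
        }
    }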

For more information on all the other components of the Talend architecture, please refer to the Architecture of the Talend products documentation. Follow the Talend Installation Guides to install and configure Talend Studio, Talend Administration Center, and Talend JobServer in your environment.

How to get started with your first Big Data Job

It is often easier to start building Talend Jobs against a local big data distribution running on the developer workstation. However, the best practice is to have a cluster for development. That cluster should mirror the production configuration, i.e. high availability, security, Kerberos, etc.

Leverage Your Big Data Distribution

The following articles provide step-by-step guidance on how to get started with your own big data distribution running either in a VM on your workstation or in AWS for EMR:

For a more in-depth example, see the article Machine Learning 101 - Decision Trees.

Security

The article How to use Kerberos in Talend Studio with Big Data v6.x will help you configure your Talend ETL jobs to use Kerberos to connect to your big data distribution.
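
To give a feel for what such a connection involves under the hood, the sketch below uses the standard Hadoop UserGroupInformation API to log in from a keytab before listing an HDFS directory. The principal, keytab path, and NameNode URI are placeholder assumptions; in a Talend Job these values are supplied through the component settings described in the article above, so you would not normally write this code yourself.

    // KerberosHdfsLogin.java - minimal sketch of a keytab-based Kerberos login
    // using the standard Hadoop client API. The principal, keytab path, and
    // NameNode URI below are placeholders; real values come from your cluster.
    import java.net.URI;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileStatus;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.security.UserGroupInformation;

    public class KerberosHdfsLogin {

        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            conf.set("hadoop.security.authentication", "kerberos");

            // Authenticate against the KDC with the principal and keytab.
            UserGroupInformation.setConfiguration(conf);
            UserGroupInformation.loginUserFromKeytab(
                    "talend_user@EXAMPLE.COM",              // placeholder principal
                    "/etc/security/keytabs/talend.keytab"); // placeholder keytab

            // Once logged in, HDFS calls are issued as the Kerberos principal.
            FileSystem fs = FileSystem.get(URI.create("hdfs://namenode.example.com:8020"), conf);
            for (FileStatus status : fs.listStatus(new Path("/user/talend_user"))) {
                System.out.println(status.getPath());
            }
            fs.close();
        }
    }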

Cluster Setup in AWS

You should ensure that your edge node is properly configured as part of the cluster, and that the necessary ports are opened between the edge node and the Talend Administration Center.
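
As a quick way to validate that connectivity requirement, the short Java sketch below attempts a TCP connection from the Talend Administration Center host to the edge node. The host name and port numbers are placeholder assumptions for illustration; replace them with the ports actually used by your JobServer and cluster services.

    // EdgeNodePortCheck.java - illustrative sketch that verifies TCP connectivity
    // from the Talend Administration Center host to an edge node. The host name
    // and port numbers are placeholders; adjust them to your configuration.
    import java.io.IOException;
    import java.net.InetSocketAddress;
    import java.net.Socket;

    public class EdgeNodePortCheck {

        public static void main(String[] args) {
            String edgeNode = "edge-node.example.com"; // placeholder host name
            int[] ports = {8000, 8001, 8888};          // placeholder JobServer ports

            for (int port : ports) {
                try (Socket socket = new Socket()) {
                    // Fail quickly if the port is filtered by a security group or firewall.
                    socket.connect(new InetSocketAddress(edgeNode, port), 5000);
                    System.out.println("Port " + port + " on " + edgeNode + " is reachable.");
                } catch (IOException e) {
                    System.out.println("Port " + port + " on " + edgeNode + " is NOT reachable: " + e.getMessage());
                }
            }
        }
    }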