Big Data - Getting Started With Your First Job
The Big Data ecosystem is thriving and continuously changing. It is large, complex, and sometimes redundant, owing to the many standards, engines, frameworks, and distributions available in this landscape.
Talend provides developers with tools that increase their productivity while abstracting the complexity of integrating with such Big Data platforms.
Big Data Reference Architecture
- Plan and implement DEV, TEST, and PROD environments.
- The environments may share a single cluster or each use a different cluster.
- Use the same big data distribution and version across all environments.
- Configure the same security access and protocols across all environments. Use different user credentials for executing Big Data jobs in each environment.
- All servers should be Linux-based, except for developer workstations.
- The Talend JobServer agent should be installed on the edge nodes to keep your configuration simple. Talend JobServer can also be installed on nodes that are not part of the cluster, at the cost of a more complex configuration.
- Use Continuous Integration, and plan for a CI server such as Jenkins to build your Jobs.
- All Talend Administration Center (TAC) instances will connect to Nexus Snapshots and/or Releases repositories. It is preferable to have one Nexus instance with multiple repositories defined.
- Only the Talend Administration Center (TAC) in the DEV environment will access Git/Subversion.
- Implement separation of concerns, that is, set up one or more Execution Servers with the JobServer agent for non-Big-Data ETL Jobs. Big Data Jobs should be executed on the edge nodes.
- Developers should configure remote/distant run to execute their Big Data Jobs on the edge nodes.
- Edge nodes should be configured with users that are allowed to access the data. The tasks defined in Talend Administration Center should use the Run As feature to specify the user that runs the Big Data Job.
- Pay attention to the Java version installed on the various servers. Talend requires Java 8 to run. However, Talend Studio can generate Java 7-compatible code for clusters running on Java 7; a minimal runtime check is sketched after this list.
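To confirm which Java runtime a given server actually provides, the small, self-contained check below reads the JVM version at startup. It is a minimal sketch; the "1.7" and "1.8" prefixes it tests are simply the version strings that Java 7 and Java 8 report.

```java
// Minimal sketch: report which Java version a server runs and whether it
// matches what Talend expects (Java 8 to run Talend, Java 7 for generated code only).
public class JavaVersionCheck {
    public static void main(String[] args) {
        String version = System.getProperty("java.version");
        System.out.println("Detected Java version: " + version);

        if (version.startsWith("1.8")) {
            System.out.println("OK: Java 8, suitable for running Talend.");
        } else if (version.startsWith("1.7")) {
            System.out.println("Java 7: only Java 7 compatible generated code can run here.");
        } else {
            System.out.println("Review this version against your cluster and Talend requirements.");
        }
    }
}
```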
For more information on all the other components of the Talend architecture, refer to the Architecture of the Talend products documentation. Follow the Talend Installation Guides to install and configure Talend Studio, Talend Administration Center, and Talend JobServer in your environment.
How to get started with your first Big Data Job
Leverage Your Big Data Distribution
For a more in-depth example, see the article Machine Learning 101 - Decision Trees.
Security
The article How to use Kerberos in Talend Studio with Big Data v6.x will help you configure your Talend ETL jobs to use Kerberos to connect to your big data distribution.
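To illustrate what a Kerberos login looks like at the code level, here is a minimal sketch using the standard Hadoop client API (UserGroupInformation). The principal and keytab path are placeholders for your environment, the sketch assumes the cluster's *-site.xml files are on the classpath (as they are on a configured edge node), and in a real Talend Job the equivalent calls are generated from the component's Kerberos settings rather than written by hand.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.security.UserGroupInformation;

// Minimal sketch of a Kerberos keytab login with the Hadoop client API.
// The principal and keytab path below are placeholders for your environment.
public class KerberosLoginSketch {
    public static void main(String[] args) throws Exception {
        // Assumes core-site.xml / hdfs-site.xml are on the classpath (edge node).
        Configuration conf = new Configuration();
        conf.set("hadoop.security.authentication", "kerberos");
        UserGroupInformation.setConfiguration(conf);

        // Authenticate as the service user that is allowed to access the data.
        UserGroupInformation.loginUserFromKeytab(
                "etl_user@EXAMPLE.COM",                   // placeholder principal
                "/etc/security/keytabs/etl_user.keytab"); // placeholder keytab

        // Any HDFS access after the login runs with the Kerberos credentials.
        FileSystem fs = FileSystem.get(conf);
        System.out.println("Home directory: " + fs.getHomeDirectory());
        fs.close();
    }
}
```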
Cluster Setup in AWS
You should ensure that your edge node is properly configured as part of the cluster, and that the necessary ports are open between the edge node and the Talend Administration Center.
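As a quick way to verify that connectivity, the sketch below attempts a TCP connection with a timeout, for example from the TAC host toward the edge node. The host name is a placeholder and the default port of 8000 is an assumption based on the usual Talend JobServer command port; adjust both to match your security groups and JobServer configuration.

```java
import java.io.IOException;
import java.net.InetSocketAddress;
import java.net.Socket;

// Minimal sketch: check that a TCP port on a remote host is reachable,
// e.g. run it from the TAC host against the edge node's JobServer port.
public class PortCheck {
    public static void main(String[] args) {
        String host = args.length > 0 ? args[0] : "edge-node.example.com"; // placeholder host
        int port = args.length > 1 ? Integer.parseInt(args[1]) : 8000;     // assumed JobServer command port

        try (Socket socket = new Socket()) {
            socket.connect(new InetSocketAddress(host, port), 5000); // 5-second timeout
            System.out.println("Port " + port + " on " + host + " is reachable.");
        } catch (IOException e) {
            System.out.println("Cannot reach " + host + ":" + port + " - " + e.getMessage());
        }
    }
}
```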