Execution and Big Data Proxy Execution

The Remote Engine Gen2 component is used as follows:

Talend Cloud Pipeline Designer: Live preview, access datasets, execute Pipelines
Talend Cloud Data Inventory: Create connections / datasets, samples
Talend Cloud Data Preparation: Access datasets

The Remote Engine Gen2 is a Docker image, so it can be deployed either onto a virtual machine running Docker or (preferably) directly to the container orchestration service of your choice. Either way, the process of setting up a Remote Engine Gen2 can (and should) be fully automated by your own DevOps team.

There are two options for deployment of the IPP Server:

Spark local: Pipeline execution on a single machine, with no external compute dependencies but no horizontal scaling. This option is shown as the IPP Server in the Reference Architecture diagrams.
Edge node: deployment on a machine with access to a big data cluster such as Databricks or AWS EMR. The actual compute is done on the cluster, and the Remote Engine Gen2 acts as a runner that instantiates the process. The machine from which this runner executes is commonly referred to as an Edge Node because it has the network placement, security permissions, and so on required to access the cluster. This option is shown as the IPP Edge Node in the Reference Architecture diagrams.

Assuming enough Remote Engine tokens are available, you could choose to deploy following one or both patterns, or even multiple instances of each pattern. For example, if two different teams required specific placement of their Remote Engine Gen2 in order to gain access to their sources and targets, each team could have an IPP Server and / or an IPP Edge Node.
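As a rough planning aid, the token constraint above can be sketched in a few lines of Python. This is only an illustration: the names `Pattern`, `EngineRequest`, `tokens_needed`, and `fits` are hypothetical and not part of any Talend API, and the sketch assumes that each Remote Engine Gen2 instance consumes exactly one token, which you should confirm against your own license.

```python
from dataclasses import dataclass
from enum import Enum


class Pattern(Enum):
    """The two deployment patterns described above (labels are illustrative)."""
    IPP_SERVER = "Spark local"      # single-machine execution
    IPP_EDGE_NODE = "edge node"     # compute delegated to a big data cluster


@dataclass(frozen=True)
class EngineRequest:
    """One requested Remote Engine Gen2 instance for a given team."""
    team: str
    pattern: Pattern


def tokens_needed(requests: list[EngineRequest]) -> int:
    # Assumption: one token per Remote Engine Gen2 instance.
    return len(requests)


def fits(requests: list[EngineRequest], available_tokens: int) -> bool:
    """Return True if the plan can be deployed with the available tokens."""
    return tokens_needed(requests) <= available_tokens


# Hypothetical plan: team-a wants both patterns, team-b only an edge node.
plan = [
    EngineRequest("team-a", Pattern.IPP_SERVER),
    EngineRequest("team-a", Pattern.IPP_EDGE_NODE),
    EngineRequest("team-b", Pattern.IPP_EDGE_NODE),
]

print(fits(plan, available_tokens=3))
print(fits(plan, available_tokens=2))
```

With three requested engines, the plan fits when three tokens are available and does not fit with two.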
