Enabling parallelization of data flows - Cloud - 8.0

Talend Studio User Guide

Version: Cloud 8.0
Language: English
Product: Talend Big Data, Talend Big Data Platform, Talend Cloud, Talend Data Fabric, Talend Data Integration, Talend Data Management Platform, Talend Data Services Platform, Talend ESB, Talend MDM Platform, Talend Real-Time Big Data Platform
Module: Talend Studio
Content: Design and Development
Last publication date: 2024-02-29
Available in: Big Data, Big Data Platform, Cloud API Services Platform, Cloud Big Data, Cloud Big Data Platform, Cloud Data Fabric, Cloud Data Management Platform, Data Fabric, Data Management Platform, Data Services Platform, MDM Platform, Real-Time Big Data Platform

In Talend Studio, parallelization of data flows means partitioning the input data flow of a subJob into parallel processes and executing them simultaneously to improve performance. These processes always run on the same machine.

Note that this type of parallelization is available only if you have subscribed to one of the Platform solutions or Big Data solutions.

To implement this type of parallel execution, you can use dedicated components or the Set parallelization option in the contextual menu of a Job.

The dedicated components are tPartitioner, tCollector, tRecollector and tDepartitioner.

The following sections explain how to use the Set parallelization option and the related Parallelization vertical tab associated with a Row connection.

You can enable or disable parallelization with a single click, and Talend Studio then automates the implementation across the given Job.

[Image: Job in the design workspace.]

Implementing the parallelization involves four key steps:

  1. Partitioning (Partition): In this step, Talend Studio splits the input records into a given number of threads.
  2. Collecting (Collect): In this step, Talend Studio collects the split threads and sends them to a given component for processing.
  3. Departitioning (Departition): In this step, Talend Studio groups the outputs of the parallel executions of the split threads.
  4. Recollecting (Recollect): In this step, Talend Studio captures the grouped execution results and outputs them to a given component.
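The four steps above can be illustrated with a small conceptual sketch. This is not the code Talend Studio generates (Studio produces Java and its internals are not described here); it is a minimal Python illustration, assuming round-robin partitioning and a placeholder processing step. The function names `partition`, `process`, and `run_parallel` are hypothetical and chosen only for this example.

```python
from concurrent.futures import ThreadPoolExecutor


def partition(records, n_threads):
    """Partition: split the input records round-robin into n_threads buffers.
    Round-robin is an assumption made for this sketch."""
    return [records[i::n_threads] for i in range(n_threads)]


def process(buffer):
    """Collect: one worker thread receives one partition and processes it.
    Upper-casing stands in for whatever the downstream component does."""
    return [row.upper() for row in buffer]


def run_parallel(records, n_threads=3):
    partitions = partition(records, n_threads)
    # Each partition is handled by its own thread, all on the same machine.
    with ThreadPoolExecutor(max_workers=n_threads) as pool:
        outputs = list(pool.map(process, partitions))
    # Departition: group the outputs of the parallel executions.
    merged = [row for out in outputs for row in out]
    # Recollect: hand the grouped results to the next component
    # (here, simply return them to the caller).
    return merged


if __name__ == "__main__":
    print(run_parallel(["a", "b", "c", "d", "e"], n_threads=2))
```

Note that in this sketch the merged output does not keep the original record order, because records are interleaved across partitions before being grouped again. Whether and how order is restored in a real Job depends on the configuration of the connection, which this example does not model.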

Once the automatic implementation is done, you can alter the default configuration by clicking the corresponding connection between components.