Configuring the partitioning step - Cloud - 7.3

Talend Studio User Guide

Version
Cloud
7.3
Language
English
Product
Talend Big Data
Talend Big Data Platform
Talend Cloud
Talend Data Fabric
Talend Data Integration
Talend Data Management Platform
Talend Data Services Platform
Talend ESB
Talend MDM Platform
Talend Real-Time Big Data Platform
Module
Talend Studio
Content
Design and Development
Last publication date
2024-02-13
Available in...

Big Data

Big Data Platform

Cloud API Services Platform

Cloud Big Data

Cloud Big Data Platform

Cloud Data Fabric

Cloud Data Management Platform

Data Fabric

Data Management Platform

Data Services Platform

MDM Platform

Real-Time Big Data Platform

Procedure

  1. Click the link representing the partitioning step to open its Component view and click the Parallelization tab.
    The Partition row option has been automatically selected in the Type area. If you select None, you are actually disabling parallelization for the data flow to be handled over this link. Note that depending on the link you are configuring, a Repartition row option may become available in the Type area to repartition a data flow already departitioned.
    In this Parallelization view, you need to define the following properties:
    • Number of Child Threads: the number of threads you want to split the input records up into. We recommend that this number be N-1 where N is the total number of CPUs or cores on the machine processing the data.

    • Buffer Size: the number of rows to cache for each of the threads generated.

    • Use a key hash for partitions: this allows you to use the hash mode to dispatch the input records into threads.

      Once selecting it, the Key Columns table appears, in which you set the column(s) you want to apply the hash mode on. In the hash mode, the records meeting the same criteria are dispatched into the same threads.

      If you leave this check box clear, the dispatch mode is Round-robin, meaning records are dispatched one-by-one to each thread, in a circular fashion, until the last record is dispatched. Be aware that this mode cannot guarantee that records meeting the same criteria go into the same threads.

  2. In the Number of Child Threads field, enter the number of the threads you want to partition the data flow into. In this example, enter 3 because we are using 4 processors to run this Job.
  3. If required, change the value in the Buffer Size field to adapt the memory capacity. In this example, we leave the default one.