tMahoutClustering in Talend Map/Reduce Jobs - 6.1

Talend Components Reference Guide



The information in this section is only for users that have subscribed to one of the Talend solutions with Big Data and is not applicable to Talend Open Studio for Big Data users.

In a Talend Map/Reduce Job, tMahoutClustering, as well as the other Map/Reduce components preceding it, generates native Map/Reduce code. This section presents the specific properties of tMahoutClustering when it is used in that situation. For further information about a Talend Map/Reduce Job, see Talend Big Data Getting Started Guide.

Component family


This component is deprecated and hidden from the Palette by default, but it continues to work in Jobs you import from older releases. However, you must use JDK 7 to run migrated Jobs that contain tMahoutClustering successfully.

For information about how to show a hidden component on the Palette, see Talend Studio User Guide.

The Spark Batch component tKMeansModel is recommended to replace tMahoutClustering to execute clustering algorithms on datasets.

Basic settings

Schema and Edit schema

A schema is a row description. It defines the number of fields (columns) to be processed and passed on to the next component. The schema is either Built-In or stored remotely in the Repository.

Click Edit schema to make changes to the schema. If the current schema is of the Repository type, three options are available:

  • View schema: choose this option to view the schema only.

  • Change to built-in property: choose this option to change the schema to Built-in for local changes.

  • Update repository connection: choose this option to change the schema stored in the repository and decide whether to propagate the changes to all the Jobs upon completion. If you just want to propagate the changes to the current Job, you can select No upon completion and choose this schema metadata again in the [Repository Content] window.

The output schema of tMahoutClustering provides one read-only column, ClusterID.



Built-In: You create and store the schema locally for this component only. Related topic: see Talend Studio User Guide.



Repository: You have already created the schema and stored it in the Repository. You can reuse it in various projects and Job designs. Related topic: see Talend Studio User Guide.

File configuration

Input HDFS file

Browse to the HDFS file that holds the numerical data to be processed.


Field separator

Enter a character, string or regular expression to separate fields in the input and output data.


Cluster columns

In the Input Column, select the column(s) from the main flow to which you want to apply the clustering algorithm. These columns are used to calculate the clusters.

You can add only numerical columns to this table.
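As an illustration of how the Field separator and Cluster columns settings relate, the following Python sketch splits one line of input on a separator and keeps only the selected numerical columns. It is not the code the component generates; the function name parse_row, the sample rows and the column positions are hypothetical.

```python
import re

def parse_row(line, separator, cluster_columns):
    """Split one input line on the separator (treated as a regular
    expression) and return the selected columns as floats."""
    fields = re.split(separator, line.rstrip("\n"))
    # float() raises ValueError for non-numerical fields, which mirrors
    # the rule that only numerical columns can be used for clustering.
    return [float(fields[i]) for i in cluster_columns]

# Hypothetical sample data: id;name;x;y
rows = ["1;alice;2.5;3.0", "2;bob;4.0;1.5"]
points = [parse_row(r, ";", [2, 3]) for r in rows]
```

Because the separator is interpreted as a regular expression in this sketch, a plain character that has a special meaning in regular expressions (for example | or .) would need to be escaped.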

Clustering Configuration

Clustering type

Select the relevant clustering algorithm from the list:

Canopy: this algorithm uses an approximate distance metric and two distance thresholds, T1 and T2, where T1 > T2. It starts with a set of data points in any order, picks one of them as the centroid of a cluster and approximately measures its distance to all other points. It puts all points that are within distance threshold T1 into a canopy. It then removes from the main set all points that are within distance threshold T2. This way, points that are very close to the centroid are excluded from all further processing. The algorithm then chooses a second centroid among the data points that remain in the main set. It continues until the main set is empty, accumulating a set of canopies, each containing one or more points. A given point may occur in more than one canopy.

Canopy clustering is often used as an initial step in more rigorous clustering techniques, such as K-Means clustering. Starting with Canopy clustering can significantly reduce the number of more expensive distance measurements, because points outside the initial canopies are ignored.
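The Canopy procedure described above can be sketched in a few lines of Python. This is a minimal single-machine illustration, not the Map/Reduce code the component generates; the function names and the sample points are hypothetical, and the exact Euclidean distance stands in for the approximate metric.

```python
import math

def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def canopy(points, t1, t2, distance=euclidean):
    """Group points into possibly overlapping canopies (requires t1 > t2)."""
    if t1 <= t2:
        raise ValueError("t1 must be greater than t2")
    remaining = list(points)
    canopies = []
    while remaining:
        center = remaining.pop(0)      # any remaining point can be the next centroid
        members = [center]
        still_remaining = []
        for p in remaining:
            d = distance(center, p)
            if d <= t1:
                members.append(p)          # within T1: joins this canopy
            if d > t2:
                still_remaining.append(p)  # outside T2: stays in the main set
        remaining = still_remaining
        canopies.append((center, members))
    return canopies

# Hypothetical sample: the point (3, 0) ends up in two canopies.
result = canopy([(0, 0), (1, 0), (3, 0), (10, 0)], t1=4, t2=2)
```

In this sample, (1, 0) is within T2 of the first centroid and is never considered again, while (3, 0) is within T1 but outside T2, so it joins the first canopy and later becomes the centroid of its own.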

K-Means: this algorithm sorts a given data set into k clusters, where k is a number that you must define. The algorithm chooses k random points, which are used as the initial centroids of the k clusters.

The algorithm then assigns each data point in the data set to the nearest cluster center.
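As a rough illustration of these two steps, the Python sketch below picks k random points as centroids, assigns every point to its nearest centroid, and then moves each centroid to the mean of its members. Repeating the process for a fixed number of iterations, the update step, the function names and the sample data are assumptions made for this example; the component's generated code is not shown here.

```python
import math
import random

def nearest(point, centers):
    """Index of the center closest to point (Euclidean distance)."""
    return min(range(len(centers)), key=lambda i: math.dist(point, centers[i]))

def k_means(points, k, iterations=10, seed=0):
    rng = random.Random(seed)
    centers = rng.sample(points, k)    # k random points as initial centroids
    for _ in range(iterations):
        clusters = [[] for _ in range(k)]
        for p in points:
            clusters[nearest(p, centers)].append(p)
        # Move each center to the mean of its members; an empty cluster
        # keeps its previous center.
        centers = [
            tuple(sum(c) / len(members) for c in zip(*members)) if members else centers[i]
            for i, members in enumerate(clusters)
        ]
    return centers, [nearest(p, centers) for p in points]

# Hypothetical sample: two well-separated groups of two points each.
centers, labels = k_means([(0, 0), (0, 1), (10, 10), (10, 11)], k=2)
```

The second return value plays a role comparable to the ClusterID output column: one cluster index per input point.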

Fuzzy K-Means (also called Fuzzy C-Means): this algorithm belongs to the family of fuzzy-logic clustering algorithms. It works like K-Means, but recomputes the cluster centers using the probability that a point belongs to two or more clusters.


Distance measure

Select from the list the distance measure you want to use for clustering:

Euclidean: defines the "ordinary" distance between two points, as if measured with a ruler.

Manhattan: defines the distance between two points if a grid-like path is followed.

Chebyshev: defines the distance between two vectors as the greatest of their differences along any coordinate dimension.

Cosine: uses the cosine of the angle between the two vectors representing the points to be compared.
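The four measures can be written out as follows in Python. This is a sketch for comparison only: the function names are hypothetical, and expressing the cosine measure as one minus the cosine of the angle is an assumption about how the cosine is turned into a distance.

```python
import math

def euclidean(a, b):
    """Straight-line distance."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def manhattan(a, b):
    """Sum of the differences along each dimension (grid-like path)."""
    return sum(abs(x - y) for x, y in zip(a, b))

def chebyshev(a, b):
    """Greatest difference along any single dimension."""
    return max(abs(x - y) for x, y in zip(a, b))

def cosine_distance(a, b):
    """1 - cos(angle between a and b); assumes neither vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)

p, q = (1.0, 2.0), (4.0, 6.0)
```

For these two sample points the differences are 3 and 4, so the Euclidean distance is 5, the Manhattan distance is 7 and the Chebyshev distance is 4. The cosine measure ignores vector length: (1, 0) and (2, 0) are at distance 0, while (1, 0) and (0, 1) are at distance 1.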


Canopy threshold1

The threshold of distance T1 used for the Canopy algorithm.


Canopy threshold2

The threshold of distance T2 used for the Canopy algorithm.


Number of clusters

Enter the maximum number of clusters that can be generated by a clustering algorithm. Some clusters may not have data.


Max iterations

Enter the maximum number of iterations to be carried out for a clustering algorithm.


Convergence delta

Enter a rate of convergence for the algorithm. It must be between 0.0 and 1.0. The greater the rate, the faster the algorithm runs, but the less precise the results are.
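One way to picture how Max iterations and Convergence delta interact is the stopping rule sketched below in Python. It assumes that the delta acts as a threshold on how far the cluster centers move between two iterations, which is an interpretation rather than something this page states; the function names and the toy update step are hypothetical.

```python
import math

def has_converged(old_centers, new_centers, delta):
    """True when no center moved farther than delta."""
    return all(math.dist(o, n) <= delta for o, n in zip(old_centers, new_centers))

def iterate(centers, update, max_iterations, delta):
    """Apply update until the centers stop moving by more than delta
    or the iteration cap is reached. Returns (centers, iterations run)."""
    for iteration in range(1, max_iterations + 1):
        new_centers = update(centers)
        if has_converged(centers, new_centers, delta):
            return new_centers, iteration
        centers = new_centers
    return centers, max_iterations

# Toy update step (hypothetical): every center moves halfway to the origin.
def halve(centers):
    return [tuple(c / 2 for c in center) for center in centers]

precise = iterate([(8.0, 0.0)], halve, max_iterations=10, delta=0.5)
coarse = iterate([(8.0, 0.0)], halve, max_iterations=10, delta=1.0)
capped = iterate([(8.0, 0.0)], halve, max_iterations=2, delta=0.5)
```

With the larger delta the loop stops one iteration earlier and returns a center that is farther from where the toy update is heading, which matches the trade-off between speed and precision described above.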



Fuzziness

Enter the fuzziness parameter for the Fuzzy K-Means algorithm. It must be greater than or equal to 1.0.

When the fuzziness is close to 1, the cluster center closest to a point is given much more weight than the others, and the algorithm behaves similarly to K-Means.
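The effect of the fuzziness value can be seen with the textbook Fuzzy C-Means membership formula, sketched below in Python. Whether the component uses exactly this formula is an assumption, and the function name and sample distances are hypothetical. The formula divides by (fuzziness - 1), so this sketch only accepts values strictly above 1.0.

```python
def memberships(distances, fuzziness):
    """Degree of membership of one point in each cluster, computed from
    its distances to the cluster centers (textbook Fuzzy C-Means formula)."""
    if fuzziness <= 1.0:
        raise ValueError("this sketch needs a fuzziness above 1.0")
    if 0.0 in distances:
        # The point sits exactly on a center: it belongs fully to that cluster.
        hit = distances.index(0.0)
        return [1.0 if i == hit else 0.0 for i in range(len(distances))]
    exponent = 2.0 / (fuzziness - 1.0)
    return [
        1.0 / sum((d_i / d_j) ** exponent for d_j in distances)
        for d_i in distances
    ]

# A point at distance 1 from one center and 2 from another.
soft = memberships([1.0, 2.0], fuzziness=2.0)
sharp = memberships([1.0, 2.0], fuzziness=1.1)
```

With a fuzziness of 2.0 the point is shared between the two clusters with weights of about 0.8 and 0.2. With a fuzziness of 1.1 nearly all of the weight goes to the closer center, which is the K-Means-like behavior described above.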

Global Variables

ERROR_MESSAGE: the error message generated by the component when an error occurs. This is an After variable and it returns a string. This variable functions only if the Die on error check box is cleared, if the component has this check box.

A Flow variable functions during the execution of a component while an After variable functions after the execution of the component.

To fill in a field or expression with a variable, press Ctrl + Space to access the variable list and choose the variable to use from it.

For further information about variables, see Talend Studio User Guide.


tMahoutClustering must be the start component in a Job. You can select an input HDFS file from its basic settings.