tMahoutClustering MapReduce properties

Machine Learning

Talend Documentation Team
Talend Big Data Platform
Talend Data Fabric
Talend Real-Time Big Data Platform
Talend Big Data
Talend Studio

These properties are used to configure tMahoutClustering running in the MapReduce Job framework.

The MapReduce tMahoutClustering component belongs to the Machine Learning family.

This component is available in Talend Platform products with Big Data and in Talend Data Fabric.

Basic settings

Schema and Edit schema

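A schema is a row description. It defines the number of fields (columns) to be processed and passed on to the next component. The schema is either Built-In or stored remotely in the Repository. When you create a Spark Job, avoid the reserved word line when naming the fields.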

Click Edit schema to make changes to the schema. If the current schema is of the Repository type, three options are available:

  • View schema: choose this option to view the schema only.

  • Change to built-in property: choose this option to change the schema to Built-in for local changes.

  • Update repository connection: choose this option to change the schema stored in the repository and decide whether to propagate the changes to all the Jobs upon completion. If you just want to propagate the changes to the current Job, you can select No upon completion and choose this schema metadata again in the [Repository Content] window.

The output schema of tMahoutClustering provides one read-only column, ClusterID.


Built-In: You create and store the schema locally for this component only. Related topic: see Talend Studio User Guide.


Repository: You have already created the schema and stored it in the Repository. You can reuse it in various projects and Job designs. Related topic: see Talend Studio User Guide.

Input HDFS file

Browse to the HDFS file that holds the numerical data to be processed.

Field separator

Enter a character, string or regular expression to separate fields in the input and output data.

Cluster columns

In the Input Column, select the column(s) from the main flow to which you want to apply the clustering algorithm. These columns are used to calculate the clusters.

You can add only numerical columns to this table.

Clustering type

Select the relevant clustering algorithm from the list:

Canopy: this algorithm uses an approximate distance metric and two distance thresholds, T1 and T2, where T1 > T2. It starts with a set of data points in any order, picks a point called the centroid of the cluster, and approximately measures its distance to all other points. It puts all points that are within distance threshold T1 into a canopy. It removes from the main set all points that are within distance threshold T2, so that points very close to the centroid are excluded from all further processing. The algorithm then chooses a second centroid among the data points remaining in the main set. It continues until the main set is empty, accumulating a set of canopies, each containing one or more points. A given point may occur in more than one canopy.

Canopy clustering is often used as an initial step in more rigorous clustering techniques, such as K-Means clustering. By starting with Canopy clustering, the number of more expensive distance measurements can be significantly reduced by ignoring points outside of the initial canopies.
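The Canopy procedure described above can be sketched in a few lines of Python. This is only an illustration of the steps as worded here, not the Mahout implementation: the function names canopy and euclidean are hypothetical, the Euclidean distance is an assumed choice, and "within" a threshold is taken to mean strictly less than it.

```python
import math


def euclidean(a, b):
    """Straight-line distance between two points of equal dimension."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def canopy(points, t1, t2, distance=euclidean):
    """Group points into possibly overlapping canopies (illustrative sketch)."""
    if t1 <= t2:
        raise ValueError("T1 must be greater than T2")
    remaining = list(points)
    canopies = []
    while remaining:
        # Pick the next available point as the centroid of a new canopy.
        center = remaining.pop(0)
        members = [center]
        still_remaining = []
        for p in remaining:
            d = distance(center, p)
            if d < t1:
                # Within T1: the point joins this canopy.
                members.append(p)
            if d >= t2:
                # Not within T2: the point stays available for later canopies.
                still_remaining.append(p)
        remaining = still_remaining
        canopies.append((center, members))
    return canopies
```

For example, with the points (0, 0), (1, 0), (2.5, 0) and (10, 0), T1 = 3 and T2 = 1.5, the point (2.5, 0) falls inside the first canopy but is not removed from the main set, so it also becomes the centroid of a second canopy.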

K-Means: it partitions a given data set into k clusters, where k is a number that you must define. The algorithm chooses k random points, which are used as the centroids of the k clusters.

The algorithm then assigns each point in the data set to the nearest cluster center.
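As a rough illustration of the K-Means steps described above, the following Python sketch picks k random points as centroids, assigns every point to its nearest centroid, and then recomputes each centroid as the mean of its members. The recomputation loop and the stop-when-assignments-are-stable rule are assumptions based on the standard algorithm; the function names and the seed parameter are hypothetical and do not correspond to component settings.

```python
import math
import random


def euclidean(a, b):
    """Straight-line distance between two points of equal dimension."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def assign(points, centroids):
    """Return, for each point, the index of its nearest centroid."""
    return [
        min(range(len(centroids)), key=lambda i: euclidean(p, centroids[i]))
        for p in points
    ]


def recompute(points, labels, centroids):
    """Move each centroid to the mean of the points assigned to it."""
    updated = []
    for i, old in enumerate(centroids):
        members = [p for p, label in zip(points, labels) if label == i]
        if not members:
            # An empty cluster keeps its previous centroid.
            updated.append(old)
            continue
        updated.append(
            tuple(sum(p[d] for p in members) / len(members) for d in range(len(old)))
        )
    return updated


def k_means(points, k, max_iterations=10, seed=0):
    """Illustrative K-Means: random initial centroids, then assign and update."""
    centroids = random.Random(seed).sample(points, k)
    labels = assign(points, centroids)
    for _ in range(max_iterations):
        centroids = recompute(points, labels, centroids)
        new_labels = assign(points, centroids)
        if new_labels == labels:
            break  # assignments are stable, so nothing will change further
        labels = new_labels
    return centroids, labels
```

Calling k_means(points, k=2) on two well-separated groups of points returns one centroid per group, together with the cluster index assigned to each point.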

Fuzzy K-Means (also called Fuzzy C-Means): it belongs to the family of fuzzy-logic clustering algorithms. It works like K-Means but recomputes the cluster centers using the probability of a point belonging to two or more clusters.

Distance measure

Select from the list the distance measure you want to use for clustering:

Euclidean: defines the "ordinary" distance between two points, as if measured with a ruler.

Manhattan: defines the distance between two points if a grid-like path is followed.

Chebyshev: defines the distance between two vectors as the greatest of their differences along any single coordinate dimension.

Cosine: uses the cosine of the angle between the two vectors representing the points to be compared.
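To make the four measures concrete, here is a minimal Python sketch of each one for points given as equal-length sequences of numbers. The function names are illustrative only. The cosine variant assumes the distance is taken as one minus the cosine of the angle between the vectors, and that neither vector is all zeros.

```python
import math


def euclidean(a, b):
    """'Ordinary' straight-line distance."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def manhattan(a, b):
    """Distance along a grid-like path: sum of per-coordinate differences."""
    return sum(abs(x - y) for x, y in zip(a, b))


def chebyshev(a, b):
    """Greatest difference along any single coordinate."""
    return max(abs(x - y) for x, y in zip(a, b))


def cosine_distance(a, b):
    """One minus the cosine of the angle between the two vectors (assumed form)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)
```

For the points (0, 0) and (3, 4), the Euclidean distance is 5, the Manhattan distance is 7, and the Chebyshev distance is 4. The cosine measure depends only on direction, so (1, 0) and (2, 0) are at distance 0 from each other.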

Canopy threshold1

The threshold of distance T1 used for the Canopy algorithm.

Canopy threshold2

The threshold of distance T2 used for the Canopy algorithm.

Number of clusters

Enter the maximum number of clusters that can be generated by a clustering algorithm. Some clusters may not have data.

Max iterations

Enter the maximum number of iterations to be carried out for a clustering algorithm.

Convergence delta

Enter a rate of convergence for the algorithm. It must be between 0.0 and 1.0. The greater the rate, the faster the algorithm finishes, but the less precise the results.
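The interplay between Max iterations and Convergence delta can be pictured with the sketch below. It assumes one common interpretation, namely that the algorithm stops as soon as no cluster center has moved by more than the convergence delta during an iteration; this page does not state how the component applies the value internally. The names has_converged, iterate and step are hypothetical.

```python
import math


def has_converged(old_centroids, new_centroids, convergence_delta):
    """True when no centroid moved farther than the convergence delta."""
    return all(
        math.dist(old, new) <= convergence_delta
        for old, new in zip(old_centroids, new_centroids)
    )


def iterate(step, centroids, max_iterations, convergence_delta):
    """Apply `step` until convergence or until max_iterations is reached.

    `step` is any function that maps the current centroids to updated ones.
    Returns the final centroids and the number of iterations carried out.
    """
    for iteration in range(1, max_iterations + 1):
        updated = step(centroids)
        if has_converged(centroids, updated, convergence_delta):
            return updated, iteration
        centroids = updated
    return centroids, max_iterations
```

Under this reading, a larger delta ends the loop after fewer iterations while the centers are still farther from their final positions, which matches the speed-versus-precision trade-off described above.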


Fuzziness

Enter the fuzziness parameter for the Fuzzy K-Means algorithm. It must be greater than or equal to 1.0.

When the fuzziness is close to 1, then the cluster center closest to the point is given much more weight than the others, and the algorithm is similar to K-Means.
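The effect of the fuzziness value can be illustrated with the standard Fuzzy C-Means membership formula, sketched below in Python. Using this textbook formula is an assumption, since this page does not give the formula applied by the component, and the function name memberships is hypothetical. The formula is undefined at exactly 1.0, so the sketch only accepts values above it, and it expects all distances to be positive.

```python
def memberships(distances, fuzziness):
    """Degree of membership of one point in each cluster.

    `distances` holds the point's distance to every cluster center.
    The returned weights sum to 1.
    """
    if fuzziness <= 1.0:
        raise ValueError("this sketch needs a fuzziness strictly greater than 1.0")
    exponent = 2.0 / (fuzziness - 1.0)
    return [
        1.0 / sum((d_i / d_j) ** exponent for d_j in distances)
        for d_i in distances
    ]
```

For a point at distances 1.0 and 2.0 from two centers, a fuzziness of 2.0 gives memberships of 0.8 and 0.2. With a fuzziness of 1.1, the nearest center receives almost all of the weight, which is the K-Means-like behavior described above.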

Global Variables

ERROR_MESSAGE: the error message generated by the component when an error occurs. This is an After variable and it returns a string. This variable functions only if the Die on error check box is cleared, if the component has this check box.

A Flow variable functions during the execution of a component while an After variable functions after the execution of the component.

To fill in a field or expression with a variable, press Ctrl + Space to access the variable list and choose the variable to use from it.

For further information about variables, see Talend Studio User Guide.


Usage rule

tMahoutClustering is deprecated. You must use JDK 7 to be able to run migrated Jobs with tMahoutClustering successfully. If you need to execute clustering algorithms, it is recommended to create a Spark Batch Job and use tKMeansModel instead in that Job.