These properties are used to configure tMahoutClustering running in the MapReduce Job framework.
The MapReduce tMahoutClustering component belongs to the MapReduce family.
This component is available in Talend Platform products with Big Data and in Talend Data Fabric.
The MapReduce framework is deprecated from Talend 7.3 onwards. Use Talend Jobs for Apache Spark to accomplish your integration tasks.
Basic settings
Schema and Edit schema 
A schema is a row description. It defines the number of fields (columns) to be processed and passed on to the next component. When you create a Spark Job, avoid the reserved word line when naming the fields. Click Edit schema to make changes to the schema. If the current schema is of the Repository type, three options are available:
The output schema of tMahoutClustering provides one read-only column, ClusterID. 

Built-In: You create and store the schema locally for this component only. 

Repository: You have already created the schema and stored it in the Repository. You can reuse it in various projects and Job designs. 
Input HDFS file 
Browse to the HDFS file that holds the numerical data to be processed. 
Field separator 
Enter a character, string or regular expression to separate fields in the input and output data. 
Cluster columns 
In the Input Column, select the column(s) from the main flow to which you want to apply the clustering algorithm. These columns are used to calculate the clusters. You can add only numerical columns to this table. 
Clustering type 
Select the relevant clustering algorithm from the list:
Canopy: this algorithm uses an approximate distance metric and two distance thresholds, T1 and T2, where T1 > T2. It starts with a set of data points in any order, picks a point as the centroid of a cluster, and approximately measures its distance to all other points. It puts all points that are within distance threshold T1 into a canopy, then removes from the main set all points that are within distance threshold T2. This way, points that are very close to the centroid are excluded from all further processing. The algorithm then chooses a second centroid among the data points remaining in the main set. It continues until the main set is empty, accumulating a set of canopies, each containing one or more points. A given point may occur in more than one canopy. Canopy clustering is often used as an initial step for more rigorous clustering techniques, such as KMeans clustering. Starting with Canopy clustering can significantly reduce the number of more expensive distance measurements, because points outside the initial canopies are ignored.
KMeans: this algorithm sorts a given data set into a number of clusters, the number of which you must define. It chooses k random points and uses them as the centroids of k clusters. It then associates each data point in the data set with the nearest cluster center.
Fuzzy KMeans: also called Fuzzy C-Means, this algorithm belongs to the family of fuzzy-logic clustering algorithms. It works like KMeans but recomputes the cluster centers using the probability of a point belonging to two or more clusters. 
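The Canopy description above can be sketched in a few lines of Python. This is only an illustration of the steps as described here, not the Mahout or Talend implementation: the function name canopy, the choice of the first remaining point as the next centroid, and the use of a plain Euclidean distance are all assumptions made for the example.

```python
import math


def euclidean(a, b):
    """Straight-line distance between two points of equal dimension."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def canopy(points, t1, t2, distance=euclidean):
    """Group points into canopies using thresholds t1 > t2 (hypothetical helper)."""
    if t1 <= t2:
        raise ValueError("t1 must be greater than t2")
    remaining = list(points)
    canopies = []
    while remaining:
        # Assumption: take the first remaining point as the next centroid.
        center = remaining.pop(0)
        members = [center]
        still_remaining = []
        for p in remaining:
            d = distance(center, p)
            if d < t1:
                members.append(p)          # within T1: joins this canopy
            if d >= t2:
                still_remaining.append(p)  # not within T2: stays in the main set
        remaining = still_remaining
        canopies.append((center, members))
    return canopies


sample = [(0.0, 0.0), (1.0, 0.0), (3.0, 0.0), (10.0, 0.0)]
result = canopy(sample, t1=4.0, t2=2.0)
for center, members in result:
    print(center, members)
```

In this sample, the point (3.0, 0.0) lies within T1 but not within T2 of the first centroid, so it joins the first canopy and also stays in the main set, where it later becomes the centroid of a second canopy. This is how a point can occur in more than one canopy.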
Distance measure 
Select from the list the distance measure you want to use for clustering:
Euclidean: defines the "ordinary" distance between two points, as if measured with a ruler.
Manhattan: defines the distance between two points if a grid-like path is followed.
Chebyshev: defines the greatest difference between two vectors along any single coordinate dimension.
Cosine: uses the cosine of the angle between the two vectors representing the points to be compared. 
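As a rough illustration of how the four measures differ, the following Python sketch computes each one for small vectors. The function names are hypothetical, and expressing the Cosine measure as one minus the cosine similarity is an assumption made for the example; the component's internal formula is not documented here.

```python
import math


def euclidean(a, b):
    """Ruler distance: square root of the summed squared differences."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def manhattan(a, b):
    """Grid-path distance: sum of the absolute differences per coordinate."""
    return sum(abs(x - y) for x, y in zip(a, b))


def chebyshev(a, b):
    """Largest absolute difference along any single coordinate."""
    return max(abs(x - y) for x, y in zip(a, b))


def cosine_distance(a, b):
    """Assumed form: 1 - cos(angle between a and b). Zero vectors are not handled."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)


p, q = (0.0, 0.0), (3.0, 4.0)
print(euclidean(p, q), manhattan(p, q), chebyshev(p, q))
print(cosine_distance((1.0, 0.0), (0.0, 1.0)))
```

For the points (0, 0) and (3, 4), the Euclidean distance is 5, the Manhattan distance is 7, and the Chebyshev distance is 4, which shows that the same pair of points can look closer or farther apart depending on the measure selected.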
Canopy threshold1 
The threshold of distance T1 used for the Canopy algorithm. 
Canopy threshold2 
The threshold of distance T2 used for the Canopy algorithm. 
Number of clusters 
Enter the maximum number of clusters that can be generated by a clustering algorithm. Some clusters may not have data. 
Max iterations 
Enter the maximum number of iterations to be carried out for a clustering algorithm. 
Convergence delta 
Enter the convergence threshold for the algorithm. It must be between 0.0 and 1.0. The greater the value, the sooner the algorithm stops, but the less precise the results are. 
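To show how Number of clusters, Max iterations, and Convergence delta typically interact, here is a minimal KMeans sketch in Python. It is not the component's code: the function name kmeans, the seed parameter, the rule that an empty cluster keeps its previous centroid, and the stop test (largest centroid movement no greater than the delta) are assumptions chosen for illustration.

```python
import math
import random


def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def kmeans(points, k, max_iterations, convergence_delta, seed=0):
    """Hypothetical KMeans loop driven by the three settings described above."""
    if max_iterations < 1:
        raise ValueError("max_iterations must be at least 1")
    rng = random.Random(seed)
    centroids = rng.sample(points, k)  # k random points as starting centroids
    for iteration in range(1, max_iterations + 1):
        # Assign every point to its nearest centroid.
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda i: euclidean(p, centroids[i]))
            clusters[nearest].append(p)
        # Recompute each centroid as the mean of its members.
        new_centroids = []
        for i, members in enumerate(clusters):
            if members:
                dim = len(members[0])
                new_centroids.append(
                    tuple(sum(m[d] for m in members) / len(members) for d in range(dim))
                )
            else:
                new_centroids.append(centroids[i])  # assumption: empty cluster keeps its centroid
        shift = max(euclidean(c, n) for c, n in zip(centroids, new_centroids))
        centroids = new_centroids
        if shift <= convergence_delta:  # centroids barely moved: stop early
            break
    return centroids, clusters, iteration


data = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (10.0, 10.0), (10.0, 11.0), (11.0, 10.0)]
centroids, clusters, iterations = kmeans(data, k=2, max_iterations=20, convergence_delta=0.5)
print(iterations, centroids)
```

A larger convergence delta lets the loop exit after fewer passes, at the cost of centroids that may not have fully settled. The empty-cluster branch also illustrates why some clusters may end up without data.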
Fuzziness 
Enter the fuzziness parameter for the Fuzzy KMeans algorithm. It must be greater than or equal to 1.0. When the fuzziness is close to 1, the cluster center closest to a point is given much more weight than the others, and the algorithm behaves similarly to KMeans. 
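The effect of the fuzziness value can be illustrated with the textbook fuzzy c-means membership formula, sketched below in Python. This is an assumption about the general algorithm, not a description of the component's internals; the function name memberships is hypothetical, and this formula is only defined for a fuzziness strictly greater than 1.

```python
def memberships(distances, fuzziness):
    """Degree of membership of one point in each cluster, given its
    distance to every cluster center (textbook fuzzy c-means formula)."""
    if fuzziness <= 1.0:
        raise ValueError("this formula needs a fuzziness strictly greater than 1")
    # A point lying exactly on one or more centers belongs only to those centers.
    on_center = [d == 0.0 for d in distances]
    if any(on_center):
        count = sum(on_center)
        return [1.0 / count if flag else 0.0 for flag in on_center]
    # Note: values extremely close to 1 make this exponent huge and can overflow.
    exponent = 2.0 / (fuzziness - 1.0)
    return [
        1.0 / sum((d_i / d_k) ** exponent for d_k in distances)
        for d_i in distances
    ]


# One point at distance 1.0 from the first center and 2.0 from the second.
soft = memberships([1.0, 2.0], fuzziness=2.0)
crisp = memberships([1.0, 2.0], fuzziness=1.1)
print(soft)
print(crisp)
```

With a fuzziness of 2.0, the point belongs to the nearer cluster with a weight of 0.8 and to the farther one with 0.2. With a fuzziness of 1.1, almost all of the weight goes to the nearer cluster, which matches the KMeans-like behavior described above.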
Global Variables
ERROR_MESSAGE: the error message generated by the component when an error occurs. This is an After variable and it returns a string. This variable functions only if the Die on error check box is cleared, if the component has this check box.
A Flow variable functions during the execution of a component, while an After variable functions after the execution of the component.
To fill in a field or expression with a variable, press Ctrl + Space to access the variable list and choose the variable to use from it.
For further information about variables, see Talend Studio User Guide. 
Usage
Usage rule 
tMahoutClustering is deprecated. You must use JDK 7 to run migrated Jobs that contain tMahoutClustering successfully. If you need to execute clustering algorithms, it is recommended to create a Spark Batch Job and use tKMeansModel in that Job instead. 