Configuring the clustering process

Procedure

Double-click tMahoutClustering to open its Component view.
From the Schema list, select Built-In, then click the [...] button next to Edit Schema to describe the data structure of the input file.
Add eight rows to the schema dialog box and define the input data as shown in the above capture.
The component has one read-only column, clusterID.
Click OK.
In the File Configuration area:
- Click the [...] button next to Input HDFS file and browse to the HDFS file on the Hadoop system that holds the numerical input data you want to cluster.
- Set the field separator used to separate the columns in the clustered data.
- In the Cluster columns table, add rows to the table and click in each row to select a column from the input schema.
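To illustrate what the File Configuration settings amount to, the following minimal Python sketch splits delimited rows with a field separator and keeps only the columns chosen for clustering. It is not the code the component generates: the function name load_cluster_vectors, the column names, the sample rows and the ";" separator are all hypothetical and used for illustration only.

```python
import csv
import io


def load_cluster_vectors(text, separator, schema, cluster_columns):
    """Parse delimited rows and keep only the columns chosen for clustering."""
    # Map each selected column name to its position in the input schema.
    indexes = [schema.index(name) for name in cluster_columns]
    reader = csv.reader(io.StringIO(text), delimiter=separator)
    vectors = []
    for row in reader:
        if not row:
            continue  # skip blank lines
        # Clustering works on numerical data, so convert each value to float.
        vectors.append([float(row[i]) for i in indexes])
    return vectors


# Hypothetical schema and sample rows (not the schema from this scenario).
schema = ["id", "col1", "col2", "col3"]
sample = "1;0.5;1.5;2.0\n2;3.0;4.0;5.5\n"

vectors = load_cluster_vectors(sample, ";", schema, ["col1", "col3"])
print(vectors)  # [[0.5, 2.0], [3.0, 5.5]]
```

Columns that are not selected (id and col2 in this sketch) are ignored when the numerical vectors are built.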
In the Clustering Configuration area:
- From the Clustering Type list, select the algorithm you want to use to cluster the numerical data (Fuzzy K-means in this example).
- From the Distance Measure list, select the distance measure you want to use for clustering.
- In the Number of clusters field, enter 3.
- Leave the values in Max iterations and Convergence delta as they are.
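To show how these settings interact, the following Python sketch implements a textbook version of fuzzy k-means. It is an illustration under stated assumptions, not the Mahout implementation that the component runs: the function names, the fuzziness exponent m, the random seeding of the initial centers, the sample points and the values used for max_iterations and convergence_delta are all hypothetical and are not the component's defaults.

```python
import math
import random


def euclidean(a, b):
    """One possible distance measure; any function of two vectors can be passed."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def compute_memberships(points, centers, m, distance):
    """Return, for each point, its degree of membership in every cluster."""
    memberships = []
    for p in points:
        # Guard against a zero distance when a point sits exactly on a center.
        dists = [max(distance(p, c), 1e-12) for c in centers]
        inv = [d ** (-2.0 / (m - 1.0)) for d in dists]
        total = sum(inv)
        memberships.append([v / total for v in inv])
    return memberships


def fuzzy_kmeans(points, k, max_iterations, convergence_delta,
                 m=2.0, distance=euclidean, seed=0):
    """Textbook fuzzy k-means; all parameter values here are assumptions."""
    rng = random.Random(seed)
    centers = [list(p) for p in rng.sample(points, k)]
    dim = len(points[0])
    for _ in range(max_iterations):
        memberships = compute_memberships(points, centers, m, distance)
        new_centers = []
        for j in range(k):
            # Each center is the mean of all points, weighted by membership^m.
            weights = [u[j] ** m for u in memberships]
            wsum = sum(weights)
            new_centers.append(
                [sum(w * p[d] for w, p in zip(weights, points)) / wsum
                 for d in range(dim)]
            )
        # Stop early once no center has moved farther than the delta.
        shift = max(distance(a, b) for a, b in zip(centers, new_centers))
        centers = new_centers
        if shift <= convergence_delta:
            break
    return centers, compute_memberships(points, centers, m, distance)


# Hypothetical sample data, clustered into 3 clusters as in this example.
points = [[1.0, 1.0], [1.2, 0.8], [5.0, 5.0], [5.1, 4.9], [9.0, 1.0], [9.2, 1.1]]
centers, memberships = fuzzy_kmeans(points, k=3, max_iterations=20,
                                    convergence_delta=0.01)

# A single cluster ID per row can be derived from the highest membership.
cluster_ids = [row.index(max(row)) for row in memberships]
print(cluster_ids)
```

In this sketch, the number of clusters fixes how many centers are computed, the maximum number of iterations caps the loop, and the convergence delta stops it early once the centers have stabilized. Unlike plain k-means, each point belongs to every cluster to some degree, and the memberships of a point always sum to 1.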
