Preparing features for KMeans

Machine Learning

Talend Documentation Team
Talend Real-Time Big Data Platform
Talend Data Fabric
Talend Big Data
Talend Big Data Platform
Talend Studio


  1. Double-click the tModelEncoder component to open its Component view.
  2. Click the [...] button next to Edit schema and, on the tModelEncoder side of the schema dialog box that opens, define the schema by adding one column named map, of type Vector.
  3. Click OK to validate these changes and accept the propagation prompted by the pop-up dialog box.
  4. In the Transformations table, add one row by clicking the [+] button and then proceed as follows:
    1. In the Output column column, select the column that carries the features. In this scenario, it is map.
    2. In the Transformation column, select the algorithm to be used for the transformation. In this scenario, it is Vector assembler.
    3. In the Parameters column, enter the parameters you want to customize for use in the Vector assembler algorithm. In this scenario, enter inputCols=latitude,longitude.
    In this transformation, tModelEncoder combines the latitude and longitude input columns into a single feature vector column, map.
  5. Double-click tKMeansModel to open its Component view.
  6. Select the Define a storage configuration component check box and select the tHDFSConfiguration component to be used.
  7. From the Vector to process list, select the column that provides the feature vectors to be analyzed. In this scenario, it is map, which combines all features.
  8. Select the Save the model on file system check box and in the HDFS folder field that is displayed, enter the directory you want to use to store the generated model.
  9. In the Number of cluster field, enter the number of clusters you want tKMeansModel to build. To find a suitable number, run the current Job several times with different values, compare the evaluation results of the model created on each run, and keep the value that gives the best result. For this example, enter 6.
    You need to write the evaluation code yourself.
  10. From the Initialization function list, select Random. In general, this mode is used for simple datasets.
  11. Leave the other parameters as they are.
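
The procedure above leaves the evaluation code to you. As a rough illustration only, the following plain-Python sketch shows the two ideas involved: assembling the latitude and longitude columns into one feature vector per row, and comparing models built with different numbers of clusters by their within-set sum of squared errors (WSSSE). It is not the code that tModelEncoder or tKMeansModel runs. The function names (assemble, kmeans, wssse) and the sample coordinates are hypothetical, and the clustering is a basic Lloyd's algorithm with random initialization, used here as a stand-in.

```python
import random


def assemble(rows, input_cols):
    """Combine the named columns of each row into one feature vector."""
    return [[float(row[col]) for col in input_cols] for row in rows]


def squared_distance(a, b):
    """Squared Euclidean distance between two vectors of equal length."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans(points, k, iterations=20, seed=0):
    """Basic Lloyd's algorithm with random initialization; returns k centers."""
    rng = random.Random(seed)
    # Random initialization: pick k distinct input points as starting centers.
    centers = [list(p) for p in rng.sample(points, k)]
    for _ in range(iterations):
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda i: squared_distance(p, centers[i]))
            clusters[nearest].append(p)
        for i, members in enumerate(clusters):
            if members:  # keep the previous center if a cluster is empty
                centers[i] = [sum(dim) / len(members) for dim in zip(*members)]
    return centers


def wssse(points, centers):
    """Within-set sum of squared errors: lower means tighter clusters."""
    return sum(min(squared_distance(p, c) for c in centers) for p in points)


# Hypothetical sample rows; the scenario's real data is not reproduced here.
rows = [
    {"latitude": 48.85, "longitude": 2.35},
    {"latitude": 48.90, "longitude": 2.40},
    {"latitude": 51.50, "longitude": -0.12},
    {"latitude": 51.55, "longitude": -0.10},
    {"latitude": 40.71, "longitude": -74.00},
    {"latitude": 40.75, "longitude": -73.98},
]

# Same role as inputCols=latitude,longitude in the Vector assembler row.
features = assemble(rows, ["latitude", "longitude"])

# Build one model per candidate number of clusters and record its cost.
costs = {k: wssse(features, kmeans(features, k)) for k in range(1, 5)}
for k, cost in costs.items():
    print(k, round(cost, 4))
```

When reading the printed costs, look for the value of k after which the cost stops dropping sharply. The cost usually keeps decreasing as k grows, so the lowest cost alone is not a good selection criterion.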