tKMeansModel - 6.1

Talend Components Reference Guide

EnrichVersion
6.1
EnrichProdName
Talend Big Data
Talend Big Data Platform
Talend Data Fabric
Talend Data Integration
Talend Data Management Platform
Talend Data Services Platform
Talend ESB
Talend MDM Platform
Talend Open Studio for Big Data
Talend Open Studio for Data Integration
Talend Open Studio for Data Quality
Talend Open Studio for ESB
Talend Open Studio for MDM
Talend Real-Time Big Data Platform
task
Data Governance
Data Quality and Preparation
Design and Development
EnrichPlatform
Talend Studio

Warning

This component will be available in the Palette of the studio on the condition that you have subscribed to any Talend Platform product with Big Data.

Function

tKMeansModel analyzes incoming datasets by applying the K-Means algorithm.

It generates a clustering model out of this analysis and writes this model either in memory or in a given file system.

Purpose

This component analyzes feature vectors usually pre-processed by tModelEncoder to generate a clustering model that is used by tPredictCluster to cluster given elements.

tKMeansModel properties in Spark Batch Jobs

Component family

Machine Learning / Clustering

 

Basic settings

Vector to process

Select the input column used to provide feature vectors. Very often, this column is the output of the feature engineering computations performed by tModelEncoder.

 

Save the model on file system

Select this check box to store the model in a given file system.

 

Number of clusters (K)

Enter the number of clusters into which you want tKMeansModel to group data.

In general, a large number of clusters can decrease errors in predictions but increases the risk of overfitting. Therefore, it is recommended to set a reasonable number based on how many clusters you expect the data to be processed to contain, for example, from observation.
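As an illustration of this trade-off, the following minimal pure-Python sketch (not Talend or Spark code; the names are hypothetical) computes the within-cluster sum of squared errors (WSSSE) for a tiny one-dimensional dataset. Adding clusters always lowers this error, which is why the error alone cannot tell you when K is too large; the "elbow" heuristic looks for the point where further increases in K stop paying off.

```python
# Hypothetical illustration (not Talend/Spark API): within-cluster sum of
# squared errors for candidate sets of cluster centers.
def wssse(points, centers):
    # Sum of squared distances from each point to its nearest center.
    return sum(min((p - c) ** 2 for c in centers) for p in points)

points = [1.0, 1.2, 0.8, 9.0, 9.3, 8.7]  # two obvious groups

one_center = [sum(points) / len(points)]  # K = 1: the global mean
two_centers = [1.0, 9.0]                  # K = 2: near the true groups

print(wssse(points, one_center))   # large error
print(wssse(points, two_centers))  # much smaller error
```

Here K = 2 matches the structure of the data; raising K further would keep shrinking the error while only memorizing noise.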

Set distance threshold of the convergence (Epsilon)

Select this check box and in the Epsilon field that is displayed, enter the convergence distance you want to use. The model training is considered complete once every cluster center moves less than this distance between iterations.

If you leave this check box clear, the default convergence distance 0.0001 is used.

 

Set the maximum number of runs

Select this check box and in the Maximum number of runs field that is displayed, enter the number of iterations you want the Job to perform to train the model.

If you leave this check box clear, the default value 20 is used.
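The two stopping conditions above, the convergence distance (Epsilon) and the maximum number of runs, can be sketched with a minimal pure-Python K-Means training loop (Lloyd's algorithm) on one-dimensional data. This is an illustrative sketch only, not the Talend or Spark implementation; the function and parameter names are assumptions chosen to mirror the settings described here.

```python
import random

def kmeans(points, k, epsilon=1e-4, max_iter=20, seed=0):
    # Minimal 1-D K-Means illustrating both stopping conditions:
    # stop when every center moves less than `epsilon`, or after
    # `max_iter` iterations, whichever comes first.
    rng = random.Random(seed)
    centers = rng.sample(points, k)  # naive random initialization
    for _ in range(max_iter):
        # Assignment step: attach each point to its nearest center.
        clusters = [[] for _ in range(k)]
        for p in points:
            j = min(range(k), key=lambda j: (p - centers[j]) ** 2)
            clusters[j].append(p)
        # Update step: recompute each center as its cluster mean
        # (an empty cluster keeps its previous center).
        new_centers = [sum(c) / len(c) if c else centers[j]
                       for j, c in enumerate(clusters)]
        # Convergence test against epsilon.
        if all(abs(n - o) < epsilon for n, o in zip(new_centers, centers)):
            centers = new_centers
            break
        centers = new_centers
    return sorted(centers)
```

With well-separated data, the epsilon test typically ends training long before the iteration cap is reached; a too-small cap or a too-large epsilon stops training early, at the cost of less accurate centers.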

 

Set the number of parallelized runs

Select this check box and in the Number of parallelized runs field that is displayed, enter the number of iterations you want the Job to run in parallel.

If you leave this check box clear, the default value 1 is used, meaning the iterations are run in succession.

Note that this parameter helps you optimize the use of your resources for the computations but does not impact the prediction performance of the model.

 

Initialization function

Select the mode used to choose the initial cluster centers.

  • Random: the points are selected randomly. In general, this mode is used for simple datasets.

  • K-Means||: this mode is known as Scalable K-Means++, a parallel algorithm that can obtain a nearly optimal initialization result. This is also the default initialization mode.

    For further information about this mode, see Scalable K-Means++.
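The intuition behind this smarter initialization can be sketched with the sequential k-means++ seeding procedure, of which K-Means|| is the parallel, scalable variant. The sketch below is a hypothetical pure-Python illustration, not Talend or Spark code: the first center is drawn uniformly at random, and each subsequent center is drawn with probability proportional to its squared distance from the nearest center already chosen, which spreads the initial centers across the data.

```python
import random

def kmeans_pp_init(points, k, seed=0):
    # Sketch of k-means++ seeding (the sequential ancestor of the
    # K-Means|| mode used by this component).
    rng = random.Random(seed)
    centers = [rng.choice(points)]  # first center: uniform random
    while len(centers) < k:
        # Squared distance of each point to its nearest chosen center.
        d2 = [min((p - c) ** 2 for c in centers) for p in points]
        # Weighted draw: far-away points are more likely to be picked,
        # so the centers spread out instead of clumping together.
        centers.append(rng.choices(points, weights=d2, k=1)[0])
    return centers
```

Because an already-chosen point has squared distance zero to itself, it can never be drawn again, and points in distant, uncovered regions dominate the draw.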

 

Set the number of steps for the initialization

Select this check box and in the Steps field that is displayed, enter the number of initialization rounds to be run for the optimal initialization result.

If you leave this check box clear, the default value 5 is used. 5 rounds are almost always enough for the K-Means|| mode to obtain the optimal result.

 

Define the random seed

Select this check box and in the Seed field that is displayed, enter the seed to be used for the initialization of the cluster centers.

Advanced settings

Display the centers after the processing

Select this check box to output the vectors of the cluster centers into the console of the Run view.

This feature is often useful when you need to understand how the cluster centers move in the process of training your K-Means model.

Usage in Spark Batch Jobs

In a Talend Spark Batch Job, tKMeansModel is used as an end component and requires an input link. The other components used along with it must be Spark Batch components, too. They generate native Spark code that can be executed directly in a Spark cluster.

You can accelerate the training process by adjusting the stopping conditions, such as the maximum number of runs or the convergence distance, but note that training that stops too early can degrade the model's performance.

Log4j

If you are using a subscription-based version of the Studio, the activity of this component can be logged using the log4j feature. For more information on this feature, see Talend Studio User Guide.

For more information on the log4j logging levels, see the Apache documentation at http://logging.apache.org/log4j/1.2/apidocs/org/apache/log4j/Level.html.

Related scenarios

No scenario is available for the Spark Batch version of this component yet.