tKMeansStrModel - 6.1

Talend Components Reference Guide

EnrichVersion
6.1
EnrichProdName
Talend Big Data
Talend Big Data Platform
Talend Data Fabric
Talend Data Integration
Talend Data Management Platform
Talend Data Services Platform
Talend ESB
Talend MDM Platform
Talend Open Studio for Big Data
Talend Open Studio for Data Integration
Talend Open Studio for Data Quality
Talend Open Studio for ESB
Talend Open Studio for MDM
Talend Real-Time Big Data Platform
task
Data Governance
Data Quality and Preparation
Design and Development
EnrichPlatform
Talend Studio

Warning

The streaming version of this component is available in the Palette of the Studio only if you have subscribed to Talend Real-Time Big Data Platform or Talend Data Fabric.

Function

tKMeansStrModel analyzes incoming datasets in near real time by applying the K-Means algorithm.

It continuously updates a K-Means clustering model from this analysis and writes the model either to memory or to a given file system.

Purpose

This component analyzes streaming feature vectors to continuously adapt an existing clustering model to changing circumstances. The incoming data is usually pre-processed by tModelEncoder and the K-Means model is used by tPredictCluster to cluster given elements.

tKMeansStrModel properties in Spark Streaming Jobs

Component family

Machine Learning / Clustering

 

Basic settings

Save on disk

Select this check box to store the clustering model in the HDFS directory you specify in the Path field.

In this case, you also need to enter the time interval (in minutes) at which the model is saved.

If you clear this check box, the model is stored in memory.

 

Path

In the Path field, enter the HDFS directory in which the model is stored or from which it is read.

This field is available when you select either of the check boxes used to save a model to or read a model from a file system.

 

Load a precomputed model from disk

Select this check box to use an existing K-Means model stored in the directory you have specified in the Path field. This is the common case when using tKMeansStrModel. In this situation, the following behaviors can be expected:

  • If you select the Reuse the model transformation associated with the model check box, tKMeansStrModel reuses, along with the model itself, the feature pre-processing algorithms that were applied when this model was created. This reuse allows tKMeansStrModel to transform new incoming data directly into K-Means compliant feature vectors and process these vectors, without requiring the same algorithms to be implemented again.

    However, with this option activated, you need to check the schema of the data that was transformed by these feature pre-processing algorithms and ensure that the new input data to tKMeansStrModel uses the same schema.

    You can find this schema in the Job that initially implemented these feature pre-processing algorithms.

  • If you clear the Reuse the model transformation associated with the model check box, you need to place one or several tModelEncoder components in front of tKMeansStrModel to transform the incoming data to feature vectors required by K-Means. Then select the column that provides these feature vectors from the Vector to process drop-down list that is displayed.

    For further information about tModelEncoder, see tModelEncoder.

  • If the model to be loaded does not actually exist, tKMeansStrModel will automatically initialize 2 clusters to create a K-Means model.

If you clear this Load a precomputed model from disk check box, tKMeansStrModel will create a new K-Means model from scratch.
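The branching described above can be summarized in a short sketch. This is only an illustration of the documented behavior, not the component's generated code: the function name `resolve_model` and its parameters are hypothetical.

```python
def resolve_model(load_from_disk, model_exists, requested_k=None):
    """Return (action, k) describing how the K-Means model is obtained.

    Hypothetical helper that mirrors the documented check-box behavior.
    """
    if load_from_disk:
        if model_exists:
            # K is defined by the stored model, so no value is needed here.
            return ("load", None)
        # Documented fallback: the model is initialized with 2 clusters.
        return ("initialize", 2)
    # Check box cleared: a new model is created with the K entered by the user.
    return ("create", requested_k)


if __name__ == "__main__":
    print(resolve_model(True, True))
    print(resolve_model(True, False))
    print(resolve_model(False, False, requested_k=5))
```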

 

Vector to process

Select the input column used to provide feature vectors. Very often, this column is the output of the feature engineering computations performed by tModelEncoder.

This list appears when you have cleared either the Load a precomputed model from disk check box or the Reuse the model transformation associated with the model check box.

 

Size of your feature vector

Enter the size of the feature vectors to be processed from the column you have selected from the Vector to process list.

 

Display the vector size

Select this check box to display the feature vectors to be used, and therefore their size, in the console of the Run view.

This feature slows down your Job but is useful when you do not know what value to enter in the Size of your feature vector field.

 

Number of clusters (K)

Enter the number of clusters into which you want tKMeansStrModel to cluster data.

In general, a larger number of clusters can decrease errors in predictions but increases the risk of overfitting.

This field appears when you have cleared the Load a precomputed model from disk check box to create a K-Means model from scratch.
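The trade-off behind the choice of K can be illustrated with a minimal, self-contained sketch. It runs a plain batch K-Means on hypothetical one-dimensional data, not the streaming algorithm used by the component, and the function name `kmeans_cost` and the sample values are assumptions made for illustration only.

```python
def kmeans_cost(points, k, iterations=20):
    """Run a basic 1-D K-Means and return the within-cluster sum of squares."""
    n = len(points)
    # Deterministic start: pick k evenly spaced points as initial centers.
    centers = [points[(i * n) // k] for i in range(k)]
    for _ in range(iterations):
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda j: (p - centers[j]) ** 2)
            clusters[nearest].append(p)
        # Move each center to the mean of its points; keep it if it is empty.
        centers = [sum(c) / len(c) if c else centers[j]
                   for j, c in enumerate(clusters)]
    return sum(min((p - c) ** 2 for c in centers) for p in points)


# Hypothetical sample: three visible groups around 1, 5 and 9.
data = [1.0, 1.2, 0.8, 5.0, 5.2, 4.8, 9.0, 9.2, 8.8]

if __name__ == "__main__":
    for k in (1, 2, 3, len(data)):
        print(k, kmeans_cost(data, k))
```

On this sample the cost keeps falling as K grows and reaches zero when every point gets its own cluster, which fits the data perfectly but says nothing useful about new points. That is the overfitting risk mentioned above.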

Decay factor

Enter the decay rate (between 0 and 1) used to discount the weight of existing points against new incoming points when new cluster centers are evaluated.

A lower decay rate gives more importance to the new incoming data. When the decay rate is 0, new cluster centers are determined entirely by the new points; when the decay rate is 1, the existing points and the new incoming points are weighted equally.

 

Time unit

Select the unit on which the decay rate is applied: point or batch of points.
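The effect of the decay rate and the time unit can be sketched for a single cluster as follows. This is a minimal illustration that assumes the component follows the update rule used by Spark MLlib's streaming K-Means, which this page does not state explicitly. The function name `update_center` and its parameters are hypothetical.

```python
def update_center(center, weight, batch, decay, time_unit="batch"):
    """Return (new_center, new_weight) for one cluster after one batch.

    center: current center coordinates
    weight: number of (already discounted) points behind the center
    batch:  new points assigned to this cluster
    """
    m = len(batch)
    # Assumption: with the "point" unit the decay is applied once per point,
    # so a batch of m points discounts the old weight by decay ** m.
    discount = decay if time_unit == "batch" else decay ** m
    old_weight = weight * discount
    if m == 0:
        return list(center), old_weight
    new_weight = old_weight + m
    dim = len(center)
    batch_mean = [sum(p[i] for p in batch) / m for i in range(dim)]
    # Share of the updated center that comes from the new points.
    lam = m / new_weight
    new_center = [(1 - lam) * c + lam * b for c, b in zip(center, batch_mean)]
    return new_center, new_weight


if __name__ == "__main__":
    batch = [[2.0, 2.0], [4.0, 4.0]]
    for decay in (0.0, 0.5, 1.0):
        print(decay, update_center([0.0, 0.0], 2.0, batch, decay))
```

With a decay rate of 0 the old center is forgotten and the new center is the mean of the batch. With a decay rate of 1 the old and new points count equally, so the center moves only part of the way toward the batch mean.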

Advanced settings

Display the centers after the processing

Select this check box to output the vectors of the cluster centers into the console of the Run view.

This feature is often useful when you need to understand how the cluster centers move in the process of training your K-Means model.

Usage in Spark Streaming Jobs

In a Talend Spark Streaming Job, this component is used as an end component and requires an input link. The other components used along with it must be Spark Streaming components, too. They generate native Spark code that can be executed directly in a Spark cluster.

Log4j

If you are using a subscription-based version of the Studio, the activity of this component can be logged using the log4j feature. For more information on this feature, see the Talend Studio User Guide.

For more information on the log4j logging levels, see the Apache documentation at http://logging.apache.org/log4j/1.2/apidocs/org/apache/log4j/Level.html.

Related scenarios

No scenario is available for the Spark Streaming version of this component yet.