ML feature-processing algorithms in Talend - 6.1

Talend Components Reference Guide

EnrichVersion
6.1
EnrichProdName
Talend Big Data
Talend Big Data Platform
Talend Data Fabric
Talend Data Integration
Talend Data Management Platform
Talend Data Services Platform
Talend ESB
Talend MDM Platform
Talend Open Studio for Big Data
Talend Open Studio for Data Integration
Talend Open Studio for Data Quality
Talend Open Studio for ESB
Talend Open Studio for MDM
Talend Real-Time Big Data Platform
task
Data Governance
Data Quality and Preparation
Design and Development
EnrichPlatform
Talend Studio

This table presents the feature processing algorithms you can use in the tModelEncoder component.

Warning

The streaming version of this component is available in the Palette of the studio on the condition that you have subscribed to Talend Real-time Big Data Platform or Talend Data Fabric.

For each algorithm, the table gives the Talend data type of the input column, the Talend data type of the output column, the supported Talend Jobs, the available parameters (see note 1 at the end of this section), the purpose of the algorithm, and a possible scenario.

HashingTF

Input column type: Object

Output column type: Vector

Supported Talend Jobs: Spark Batch, Spark Streaming

  • numFeatures: the number of features that define the dimension of the feature vector.

    For example, you can enter numFeatures=220 to define the dimension. If you do not put any parameter, the default value, 2^20 (1,048,576), is used.

The output vectors are sparse vectors. For example, a document reading "tModelEncoder transforms your data to features." can be transformed to (3,[0,1,2],[1.0,2.0,3.0]), if you put numFeatures=3.

For further information about how to read a sparse vector, see Local vector.

For further information about the Spark API of HashingTF, see ML HashingTF.

As a text-processing algorithm, HashingTF converts input data into fixed-length feature vectors to reflect the importance of a term (a word or a sequence of words) by calculating the frequency with which these terms appear in the input data.

In a Spark Batch Job, it is typically used along with the IDF (Inverse document frequency) algorithm to make the weight calculation more reliable. In the context of a Talend Spark Job, you need to put a second tModelEncoder to apply the Inverse document frequency algorithm on the output of the HashingTF computation.

The data must be already segmented before being sent to the HashingTF computation; therefore, if the data to be used has not been segmented, you need to use another tModelEncoder to apply the Tokenizer algorithm or the Regex tokenizer algorithm to prepare the data.

For further details about the HashingTF implementation in Spark, see HashingTF from the Spark documentation.

It can be used to prepare data for the Classification or the Clustering components from the Machine Learning family in order to create a sentiment analysis model.
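For illustration only, here is a minimal Scala sketch of the equivalent spark.ml HashingTF transformation outside of Talend. It assumes a Spark 2.x spark-shell session in which the spark variable is predefined; the sample data and column names are hypothetical.

    import org.apache.spark.ml.feature.HashingTF

    // One already-tokenized document, as produced by a Tokenizer step.
    val docs = spark.createDataFrame(Seq(
      (0, Seq("tmodelencoder", "transforms", "your", "data", "to", "features"))
    )).toDF("id", "words")

    // numFeatures=3 mirrors the sparse-vector example above; the actual default is 2^20.
    val hashingTF = new HashingTF()
      .setInputCol("words")
      .setOutputCol("rawFeatures")
      .setNumFeatures(3)

    hashingTF.transform(docs).show(false)   // term-frequency vectors of dimension 3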

Inverse document frequency

Input column type: Vector

Output column type: Vector

Supported Talend Jobs: Spark Batch

  • minDocFreq: the minimum number of documents that should contain a term. This number is the threshold that indicates when a term becomes relevant to the IDF computation.

    For example, if you put minDocFreq=5 and a term appears in only 4 documents, this term is considered irrelevant and no IDF weight is actually applied to it.

For further details about the Spark API of this IDF algorithm, see ML feature IDF.

As a text-processing algorithm, Inverse document frequency (IDF) is often used to process the output of the HashingTF computation in order to downplay the importance of the terms that appear in too many documents.

It requires a tModelEncoder component performing the HashingTF computation to provide input data.

For further details about the IDF implementation in Spark, see IDF from the Spark documentation.

It can be used to prepare data for the Classification or the Clustering components from the Machine Learning family in order to create a sentiment analysis model.
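For illustration only, the following Scala sketch chains HashingTF and IDF with spark.ml, mirroring the two chained tModelEncoder components described above. It assumes a Spark 2.x spark-shell session where spark is predefined; the sample corpus is hypothetical.

    import org.apache.spark.ml.feature.{HashingTF, IDF}

    // Two small "documents"; in a Talend Job this input would come from a
    // first tModelEncoder running the HashingTF computation.
    val docs = spark.createDataFrame(Seq(
      (0, Seq("spark", "talend", "features")),
      (1, Seq("spark", "spark", "vectors"))
    )).toDF("id", "words")

    val tf = new HashingTF().setInputCol("words").setOutputCol("rawFeatures").setNumFeatures(16)
    val withTF = tf.transform(docs)

    // minDocFreq=1 only because this sample corpus is tiny.
    val idf = new IDF().setInputCol("rawFeatures").setOutputCol("features").setMinDocFreq(1)
    val idfModel = idf.fit(withTF)          // IDF must be fitted on the corpus first
    idfModel.transform(withTF).show(false)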

Word2Vector

Input column type: List

Output column type: Vector

Supported Talend Jobs: Spark Batch

  • maxIter: maximum number of iterations for obtaining the optimal result. For example, maxIter=5.

  • minCount: minimum number of times a token should appear to be included in the vocabulary of the Word2Vector model. The default is minCount=5.

  • numPartitions: number of partitions.

  • seed: the random seed number.

  • stepSize: the step size for each iteration. This defines the learning rate.

  • vectorSize: size of each feature vector. The default is vectorSize=100, with which 100 numeric values are calculated to identify a document.

If you need to set several parameters, separate these parameters using semicolons (;), for example, maxIter=5;minCount=4.

For further information about the Spark API of Word2Vector, see Word2Vec.

Word2Vector transforms a document into a feature vector, for use in other learning computations such as text similarity calculation.

For further details about the Word2Vector implementation in Spark, see Word2Vec from the Spark documentation.

It can be used to prepare data for the Classification or the Clustering components from the Machine Learning family in order to, for example, find similar user comments about a product.
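For illustration only, a minimal spark.ml Word2Vec sketch in Scala is shown below; it assumes a Spark 2.x spark-shell session where spark is predefined, and the comments used as input are hypothetical.

    import org.apache.spark.ml.feature.Word2Vec

    val comments = spark.createDataFrame(Seq(
      (0, Seq("great", "product", "fast", "delivery")),
      (1, Seq("poor", "product", "slow", "delivery"))
    )).toDF("id", "words")

    // vectorSize and maxIter mirror the parameters listed above;
    // minCount=1 only because this sample corpus is tiny.
    val word2Vec = new Word2Vec()
      .setInputCol("words")
      .setOutputCol("features")
      .setVectorSize(5)
      .setMinCount(1)
      .setMaxIter(5)

    val model = word2Vec.fit(comments)      // Word2Vec is fitted before transforming
    model.transform(comments).show(false)   // one feature vector per comment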

Binarizer

Input column type: Double

Output column type: Double

Supported Talend Jobs: Spark Batch, Spark Streaming

  • threshold: the threshold used to binarize continuous features. The features greater than the threshold are binarized to 1.0 and the features equal to or less than the threshold are binarized to 0.0.

    The default is threshold=0.0.

For further information about the Spark API of Binarizer, see ML Binarizer.

Using the given threshold, Binarizer transforms each feature into a binary feature whose value is either 1.0 or 0.0.

For further details about the Binarizer implementation in Spark, see Binarizer from the Spark documentation.

It can be used to prepare data for the Classification or the Clustering components from the Machine Learning family in order to, for example, estimate whether a user comment indicates this user's satisfaction or dissatisfaction.
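For illustration only, a minimal spark.ml Binarizer sketch in Scala (Spark 2.x spark-shell assumed, spark predefined, hypothetical sample data):

    import org.apache.spark.ml.feature.Binarizer

    val scores = spark.createDataFrame(Seq(
      (0, 0.1), (1, 0.8), (2, 0.5)
    )).toDF("id", "score")

    // Values strictly greater than the threshold become 1.0, the rest 0.0.
    val binarizer = new Binarizer()
      .setInputCol("score")
      .setOutputCol("binarizedScore")
      .setThreshold(0.5)

    binarizer.transform(scores).show()      // 0.1 -> 0.0, 0.8 -> 1.0, 0.5 -> 0.0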

Bucketizer

Input column type: Double

Output column type: Double

Supported Talend Jobs: Spark Batch, Spark Streaming

  • split: parameter used to segment continuous features into buckets. A bucket is a half-open range [x,y) defined by the boundary values (x and y) you give except the last bucket, which also includes y.

    For example, you can put split=Double.NEGATIVE_INFINITY, -0.5, 0.0, 0.5, Double.POSITIVE_INFINITY to segment values such as -0.5, 0.3, 0.0, 0.2. Double.NEGATIVE_INFINITY and Double.POSITIVE_INFINITY are recommended when you do not know the lower and upper bounds of the target column.

For further information about the Spark API of Bucketizer, see Bucketizer.

Bucketizer segments continuous features to a column of feature buckets using the boundary values you define.

For further details about the Bucketizer implementation in Spark, see Bucketizer from the Spark documentation.

It can be used to prepare categorical data for training classification or clustering models.
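For illustration only, the following Scala sketch applies the split example above with spark.ml's Bucketizer (Spark 2.x spark-shell assumed, spark predefined, hypothetical sample values):

    import org.apache.spark.ml.feature.Bucketizer

    val data = spark.createDataFrame(Seq(
      (0, -0.5), (1, 0.3), (2, 0.0), (3, 0.2)
    )).toDF("id", "features")

    // Same boundaries as in the split example; each value is replaced by the
    // index of the bucket it falls into.
    val bucketizer = new Bucketizer()
      .setInputCol("features")
      .setOutputCol("bucketedFeatures")
      .setSplits(Array(Double.NegativeInfinity, -0.5, 0.0, 0.5, Double.PositiveInfinity))

    bucketizer.transform(data).show()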

Normalizer

Input column type: Vector

Output column type: Vector

Supported Talend Jobs: Spark Batch, Spark Streaming

  • p: the p-norm value used to standardize the feature vectors from the input flow to unit norm.

    The default is p=2, meaning to use the Euclidean norm.

For further information about the Spark API of Normalizer, see Normalizer.

Normalizer normalizes each vector of the input data to have unit norm so as to improve the performance of learning computations.

For further information about the Normalizer implementation in Spark, see Normalizer from the Spark documentation.

It can be used to normalize the result of the TF-IDF computation in order to improve the performance of text classification (by tLogisticRegressionModel, for example) or text clustering.
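For illustration only, a minimal spark.ml Normalizer sketch in Scala (Spark 2.x spark-shell assumed, spark predefined, hypothetical vectors):

    import org.apache.spark.ml.feature.Normalizer
    import org.apache.spark.ml.linalg.Vectors

    val vectors = spark.createDataFrame(Seq(
      (0, Vectors.dense(1.0, 2.0, 2.0)),
      (1, Vectors.dense(4.0, 0.0, 3.0))
    )).toDF("id", "features")

    // p=2 (the default) scales each vector to unit Euclidean norm.
    val normalizer = new Normalizer()
      .setInputCol("features")
      .setOutputCol("normFeatures")
      .setP(2.0)

    normalizer.transform(vectors).show(false)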

One hot encoder

Input column type: Double

Output column type: Vector

Supported Talend Jobs: Spark Batch, Spark Streaming

  • dropLast: the boolean parameter used to determine whether to drop the last category.

    The default is dropLast=true, meaning that the last category is dropped: the output vector for that category contains only 0s and each vector has one fewer entry, which saves storage space for the output vectors.

For further information about the Spark API of One hot encoder, see OneHotEncoder.

One hot encoder enables the algorithms that expect continuous features to use categorical features by mapping the column of label indices of the categorical features to a column of binary code.

You can use another tModelEncoder component with the String indexer algorithm to create this column of label indices.

For further information about the OneHotEncoder implementation in Spark, see OneHotEncoder from the Spark documentation.

It can be used to provide feature data to the Classification or the Clustering components, such as tLogisticRegressionModel.
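For illustration only, the following Scala sketch chains StringIndexer and OneHotEncoder with spark.ml, mirroring the two tModelEncoder steps described above. It assumes a Spark 2.x spark-shell session where spark is predefined (in Spark 3.x, OneHotEncoder must be fitted like an estimator); the sample data is hypothetical.

    import org.apache.spark.ml.feature.{OneHotEncoder, StringIndexer}

    val countries = spark.createDataFrame(Seq(
      (0, "FR"), (1, "US"), (2, "FR"), (3, "DE")
    )).toDF("id", "country")

    // Step 1: String indexer builds the column of label indices.
    val indexed = new StringIndexer()
      .setInputCol("country")
      .setOutputCol("countryIndex")
      .fit(countries)
      .transform(countries)

    // Step 2: One hot encoder maps each index to a binary vector.
    val encoded = new OneHotEncoder()
      .setInputCol("countryIndex")
      .setOutputCol("countryVec")
      .setDropLast(true)                    // the default: the last category is dropped
      .transform(indexed)

    encoded.show(false)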

Polynomial expansion

Input column type: Vector

Output column type: Vector

Supported Talend Jobs: Spark Batch, Spark Streaming

  • degree: the polynomial degree to expand. A higher-degree expansion of features often means more accuracy in the model you need to create, but note that too high a degree can lead to overfitting in the result of the predictive analysis based on the same model.

    The default is degree=2, meaning to expand the input features into a 2-degree polynomial space.

For further information about the Spark API of Polynomial expansion, see Polynomial expansion.

Polynomial expansion expands the input features so as to improve the performance of learning computations.

For further details about the PolynomialExpansion implementation in Spark, see PolynomialExpansion from the Spark documentation.

It can be used to process feature data for the Classification or the Clustering components, such as tLogisticRegressionModel.
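For illustration only, a minimal spark.ml PolynomialExpansion sketch in Scala (Spark 2.x spark-shell assumed, spark predefined, hypothetical vectors):

    import org.apache.spark.ml.feature.PolynomialExpansion
    import org.apache.spark.ml.linalg.Vectors

    val features = spark.createDataFrame(Seq(
      (0, Vectors.dense(2.0, 1.0)),
      (1, Vectors.dense(3.0, -1.0))
    )).toDF("id", "features")

    // degree=2 (the default) expands (x, y) into (x, x*x, y, x*y, y*y).
    val polyExpansion = new PolynomialExpansion()
      .setInputCol("features")
      .setOutputCol("polyFeatures")
      .setDegree(2)

    polyExpansion.transform(features).show(false)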

Regex tokenizer

Input column type: String

Output column type: Object

Supported Talend Jobs: Spark Batch, Spark Streaming

  • gaps: the boolean parameter used to indicate whether the regex pattern matches the gaps (delimiters) between tokens (when gaps=true) or repetitively matches the tokens themselves (when gaps=false).

    By default, this parameter is set to true and the default pattern is \\s+, which matches one or more whitespace characters.

  • pattern: the parameter used to set the regex pattern applied to the input text.

  • minTokenLength: the parameter used to filter matched tokens using a minimal length. The default value is 1, so as to avoid returning empty strings.

If you need to set several parameters, separate these parameters using semicolons (;), for example, gaps=true;minTokenLength=4.

For further information about the Spark API of Regex tokenizer, see RegexTokenizer.

Regex tokenizer performs advanced tokenization based on given regex patterns.

For further details about the RegexTokenizer implementation in Spark, see RegexTokenizer from the Spark documentation.

It is often used to process text in terms of text mining for the Classification or the Clustering components, such as tRandomForestModel, in order to create, for example, a spam filtering model.
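For illustration only, a minimal spark.ml RegexTokenizer sketch in Scala (Spark 2.x spark-shell assumed, spark predefined, hypothetical sentence):

    import org.apache.spark.ml.feature.RegexTokenizer

    val sentences = spark.createDataFrame(Seq(
      (0, "tModelEncoder transforms your data to features")
    )).toDF("id", "sentence")

    // gaps=true: the pattern describes the delimiters between tokens;
    // minTokenLength=4 filters out tokens shorter than 4 characters ("to" here).
    val regexTokenizer = new RegexTokenizer()
      .setInputCol("sentence")
      .setOutputCol("words")
      .setPattern("\\s+")
      .setGaps(true)
      .setMinTokenLength(4)

    regexTokenizer.transform(sentences).show(false)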

Tokenizer

Input column type: String

Output column type: Object

Supported Talend Jobs: Spark Batch, Spark Streaming

You do not need to set any additional parameters for Tokenizer.

For further information about the Spark API of Tokenizer, see Tokenizer.

Tokenizer breaks input text (often sentences) into individual terms (often words).

Note that these words are all converted to lowercase.

For further details about the Tokenizer implementation in Spark, see Tokenizer from the Spark documentation.

It is often used to process text in terms of text mining for the Classification or the Clustering components, such as tRandomForestModel, in order to create, for example, a spam filtering model.
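For illustration only, a minimal spark.ml Tokenizer sketch in Scala (Spark 2.x spark-shell assumed, spark predefined, hypothetical sentence):

    import org.apache.spark.ml.feature.Tokenizer

    val sentences = spark.createDataFrame(Seq(
      (0, "Talend Spark Jobs process Big Data")
    )).toDF("id", "sentence")

    // Splits on whitespace and lowercases every resulting word.
    val tokenizer = new Tokenizer()
      .setInputCol("sentence")
      .setOutputCol("words")

    tokenizer.transform(sentences).show(false)   // [talend, spark, jobs, process, big, data]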

Standard scaler

Input column type: Vector

Output column type: Vector

Supported Talend Jobs: Spark Batch

  • withMean: the boolean parameter used to indicate whether to center each vector of feature data with the mean (that is to say, subtract the mean of the feature values from each of these values) before scaling. Centering the data builds a dense output, so an exception is raised when the input data is sparse.

    By default, this parameter is set to false, meaning that no centering occurs.

  • withStd: the boolean parameter used to indicate whether to scale the input data to have unit standard deviation.

    By default, withStd is set to true, meaning that the input feature vectors are normalized to have unit standard deviation.

If you need to set several parameters, separate these parameters using semicolons (;), for example, withMean=true;withStd=true.

Note that if you set both parameters to false, Standard scaler actually does nothing.

For further information about the Spark API of Standard scaler, see StandardScaler.

Standard scaler standardizes each input vector to have unit standard deviation (unit variance), a common case of normal distribution. The standardized data can improve the convergence rate and prevent features with very large variances from exerting overly large influence during model training.

For further details about the StandardScaler implementation in Spark, see StandardScaler from the Spark documentation.

It can be used to prepare data for the Classification or the Clustering components, such as tKMeansModel.
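For illustration only, a minimal spark.ml StandardScaler sketch in Scala (Spark 2.x spark-shell assumed, spark predefined, hypothetical vectors):

    import org.apache.spark.ml.feature.StandardScaler
    import org.apache.spark.ml.linalg.Vectors

    val vectors = spark.createDataFrame(Seq(
      (0, Vectors.dense(1.0, 10.0)),
      (1, Vectors.dense(2.0, 30.0)),
      (2, Vectors.dense(3.0, 50.0))
    )).toDF("id", "features")

    // withStd=true and withMean=false reproduce the default behavior described above.
    val scaler = new StandardScaler()
      .setInputCol("features")
      .setOutputCol("scaledFeatures")
      .setWithStd(true)
      .setWithMean(false)

    val scalerModel = scaler.fit(vectors)   // the scaling statistics are computed first
    scalerModel.transform(vectors).show(false)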

String indexer

Input column type: String

Output column type: Double

Supported Talend Jobs: Spark Batch

You do not need to set any additional parameters for String indexer.

For further information about the Spark API of String indexer, see StringIndexer.

String indexer generates indices for categorical features (string-type labels). These indices can be used by other algorithms such as One hot encoder to build equivalent continuous features.

The indices are ordered by frequency: the most frequent label gets index 0.

For further details about the StringIndexer implementation in Spark, see StringIndexer from the Spark documentation.

String indexer, along with One hot encoder, enables algorithms that expect continuous features to use categorical features.
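For illustration only, a minimal spark.ml StringIndexer sketch in Scala (Spark 2.x spark-shell assumed, spark predefined, hypothetical labels):

    import org.apache.spark.ml.feature.StringIndexer

    val labels = spark.createDataFrame(Seq(
      (0, "red"), (1, "blue"), (2, "red"), (3, "red"), (4, "green")
    )).toDF("id", "color")

    // "red" is the most frequent label, so it receives index 0.0.
    val indexer = new StringIndexer()
      .setInputCol("color")
      .setOutputCol("colorIndex")

    indexer.fit(labels).transform(labels).show()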

Vector indexer

Input column type: Vector

Output column type: Vector

Supported Talend Jobs: Spark Batch

  • maxCategories: the parameter used to set the threshold indicating whether a vector column represents categorical features or continuous features. For example, if you put maxCategories=2, the columns that contain more than 2 distinct values are declared continuous feature columns and the other columns categorical feature columns.

    The default is maxCategories=20.

For further information about the Spark API of Vector indexer, see VectorIndexer.

Vector indexer identifies categorical feature columns based on your definition of the maxCategories parameter and indexes the categories from each of the identified columns, starting from 0. The other columns are declared as continuous feature columns and are not indexed.

For further details about the VectorIndexer implementation in Spark, see VectorIndexer from the Spark documentation.

Vector indexer gives indices to categorical features so that algorithms, such as the Decision Trees computations run by tRandomForestModel, can handle the categorical features appropriately.
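For illustration only, a minimal spark.ml VectorIndexer sketch in Scala (Spark 2.x spark-shell assumed, spark predefined, hypothetical vectors):

    import org.apache.spark.ml.feature.VectorIndexer
    import org.apache.spark.ml.linalg.Vectors

    // First vector component: 2 distinct values, treated as categorical with
    // maxCategories=2; second component: 4 distinct values, treated as continuous.
    val data = spark.createDataFrame(Seq(
      (0, Vectors.dense(0.0, 12.5)),
      (1, Vectors.dense(1.0, 7.3)),
      (2, Vectors.dense(0.0, 33.1)),
      (3, Vectors.dense(1.0, 2.4))
    )).toDF("id", "features")

    val indexer = new VectorIndexer()
      .setInputCol("features")
      .setOutputCol("indexedFeatures")
      .setMaxCategories(2)

    val indexerModel = indexer.fit(data)    // detects which components are categorical
    indexerModel.transform(data).show(false)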

Vector assembler

Input column types: numeric types, Boolean type and Vector type

Output column type: Vector

Supported Talend Jobs: Spark Batch, Spark Streaming

  • inputCols: the parameter used to indicate the input columns to be combined into one single vector column. For example, you can put inputCols=id,country_code to combine the id column and the country_code column.

For further information about the Spark API of Vector assembler, see VectorAssembler.

Vector assembler combines selected input columns into one single vector column that can be used by other algorithms or machine learning computations that expect vector features.

Note that Vector assembler does not re-calculate the features taken from the different columns. It only combines these feature columns into one single vector but keeps the features as they are.

When you select Vector assembler, the Input column column of the Transformation table in the Basic settings view of tModelEncoder is deactivated and you need to use the inputCols parameter in the Parameters column to select the input columns to be combined.

For further details about the VectorAssembler implementation in Spark, see VectorAssembler from the Spark documentation.

Vector assembler prepares feature vectors for the Logistic Regression computations or the Decision Tree computations run by components such as tLogisticRegressionModel and tRandomForestModel.
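For illustration only, a minimal spark.ml VectorAssembler sketch in Scala (Spark 2.x spark-shell assumed, spark predefined, hypothetical columns):

    import org.apache.spark.ml.feature.VectorAssembler

    val users = spark.createDataFrame(Seq(
      (18, 33, 1250.0),
      (75, 49, 980.5)
    )).toDF("id", "country_code", "amount")

    // Equivalent of inputCols=id,country_code,amount in the Parameters column:
    // the selected columns are concatenated, unchanged, into a single vector column.
    val assembler = new VectorAssembler()
      .setInputCols(Array("id", "country_code", "amount"))
      .setOutputCol("features")

    assembler.transform(users).show(false)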

1: If you do not set any parameters yourself, the default ones, if any, are used.