tRandomForestModel - 6.3

Talend Components Reference Guide

EnrichVersion
6.3
EnrichProdName
Talend Big Data
Talend Big Data Platform
Talend Data Fabric
Talend Data Integration
Talend Data Management Platform
Talend Data Services Platform
Talend ESB
Talend MDM Platform
Talend Open Studio for Big Data
Talend Open Studio for Data Integration
Talend Open Studio for Data Quality
Talend Open Studio for ESB
Talend Open Studio for MDM
Talend Real-Time Big Data Platform
task
Data Governance
Data Quality and Preparation
Design and Development
EnrichPlatform
Talend Studio

Warning

This component is available in the Palette of the Studio only if you have subscribed to any Talend Platform product with Big Data or Talend Data Fabric.

Function

tRandomForestModel analyzes incoming datasets by applying the Random Forest algorithm.

It generates a classification model out of this analysis and writes this model either in memory or in a given file system.

Purpose

This component analyzes feature vectors usually pre-processed by tModelEncoder to generate a classifier model that is used by tPredict to classify given elements.

tRandomForestModel properties in Spark Batch Jobs

Component family

Machine Learning / Classification

 

Basic settings

Label column

Select the input column used to provide classification labels. The records of this column are used as the class names (the target, in classification terms) of the elements to be classified.

 

Feature column

Select the input column used to provide features. Very often, this column is the output of the feature engineering computations performed by tModelEncoder.

 

Save the model on file system

Select this check box to store the model in a given file system. Otherwise, the model is stored in memory. The browse button does not work with the Spark Local mode; if you are using the Spark Yarn or the Spark Standalone mode, ensure that you have properly configured the connection in a configuration component in the same Job, such as tHDFSConfiguration.

 

Number of trees in the forest

Enter the number of decision trees you want tRandomForestModel to build.

Each decision tree is trained independently using a random sample of features.

Increasing this number can improve the accuracy by decreasing the variance in predictions, but will increase the training time.

Maximum depth of each tree in the forest

Enter the decision tree depth at which the training should stop adding new nodes. New nodes are either internal nodes, which represent further tests on features, or leaf nodes, which hold the possible class labels.

For a tree of depth n, the number of internal nodes is 2^n - 1. For example, a depth of 1 means 1 internal node plus 2 leaf nodes.

Generally speaking, a deeper decision tree is more expressive and thus potentially more accurate in predictions, but it is also more resource consuming and prone to overfitting.
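Talend Spark Batch Jobs are backed by Spark MLlib, so these two settings correspond to the numTrees and maxDepth arguments of MLlib's RandomForest trainer. The following Scala sketch is an illustration only, not the code the Studio generates, and all values are examples:

    import org.apache.spark.mllib.tree.RandomForest
    import org.apache.spark.mllib.regression.LabeledPoint
    import org.apache.spark.rdd.RDD

    // trainingData: feature vectors paired with their class labels.
    val trainingData: RDD[LabeledPoint] = ???

    val model = RandomForest.trainClassifier(
      trainingData,
      numClasses = 2,                        // for example, spam vs. normal
      categoricalFeaturesInfo = Map[Int, Int](),
      numTrees = 20,                         // Number of trees in the forest
      featureSubsetStrategy = "auto",        // see Subset strategy below
      impurity = "gini",
      maxDepth = 5,                          // Maximum depth of each tree
      maxBins = 32)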

Advanced settings

Subsampling rate

Enter the numeric value indicating the fraction of the input dataset used to train each tree in the forest. The default value 1.0 is recommended, meaning that the whole dataset is used for training.

 

Subset strategy

Select the strategy that determines how many features are considered on each internal node in order to split this internal node (that is, the subset of the training set arriving at this node) into smaller subsets. These subsets are used to build child nodes.

Each strategy takes a different number of features into account to find the optimal split point among these features. This point could be, for example, the value 35 of the continuous feature age.

  • auto: this strategy is based on the number of trees you have set in the Number of trees in the forest field in the Basic settings view. This is the default strategy to be used.

    If the number of trees is 1, the strategy is actually all; if this number is greater than 1, the strategy is sqrt.

  • all: the total number of features is considered for split.

  • sqrt: the number of features to be considered is the square root of the total number of features.

  • log2: the number of features to be considered is the result of log2(M), in which M is the total number of features.
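To make the strategies concrete, the following hypothetical helper (not part of the Studio or of Spark) computes how many features each strategy examines per split for a dataset of m features:

    // Hypothetical illustration: number of features examined per split
    // for a dataset with m features in total.
    def featuresPerSplit(strategy: String, m: Int, numTrees: Int): Int =
      strategy match {
        case "all"  => m
        case "sqrt" => math.sqrt(m.toDouble).ceil.toInt
        case "log2" => (math.log(m.toDouble) / math.log(2.0)).ceil.toInt
        case "auto" => if (numTrees == 1) m
                       else math.sqrt(m.toDouble).ceil.toInt
      }

    featuresPerSplit("sqrt", 100, 20)  // 10
    featuresPerSplit("log2", 100, 20)  // 7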

 

Max bins

Enter the numeric value indicating the maximum number of bins used for splitting features.

Continuous features are automatically transformed into ordered discrete features.

 

Min info gain

Enter the minimum number of information gain to be expected from a parent node to its child nodes. When the number of information gain is less than this minimum number, node split is stopped.

The default value of the minimum number of information gain is 0.0, meaning that no further information is obtained by splitting a given node. As a result, the splitting could be stopped.

For further information about how the information gain is calculated, see Impurity and Information gain from the Spark documentation.
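As a worked example with illustrative numbers, the information gain of a split is the impurity of the parent node minus the weighted impurities of its children, here computed with the Gini measure:

    // Illustrative only: Gini impurity from class counts.
    def giniFromCounts(counts: Seq[Double]): Double = {
      val total = counts.sum
      1.0 - counts.map(c => (c / total) * (c / total)).sum
    }

    // Parent node: 40 spam / 60 normal, split into (30, 10) and (10, 50).
    val parent = giniFromCounts(Seq(40.0, 60.0))   // = 0.480
    val left   = giniFromCounts(Seq(30.0, 10.0))   // = 0.375
    val right  = giniFromCounts(Seq(10.0, 50.0))   // ≈ 0.278
    val gain   = parent - 0.4 * left - 0.6 * right // ≈ 0.163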

 

Min instances per node

Enter the minimum number of training instances a node must have for it to be valid for further splitting.

The default value is 1, meaning that a node stops splitting when it has only 1 row of training data.

 

Impurity

Select the measure used to select the best split from each set of splits.

  • gini: a measure of how often an element would be incorrectly labelled in a split.

  • entropy: a measure of how unpredictable the information in each split is.

For further information about how each of the measures is calculated, see Impurity measures from the Spark documentation.
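For reference, the following Scala sketch shows how the two measures behave on the class proportions of a split (illustrative only):

    // Illustrative only: impurity measures over class proportions p.
    def gini(p: Seq[Double]): Double    = 1.0 - p.map(x => x * x).sum
    def entropy(p: Seq[Double]): Double =
      -p.filter(_ > 0.0).map(x => x * math.log(x) / math.log(2.0)).sum

    gini(Seq(0.5, 0.5))     // 0.5, the maximum for two classes
    entropy(Seq(0.5, 0.5))  // 1.0, the maximum for two classes
    gini(Seq(1.0, 0.0))     // 0.0, a pure split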

 

Set a random seed

Enter the random seed number to be used for bootstrapping and choosing feature subsets.

Usage in Spark Batch Jobs

This component is used as an end component and requires an input link.

You can accelerate the training process by adjusting the stopping conditions, such as the maximum depth of each decision tree, the maximum number of bins for splitting or the minimum information gain, but note that training that stops too early can impact the model's performance.

Model evaluation

The parameters you need to set are free parameters, so their values may be provided by previous experiments, empirical guesses or the like. They do not have optimal values applicable to all datasets.

Therefore, you need to train the classifier model you are generating with different sets of parameter values until you obtain the best confusion matrix. Note, however, that you need to write the evaluation code yourself to rank your model with scores.

You need to select the scores to be used depending on the algorithm you want to use to train your classifier model. This allows you to build the most relevant confusion matrix.

For examples about how the confusion matrix is used in a Talend Job for classification, see Creating a classification model to filter spam.

For a general explanation about confusion matrix, see https://en.wikipedia.org/wiki/Confusion_matrix from Wikipedia.
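For instance, if you collect each prediction together with the known label, Spark MLlib's MulticlassMetrics class can compute the confusion matrix and the usual per-class scores. This is a minimal sketch, assuming a (prediction, label) pair RDD is available:

    import org.apache.spark.mllib.evaluation.MulticlassMetrics
    import org.apache.spark.rdd.RDD

    // predictionAndLabels: (predicted class, actual class) pairs collected
    // from the scored test set, for example from the output of tPredict.
    val predictionAndLabels: RDD[(Double, Double)] = ???

    val metrics = new MulticlassMetrics(predictionAndLabels)
    println(metrics.confusionMatrix)  // rows: actual, columns: predicted
    println(metrics.precision(1.0))   // precision of the positive class
    println(metrics.recall(1.0))      // recall of the positive class
    println(metrics.fMeasure(1.0))    // F1 score of the positive class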

Spark Connection

You need to use the Spark Configuration tab in the Run view to define the connection to a given Spark cluster for the whole Job. In addition, since the Job expects its dependent jar files for execution, one and only one file-system-related component from the Storage family is required in the same Job so that Spark can use this component to connect to the file system to which the jar files the Job depends on are transferred.

This connection is effective on a per-Job basis.

Log4j

If you are using a subscription-based version of the Studio, the activity of this component can be logged using the log4j feature. For more information on this feature, see Talend Studio User Guide.

For more information on the log4j logging levels, see the Apache documentation at http://logging.apache.org/log4j/1.2/apidocs/org/apache/log4j/Level.html.

Creating a classification model to filter spam

In this scenario, you create Spark Batch Jobs. The key components to be used are as follows:

  • tModelEncoder: several tModelEncoder components are used to transform given SMS text messages into feature sets.

  • tRandomForestModel: it analyzes the features incoming from tModelEncoder to build a classification model that understands what a junk message or a normal message could look like.

  • tPredict: in a new Job, it applies this classification model to process a new set of SMS text messages, classifying the spam and the normal messages. In this scenario, the result of this classification is used to evaluate the accuracy of the model, since the classification of the messages processed by tPredict is already known and explicitly marked.

  • A configuration component such as tHDFSConfiguration in each Job: this component is used to connect to the file system to which the jar files dependent on the Job are transferred during the execution of the Job.

    This file-system-related configuration component is required unless you run your Spark Jobs in the Local mode.

Prerequisites:

  • Two sets of SMS text messages: one is used to train classification models and the other is used to evaluate the created models. You can download the train set from trainingSet.zip and the test set from testSet.zip.

    Talend created these two sets out of the dataset downloadable from https://archive.ics.uci.edu/ml/datasets/SMS+Spam+Collection, by using this dataSet_preparation Job to add 3 feature columns (number of currency symbols, number of numeric values and number of exclamation marks) to the raw dataset and proportionally split the dataset.

    An example of the junk messages reads as follows:

    Free entry in 2 a wkly comp to win FA Cup final tkts 21st May 2005. Text FA to 87121 to receive entry question(std txt rate)T&C's apply 08452810075over18's

    An example of the normal messages reads as follows:

    Ahhh. Work. I vaguely remember that! What does it feel like? Lol

    Note that the new features added to the raw dataset were discovered by observing the junk messages used specifically in this scenario (these junk messages often contain prices and/or exclamation marks), so they cannot be generalized to any other junk messages you may want to analyze. In addition, the dataset was randomly split into the two sets and used as is, but in real-world practice you can continue to preprocess it using many different methods, such as dataset balancing, in order to better train your classification model.

  • The two sets must be stored on the machine where the Job is going to be executed, for example in the HDFS system of your Yarn cluster if you use the Spark Yarn client mode to run Talend Spark Jobs, and you must have appropriate rights and permissions to read data from and write data to this system.

    In this scenario, the Spark Yarn client will be used and the datasets are stored in the associated HDFS system.

  • The Spark cluster to be used must have been properly set up and be running.

Creating a classification model using Random Forest

Linking the components
  1. In the Integration perspective of the Studio, create an empty Spark Batch Job, named rf_model_creation for example, from the Job Designs node in the Repository tree view.

    For further information about how to create a Spark Batch Job, see the Getting Started Guide of the Studio.

  2. In the workspace, enter the name of the component to be used and select this component from the list that appears. In this scenario, the components are tHDFSConfiguration, tFileInputDelimited, tRandomForestModel and four tModelEncoder components.

    It is recommended to give the four tModelEncoder components distinct labels so that you can easily recognize the task each of them completes. In this scenario, they are labelled Tokenize, tf, tf_idf and features_assembler, respectively.

  3. Except for tHDFSConfiguration, connect the other components using the Row > Main link.

Configuring the connection to the file system to be used by Spark
  1. Double-click tHDFSConfiguration to open its Component view. Note that tHDFSConfiguration is used because the Spark Yarn client mode is used to run Spark Jobs in this scenario.

    Spark uses this component to connect to the HDFS system to which the jar files dependent on the Job are transferred.

  2. In the Version area, select the Hadoop distribution you need to connect to and its version.

  3. In the NameNode URI field, enter the location of the machine hosting the NameNode service of the cluster.

  4. In the Username field, enter the authentication information used to connect to the HDFS system to be used. Note that the user name must be the same as the one you entered in the Spark configuration tab.

Loading the training set into the Job
  1. Double-click tFileInputDelimited to open its Component view.

  2. Select the Define a storage configuration component check box and select the tHDFSConfiguration component to be used.

    tFileInputDelimited uses this configuration to access the training set to be used.

  3. Click the [...] button next to Edit schema to open the schema editor.

  4. Click the [+] button five times to add five rows and in the Column column, rename them to label, sms_contents, num_currency, num_numeric and num_exclamation, respectively.

    The label and sms_contents columns carry the raw data, which is composed of the SMS text messages in the sms_contents column and the labels indicating whether a message is spam in the label column.

    The other columns are used to carry the features added to the raw dataset, as explained previously in this scenario: the number of currency symbols, the number of numeric values and the number of exclamation marks found in each SMS message.

  5. In the Type column, select Integer for the num_currency, num_numeric and num_exclamation columns.

  6. Click OK to validate these changes.

  7. In the Folder/File field, enter the directory where the training set to be used is stored.

  8. In the Field separator field, enter \t, which is the separator used by the datasets you can download for use in this scenario.
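For reference, the following plain Spark sketch is roughly equivalent to this tFileInputDelimited setup; the HDFS path is hypothetical, and the actual Job reads the file through the connection defined in tHDFSConfiguration:

    import org.apache.spark.{SparkConf, SparkContext}

    val sc = new SparkContext(new SparkConf().setAppName("load-training-set"))

    // The five-column schema defined in the schema editor.
    case class Sms(label: String, sms_contents: String,
                   num_currency: Int, num_numeric: Int, num_exclamation: Int)

    // Hypothetical path; each line of the training set is tab-separated.
    val training = sc.textFile("hdfs:///user/talend/trainingSet")
      .map(_.split("\t"))
      .map(f => Sms(f(0), f(1), f(2).toInt, f(3).toInt, f(4).toInt))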

Transforming SMS text messages to feature vectors using tModelEncoder

This step is meant to implement the feature engineering process.

Transforming messages to words

  1. Double-click the tModelEncoder component labelled Tokenize to open its Component view. This component tokenizes the SMS messages into words.

  2. Click the Sync columns button to retrieve the schema from the preceding component.

  3. Click the [...] button next to Edit schema to open the schema editor.

  4. On the output side, click the [+] button to add one row and in the Column column, rename it to sms_tokenizer_words. This column is used to carry the tokenized messages.

  5. In the Type column, select Object for this sms_tokenizer_words row.

  6. Click OK to validate these changes.

  7. In the Transformations table, add one row by clicking the [+] button and then proceed as follows:

    • In the Input column column, select the column that provides data to be transformed to features. In this scenario, it is sms_contents.

    • In the Output column column, select the column that carries the features. In this scenario, it is sms_tokenizer_words.

    • In the Transformation column, select the algorithm to be used for the transformation. In this scenario, it is Regex tokenizer.

    • In the Parameters column, enter the parameters you want to customize for use in the algorithm you have selected. In this scenario, enter pattern=\\W;minTokenLength=3.

    Using this transformation, tModelEncoder splits each input message on non-word characters, selects only the words containing at least 3 characters and puts the result of the transformation in the sms_tokenizer_words column. Thus currency symbols, numeric values, punctuation and words such as a, an or to are excluded from this column.
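    This transformation corresponds to Spark ML's RegexTokenizer, an assumption based on the parameter names used above; a standalone Scala sketch with the same settings reads as follows:

        import org.apache.spark.ml.feature.RegexTokenizer

        val tokenizer = new RegexTokenizer()
          .setInputCol("sms_contents")
          .setOutputCol("sms_tokenizer_words")
          .setPattern("\\W")      // split on any non-word character
          .setMinTokenLength(3)   // keep only tokens of 3 or more characters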

Calculating the weight of a word in each message

  1. Double-click the tModelEncoder component labelled tf to open its Component view.

  2. Repeat the operations described previously for the tModelEncoder component labelled Tokenize to add the sms_tf_vect column of the Vector type to the output schema, then define the transformation: select sms_tokenizer_words in the Input column column, sms_tf_vect in the Output column column, HashingTF in the Transformation column, and enter numFeatures=15 in the Parameters column.

    In this transformation, tModelEncoder uses HashingTF to convert the already tokenized SMS messages into fixed-length (15 in this scenario) feature vectors to reflect the importance of a word in each SMS message.
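    In plain Spark ML, this step corresponds to the HashingTF transformer (an assumption); the vector length of 15 comes from this scenario:

        import org.apache.spark.ml.feature.HashingTF

        val hashingTF = new HashingTF()
          .setInputCol("sms_tokenizer_words")
          .setOutputCol("sms_tf_vect")
          .setNumFeatures(15)   // fixed vector length used in this scenario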

Downplaying the weight of the irrelevant words in each message

  1. Double-click the tModelEncoder component labelled tf_idf to open its Component view. In this process, tModelEncoder reduces the weight of the words that appear very often but in too many messages, because such a word often brings no meaningful information for text analysis, for example the word the.

  2. Repeat the operations described previously for the tModelEncoder component labelled Tokenize to add the sms_tf_idf_vect column of the Vector type to the output schema, then define the transformation: select sms_tf_vect in the Input column column, sms_tf_idf_vect in the Output column column, and Inverse document frequency in the Transformation column.
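    The Inverse document frequency transformation corresponds to Spark ML's IDF estimator (an assumption). Unlike the previous transformers, it must first be fitted on the TF vectors before it can rescale them; tfData below is a hypothetical DataFrame holding the sms_tf_vect column:

        import org.apache.spark.ml.feature.IDF

        val idf = new IDF()
          .setInputCol("sms_tf_vect")
          .setOutputCol("sms_tf_idf_vect")

        // Fit on the TF vectors first, then rescale them.
        val idfModel = idf.fit(tfData)
        val rescaled = idfModel.transform(tfData)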