Creating a classification model to filter spam - 6.1

Talend Components Reference Guide

EnrichVersion
6.1
EnrichProdName
Talend Big Data
Talend Big Data Platform
Talend Data Fabric
Talend Data Integration
Talend Data Management Platform
Talend Data Services Platform
Talend ESB
Talend MDM Platform
Talend Open Studio for Big Data
Talend Open Studio for Data Integration
Talend Open Studio for Data Quality
Talend Open Studio for ESB
Talend Open Studio for MDM
Talend Real-Time Big Data Platform
EnrichPlatform
Talend Studio
task
Data Governance
Data Quality and Preparation
Design and Development

In this scenario, you create Spark Batch Jobs. The key components to be used are as follows:

  • tModelEncoder: several tModelEncoder components are used to transform given SMS text messages into feature sets.

  • tRandomForestModel: it analyzes the features incoming from tModelEncoder to build a classification model that learns what a junk message or a normal message looks like.

  • tClassify: in a new Job, it applies this classification model to process a new set of SMS text messages, classifying them as spam or normal. In this scenario, the result of this classification is used to evaluate the accuracy of the model, since the classification of the messages processed by tClassify is already known and explicitly marked.

  • A configuration component such as tHDFSConfiguration in each Job: this component is used to connect to the file system to which the jar files the Job depends on are transferred during the execution of the Job.

    This file-system-related configuration component is required unless you run your Spark Jobs in the Local mode.
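Since the classification of the test messages is already known, the model's accuracy can be evaluated by comparing predicted labels against the known ones. The following is a minimal sketch of that comparison; the "spam"/"ham" label values and the helper function are illustrative assumptions, not part of the Talend components.

```python
# Hedged sketch: evaluating classification accuracy when the true labels
# of the processed messages are already known, as in this scenario.
# The label values and sample data below are assumptions for illustration.

def accuracy(true_labels, predicted_labels):
    """Fraction of messages whose predicted class matches the known class."""
    matches = sum(1 for t, p in zip(true_labels, predicted_labels) if t == p)
    return matches / len(true_labels)

known = ["spam", "ham", "ham", "spam"]
predicted = ["spam", "ham", "spam", "spam"]
print(accuracy(known, predicted))  # 0.75
```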

Prerequisites:

  • Two sets of SMS text messages: one is used to train classification models and the other is used to evaluate the created models. You can download the train set from trainingSet.zip and the test set from testSet.zip.

    Talend created these two sets out of the dataset downloadable from https://archive.ics.uci.edu/ml/datasets/SMS+Spam+Collection, by using this dataSet_preparation Job to add 3 feature columns (number of currency symbols, number of numeric values and number of exclamation marks) to the raw dataset and proportionally split the dataset.

    An example of the junk messages reads as follows:

    Free entry in 2 a wkly comp to win FA Cup final tkts 21st May 2005. Text FA to 87121 to receive entry question(std txt rate)T&C's apply 08452810075over18's

    An example of the normal messages reads as follows:

    Ahhh. Work. I vaguely remember that! What does it feel like? Lol

    Note that the new features added to the raw dataset were discovered by observing the junk messages used specifically in this scenario (these junk messages often contain prices and/or exclamation marks), and so cannot be generalized to any junk messages you want to analyze. In addition, the dataset was randomly split into two sets and used as is, but in real-world practice you would typically continue to preprocess it using methods such as dataset balancing in order to better train your classification model.
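The three engineered features can be sketched in a few lines. The exact character classes used by the dataSet_preparation Job are not documented here, so the patterns below are illustrative assumptions.

```python
import re

# Hedged sketch of the three feature columns described above: the number of
# currency symbols, numeric values and exclamation marks in a message.
# The currency symbol set and the digit-run pattern are assumptions.

def extract_features(message):
    num_currency = len(re.findall(r"[$£€]", message))
    num_numeric = len(re.findall(r"\d+", message))  # runs of digits
    num_exclamation = message.count("!")
    return num_currency, num_numeric, num_exclamation

spam = "Free entry in 2 a wkly comp to win FA Cup final tkts 21st May 2005."
print(extract_features(spam))  # (0, 3, 0)
```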

  • The two sets must be stored on the machine where the Job is going to be executed, for example in the HDFS system of your Yarn cluster if you use the Spark Yarn client mode to run Talend Spark Jobs. You must also have appropriate rights and permissions to read data from and write data to this system.

    In this scenario, the Spark Yarn client mode will be used and the datasets are stored in the associated HDFS system.

  • The Spark cluster to be used must have been properly set up and be running.

Creating a classification model using Random Forest

Linking the components
  1. In the Integration perspective of the Studio, create an empty Spark Batch Job, named rf_model_creation for example, from the Job Designs node in the Repository tree view.

    For further information about how to create a Spark Batch Job, see the Getting Started Guide of the Studio.

  2. In the workspace, enter the name of the component to be used and select this component from the list that appears. In this scenario, the components are tHDFSConfiguration, tFileInputDelimited, tRandomForestModel, and four tModelEncoder components.

    It is recommended to give the four tModelEncoder components different labels so that you can easily recognize the task each of them completes. In this scenario, they are labelled Tokenize, tf, tf_idf and features_assembler, respectively.

  3. Except for tHDFSConfiguration, connect the other components using the Row > Main link as displayed in the image above.

Configuring the connection to the file system to be used by Spark
  1. Double-click tHDFSConfiguration to open its Component view. Note that tHDFSConfiguration is used because the Spark Yarn client mode is used to run Spark Jobs in this scenario.

    Spark uses this component to connect to the HDFS system to which the jar files the Job depends on are transferred.

  2. In the Version area, select the Hadoop distribution you need to connect to and its version.

  3. In the NameNode URI field, enter the location of the machine hosting the NameNode service of the cluster.

  4. In the Username field, enter the authentication information used to connect to the HDFS system to be used. Note that the user name must be the same as the one you have entered in the Spark configuration tab.

Loading the training set into the Job
  1. Double-click tFileInputDelimited to open its Component view.

  2. Select the Define a storage configuration component check box and select the tHDFSConfiguration component to be used.

    tFileInputDelimited uses this configuration to access the training set to be used.

  3. Click the [...] button next to Edit schema to open the schema editor.

  4. Click the [+] button five times to add five rows and in the Column column, rename them to label, sms_contents, num_currency, num_numeric and num_exclamation, respectively.

    The label and sms_contents columns carry the raw data, which is composed of the SMS text messages in the sms_contents column and the labels indicating whether a message is spam in the label column.

    The other columns are used to carry the features added to the raw datasets as explained previously in this scenario. These three features are the number of currency symbols, the number of numeric values and the number of exclamation marks found in each SMS message.

  5. In the Type column, select Integer for the num_currency, num_numeric and num_exclamation columns.

  6. Click OK to validate these changes.

  7. In the Folder/File field, enter the directory where the training set to be used is stored.

  8. In the Field separator field, enter \t, which is the separator used by the datasets you can download for use in this scenario.
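The steps above amount to reading a tab-separated file into the five-column schema. Here is a minimal sketch of that parsing; the inline sample string stands in for the HDFS file, and the column names mirror the schema defined in this scenario.

```python
import csv
import io

# Hedged sketch of what tFileInputDelimited does here: parse tab-separated
# records into the five-column schema defined above. The sample data is an
# assumption standing in for the downloaded training set.

sample = (
    "ham\tAhhh. Work. I vaguely remember that!\t0\t0\t1\n"
    "spam\tFree entry in 2 a wkly comp\t0\t1\t0\n"
)

columns = ["label", "sms_contents", "num_currency", "num_numeric", "num_exclamation"]
rows = []
for record in csv.reader(io.StringIO(sample), delimiter="\t"):
    row = dict(zip(columns, record))
    # The last three columns were declared as Integer in the schema.
    for col in columns[2:]:
        row[col] = int(row[col])
    rows.append(row)

print(rows[0]["label"], rows[0]["num_exclamation"])  # ham 1
```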

Transforming SMS text messages to feature vectors using tModelEncoder

This step is meant to implement the feature engineering process.

Transforming messages to words

  1. Double-click the tModelEncoder component labelled Tokenize to open its Component view. This component tokenizes the SMS messages into words.

  2. Click the Sync columns button to retrieve the schema from the preceding component.

  3. Click the [...] button next to Edit schema to open the schema editor.

  4. On the output side, click the [+] button to add one row and in the Column column, rename it to sms_tokenizer_words. This column is used to carry the tokenized messages.

  5. In the Type column, select Object for this sms_tokenizer_words row.

  6. Click OK to validate these changes.

  7. In the Transformations table, add one row by clicking the [+] button and then proceed as follows:

    • In the Input column column, select the column that provides data to be transformed to features. In this scenario, it is sms_contents.

    • In the Output column column, select the column that carries the features. In this scenario, it is sms_tokenizer_words.

    • In the Transformation column, select the algorithm to be used for the transformation. In this scenario, it is Regex tokenizer.

    • In the Parameters column, enter the parameters you want to customize for use in the algorithm you have selected. In this scenario, enter pattern=\\W;minTokenLength=3.

    Using this transformation, tModelEncoder splits each input message at non-word characters (the \W pattern), selects only the words containing at least 3 letters, and puts the result of the transformation in the sms_tokenizer_words column. Thus currency symbols, numeric values, punctuation marks and short words such as a, an or to are excluded from this column.
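The Regex tokenizer configured above can be approximated as follows. Lower-casing the tokens mirrors the default behavior of Spark's RegexTokenizer, which backs this transformation; treating it as enabled here is an assumption.

```python
import re

# Hedged sketch of the Regex tokenizer transformation with
# pattern=\W;minTokenLength=3: split on non-word characters and keep only
# tokens of at least 3 characters. Lower-casing is assumed, matching the
# default of the underlying Spark RegexTokenizer.

def regex_tokenize(message, pattern=r"\W", min_token_length=3):
    tokens = re.split(pattern, message.lower())
    return [t for t in tokens if len(t) >= min_token_length]

ham = "Ahhh. Work. I vaguely remember that! What does it feel like? Lol"
print(regex_tokenize(ham))
```

Short words such as "a" or "to" fall below the 3-character threshold and are dropped, as described above.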

Calculating the weight of a word in each message

  1. Double-click the tModelEncoder component labelled tf to open its Component view.

  2. Repeat the operations described previously for the tModelEncoder labelled Tokenize to add the sms_tf_vect column of the Vector type to the output schema and define the transformation as displayed in the image above.
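The term-frequency step performed by the tf-labelled tModelEncoder turns each message's word list into a fixed-size count vector, in the spirit of Spark's HashingTF. The tiny vector size and Python's built-in hash() below are simplifications for illustration, not the actual hashing scheme.

```python
# Hedged sketch of a HashingTF-style term-frequency transformation: each
# word is bucketed by hash and its bucket count incremented. The vector
# size of 16 and the use of built-in hash() are illustrative assumptions.

def hashing_tf(words, num_features=16):
    vector = [0] * num_features
    for word in words:
        vector[hash(word) % num_features] += 1  # bucket each word by hash
    return vector

tokens = ["free", "entry", "win", "free"]
vec = hashing_tf(tokens)
print(sum(vec))  # 4: one count per token, with "free" counted twice
```

Because repeated words land in the same bucket, their counts accumulate, which is what lets the later TF-IDF step weight frequent terms.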