Creating a classification model to filter spam - 6.4

Machine Learning

author
Talend Documentation Team
EnrichVersion
6.4
EnrichProdName
Talend Big Data
Talend Big Data Platform
Talend Data Fabric
Talend Real-Time Big Data Platform
task
Data Governance > Third-party systems > Machine Learning components
Data Quality and Preparation > Third-party systems > Machine Learning components
Design and Development > Third-party systems > Machine Learning components
EnrichPlatform
Talend Studio

This scenario applies only to subscription-based Talend Platform products with Big Data and Talend Data Fabric.

For more technologies supported by Talend, see Talend components.

In this scenario, you create Spark Batch Jobs. The key components to be used are as follows:
  • tModelEncoder: several tModelEncoder components are used to transform given SMS text messages into feature sets.

  • tRandomForestModel: it analyzes the features incoming from tModelEncoder to build a classification model that understands what a junk message or a normal message could look like.

  • tClassify: in a new Job, it applies this classification model to process a new set of SMS text messages to classify the spam and the normal messages. In this scenario, the result of this classification is used to evaluate the accuracy of the model, since the classification of the messages processed by tClassify is already known and explicitly marked.

  • A configuration component such as tHDFSConfiguration in each Job: this component is used to connect to the file system to which the jar files dependent on the Job are transferred during the execution of the Job.

    This file-system-related configuration component is required unless you run your Spark Jobs in the Local mode.