Prerequisites - 7.3

Machine Learning

Version: 7.3
Language: English
Product: Talend Big Data, Talend Big Data Platform, Talend Data Fabric, Talend Real-Time Big Data Platform
Module: Talend Studio
Last publication date: 2024-02-21
  • Download the sets of SMS text messages from the Downloads tab in the left panel of this page:
    • The set used to train the classification models: trainingSet.zip
    • The set used to evaluate the created models: testSet.zip

    Talend created these two sets from the dataset available at https://archive.ics.uci.edu/ml/datasets/SMS+Spam+Collection. The dataset preparation Job (dataset_preparation.zip) was used to add three feature columns to the raw dataset (the number of currency symbols, the number of numeric values, and the number of exclamation marks) and then to split the dataset proportionally.

    An example of the junk messages reads as follows:
    Free entry in 2 a wkly comp to win FA Cup final tkts 21st May 2005. Text FA to 87121 to receive entry question(std txt rate)T&C's apply 08452810075over18's
    An example of the normal messages reads as follows:
    Ahhh. Work. I vaguely remember that! What does it feel like? Lol

    Note that the new features added to the raw dataset were chosen by observing the junk messages used specifically in this scenario, which often contain prices and/or exclamation marks, so they cannot be generalized to every kind of junk message you may want to analyze. In addition, the dataset was randomly split into two sets and used as is; in real-world practice, you can continue to preprocess the sets with other methods, such as dataset balancing, to better train your classification model.
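    The three feature columns and the split described above can be sketched in Python as follows. This is only an illustration, not the dataset preparation Job itself. The function names, the set of currency symbols, the definition of a "numeric value" as a run of digits, the label-stratified reading of "proportionally split", and the 80/20 ratio are all assumptions made for this example.

```python
import random
import re
from collections import defaultdict

# Assumption: which characters count as currency symbols is a guess.
CURRENCY_SYMBOLS = "$£€¥"
# Assumption: a "numeric value" is any maximal run of digits.
NUMBER_PATTERN = re.compile(r"\d+")


def extract_features(message):
    """Return (currency symbols, numeric values, exclamation marks) counts."""
    num_currency = sum(message.count(symbol) for symbol in CURRENCY_SYMBOLS)
    num_numeric = len(NUMBER_PATTERN.findall(message))
    num_exclamation = message.count("!")
    return num_currency, num_numeric, num_exclamation


def stratified_split(rows, train_fraction=0.8, seed=42):
    """Randomly split (label, message) rows while keeping label proportions."""
    by_label = defaultdict(list)
    for row in rows:
        by_label[row[0]].append(row)
    rng = random.Random(seed)
    train, test = [], []
    for label_rows in by_label.values():
        shuffled = label_rows[:]
        rng.shuffle(shuffled)
        cut = int(len(shuffled) * train_fraction)
        train.extend(shuffled[:cut])
        test.extend(shuffled[cut:])
    return train, test


# The two sample messages quoted in this section.
junk = ("Free entry in 2 a wkly comp to win FA Cup final tkts 21st May 2005. "
        "Text FA to 87121 to receive entry question(std txt rate)T&C's apply "
        "08452810075over18's")
normal = "Ahhh. Work. I vaguely remember that! What does it feel like? Lol"

print(extract_features(junk))    # (0, 6, 0)
print(extract_features(normal))  # (0, 0, 1)
```

    Under these assumptions, the sample junk message stands out only through its numeric values, which illustrates why features chosen this way are specific to the dataset at hand.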

  • The two sets must be stored on the machine where the Job is going to be executed, for example in the HDFS system of your Yarn cluster if you use the Spark Yarn client mode to run Talend Spark Jobs. You must also have the appropriate rights and permissions to read data from and write data to this system.

    In this scenario, the Spark Yarn client mode is used and the datasets are stored in the associated HDFS system.

  • The Spark cluster to be used must be properly set up and running.
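
One possible way to place the two sets on HDFS, as required above, is to drive the standard `hdfs dfs` command-line client from a small Python script. The sketch below only builds and prints the commands. The local paths and the HDFS target directory are hypothetical placeholders, and the layout of the extracted archives is not specified here, so adjust them to your environment.

```python
import subprocess


def build_upload_commands(local_paths, hdfs_dir):
    """Build the hdfs dfs commands that create a directory, upload files, and list it."""
    commands = [["hdfs", "dfs", "-mkdir", "-p", hdfs_dir]]
    for local_path in local_paths:
        # -f overwrites a destination file that already exists.
        commands.append(["hdfs", "dfs", "-put", "-f", local_path, hdfs_dir])
    commands.append(["hdfs", "dfs", "-ls", hdfs_dir])
    return commands


def run_commands(commands):
    """Run each command in order and stop at the first failure."""
    for command in commands:
        subprocess.run(command, check=True)


# Hypothetical locations: replace them with your own paths.
commands = build_upload_commands(
    ["/tmp/sms/trainingSet", "/tmp/sms/testSet"],
    "/user/talend/sms",
)
for command in commands:
    print(" ".join(command))
```

Calling `run_commands(commands)` on a machine where the `hdfs` client is installed and configured for your cluster would execute the upload. A failing command, for example because of missing write permissions, raises `subprocess.CalledProcessError`.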