Download the sets of SMS text messages from the Downloads tab in the left panel of this page:
- The set used to train the classification models: trainingSet.zip
- The set used to evaluate the created models: testSet.zip
Talend created these two sets from the dataset downloadable from https://archive.ics.uci.edu/ml/datasets/SMS+Spam+Collection. The dataset preparation Job (dataset_preparation.zip) was used to add three feature columns to the raw dataset (number of currency symbols, number of numeric values, and number of exclamation marks) and to split the dataset proportionally.
An example of the junk messages reads as follows:
Free entry in 2 a wkly comp to win FA Cup final tkts 21st May 2005. Text FA to 87121 to receive entry question(std txt rate)T&C's apply 08452810075over18's
An example of the normal messages reads as follows:
Ahhh. Work. I vaguely remember that! What does it feel like? Lol
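The three feature columns described above could be computed along the following lines. This is only a minimal Python sketch, not the logic of the dataset preparation Job itself: the function name extract_features, the set of currency symbols counted, and the definition of a "numeric value" as a run of digits with an optional decimal part are all assumptions made for illustration.

```python
import re

# Assumption: the text does not say which currency symbols are counted.
CURRENCY_SYMBOLS = "£$€"

# Assumption: a "numeric value" is a run of digits, optionally with a decimal part.
NUMBER_PATTERN = re.compile(r"\d+(?:\.\d+)?")


def extract_features(message: str) -> dict:
    """Return the three illustrative feature counts for one SMS message."""
    return {
        # Total occurrences of any of the assumed currency symbols.
        "num_currency": sum(message.count(symbol) for symbol in CURRENCY_SYMBOLS),
        # Number of non-overlapping digit runs found in the message.
        "num_numeric": len(NUMBER_PATTERN.findall(message)),
        # Number of exclamation marks.
        "num_exclamation": message.count("!"),
    }


if __name__ == "__main__":
    print(extract_features("Ahhh. Work. I vaguely remember that! What does it feel like? Lol"))
```

Under these assumptions, junk messages such as the example above tend to score high on the numeric count, while normal messages usually score low on all three columns.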
Note that the features added to the raw dataset were chosen by observing the junk messages used in this scenario, which often contain prices and/or exclamation marks, so they cannot be generalized to every kind of junk message you might want to analyze. In addition, the dataset was randomly split into two sets and used as is; in real-world practice, you can continue to preprocess the sets using other methods, such as dataset balancing, to better train your classification model.
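As an illustration of the random split mentioned above, the following Python sketch shuffles the messages of each class separately and then cuts each class at the same fraction, so that both sets keep roughly the same junk-to-normal ratio. This is one possible reading of a proportional split and is not taken from the dataset preparation Job; the function name stratified_split, the 80/20 fraction, and the fixed seed are assumptions.

```python
import random
from collections import defaultdict


def stratified_split(rows, label_of, train_fraction=0.8, seed=42):
    """Randomly split rows into (train, test), preserving class proportions.

    rows           -- list of records (any type)
    label_of       -- function returning the class label of a record
    train_fraction -- assumed share of each class that goes to training
    seed           -- fixed seed so the split is reproducible
    """
    rng = random.Random(seed)

    # Group the records by class label.
    by_label = defaultdict(list)
    for row in rows:
        by_label[label_of(row)].append(row)

    train, test = [], []
    for label_rows in by_label.values():
        shuffled = label_rows[:]          # copy so the input is not modified
        rng.shuffle(shuffled)
        cut = round(len(shuffled) * train_fraction)
        train.extend(shuffled[:cut])
        test.extend(shuffled[cut:])
    return train, test


if __name__ == "__main__":
    data = [("spam", i) for i in range(10)] + [("ham", i) for i in range(40)]
    train_rows, test_rows = stratified_split(data, label_of=lambda r: r[0])
    print(len(train_rows), len(test_rows))
```

Dataset balancing, for example downsampling the majority class in the training set, could be applied to the training rows after such a split.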
The two sets must be stored on the machine where the Job is going to be executed, for example in the HDFS system of your Yarn cluster if you use the Spark Yarn client mode to run Talend Spark Jobs, and you must have the appropriate rights and permissions to read data from and write data to this system.
In this scenario, the Spark Yarn client mode is used and the datasets are stored in the associated HDFS system.
The Spark cluster to be used must be properly set up and running.