Two sets of SMS text messages: one is used to train classification models and the other is used to evaluate the created models. You can download the training set from trainingSet.zip and the test set from testSet.zip.
Talend created these two sets out of the dataset downloadable from https://archive.ics.uci.edu/ml/datasets/SMS+Spam+Collection, by using this dataSet_preparation Job to add 3 feature columns (number of currency symbols, number of numeric values and number of exclamation marks) to the raw dataset and proportionally split the dataset.
An example of the junk messages reads as follows:
Free entry in 2 a wkly comp to win FA Cup final tkts 21st May 2005. Text FA to 87121 to receive entry question(std txt rate)T&C's apply 08452810075over18's
An example of the normal messages reads as follows:
Ahhh. Work. I vaguely remember that! What does it feel like? Lol
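To make the three added feature columns concrete, here is a minimal Python sketch that counts them for a single message. It is not the logic of the dataSet_preparation Job, which is not shown here. The function name, the column names, the set of currency symbols, and the definition of a "numeric value" as an unbroken run of digits are all assumptions made for illustration.

```python
import re

# Assumed symbol set: the source does not list which currency symbols were counted.
CURRENCY_SYMBOLS = "$£€¥"


def extract_features(message: str) -> dict:
    """Count the three hand-crafted features for one SMS message."""
    return {
        "num_currency": sum(message.count(symbol) for symbol in CURRENCY_SYMBOLS),
        # Assumption: a "numeric value" is a maximal run of consecutive digits.
        "num_numeric": len(re.findall(r"\d+", message)),
        "num_exclamation": message.count("!"),
    }


junk = (
    "Free entry in 2 a wkly comp to win FA Cup final tkts 21st May 2005. "
    "Text FA to 87121 to receive entry question(std txt rate)T&C's apply "
    "08452810075over18's"
)
normal = "Ahhh. Work. I vaguely remember that! What does it feel like? Lol"

print(extract_features(junk))
print(extract_features(normal))
```

Under these assumptions, the junk example contains six digit runs and the normal example contains one exclamation mark. A different definition of "numeric value" would give different counts.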
Note that the features added to the raw dataset were chosen after observing the junk messages used in this scenario, which often contain prices and/or exclamation marks, so they may not generalize to other junk messages you want to analyze. In addition, the dataset was randomly split into two sets and used as is. In real-world practice, you can preprocess the sets further, for example by balancing the dataset, in order to better train your classification model.
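A random split that keeps the proportion of junk and normal messages similar in both sets can be sketched as follows. This is only an illustration in plain Python, not the way the dataSet_preparation Job performs the split. The function name `stratified_split`, the `test_fraction` value, the seed, and the toy rows are hypothetical, and the source does not state the ratio actually used.

```python
import random
from collections import defaultdict


def stratified_split(rows, label_of, test_fraction=0.3, seed=42):
    """Split rows into (train, test) so each label keeps roughly the same share in both."""
    by_label = defaultdict(list)
    for row in rows:
        by_label[label_of(row)].append(row)

    rng = random.Random(seed)  # fixed seed so the split is reproducible
    train, test = [], []
    for label in sorted(by_label):
        group = by_label[label]
        rng.shuffle(group)
        n_test = round(len(group) * test_fraction)
        test.extend(group[:n_test])
        train.extend(group[n_test:])
    return train, test


# Toy (label, text) rows to exercise the function; they are not taken from the dataset.
rows = [("ham", f"ham message {i}") for i in range(8)]
rows += [("spam", f"spam message {i}") for i in range(2)]

train, test = stratified_split(rows, label_of=lambda row: row[0], test_fraction=0.5)
print(len(train), len(test))
```

Splitting per label avoids a test set that happens to contain almost no junk messages. Balancing the classes, as mentioned above, would be a separate step and is not shown here.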
The two sets must be stored on the machine where the Job is going to be executed, for example in the HDFS system of your Yarn cluster if you use the Spark Yarn client mode to run Talend Spark Jobs. You must also have the appropriate rights and permissions to read data from and write data to this system.
In this scenario, the Spark Yarn client will be used and the datasets are stored in the associated HDFS system.
The Spark cluster to be used must be properly set up and running.