Loading the test set into the Job - 7.0

Machine Learning

author
Talend Documentation Team
EnrichVersion
7.0
EnrichProdName
Talend Big Data
Talend Big Data Platform
Talend Data Fabric
Talend Real-Time Big Data Platform
task
Data Governance > Third-party systems > Machine Learning components
Data Quality and Preparation > Third-party systems > Machine Learning components
Design and Development > Third-party systems > Machine Learning components
EnrichPlatform
Talend Studio

Procedure

  1. Double-click tFileInputDelimited to open its Component view.
  2. Select the Define a storage configuration component check box and select the tHDFSConfiguration component to be used.
    tFileInputDelimited uses this configuration to access the test set to be used.
  3. Click the [...] button next to Edit schema to open the schema editor.
  4. Click the [+] button five times to add five rows and, in the Column column, rename them to reallabel, sms_contents, num_currency, num_numeric and num_exclamation, respectively.
    The reallabel and sms_contents columns carry the raw data: sms_contents holds the SMS text messages and reallabel holds the labels indicating whether each message is spam.
    The other columns carry the features added to the raw datasets, as explained previously in this scenario. They contain the number of currency symbols, the number of numeric values and the number of exclamation marks found in each SMS message.
  5. In the Type column, select Integer for the num_currency, num_numeric and num_exclamation columns.
  6. Click OK to validate these changes.
  7. In the Folder/File field, enter the directory where the test set to be used is stored.
  8. In the Field separator field, enter \t, which is the separator used by the datasets you can download for use in this scenario.
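
To check a test file against this schema outside the Studio, the following minimal Python sketch reads tab-separated rows into the five columns defined above. It is an illustration only, not what tFileInputDelimited does internally. The SmsRecord type, the parse_test_set function and the two sample rows (including the ham and spam label values) are hypothetical and are not taken from the scenario's datasets.

```python
import csv
import io
from typing import NamedTuple


class SmsRecord(NamedTuple):
    """One row of the test set, mirroring the schema defined in the procedure."""
    reallabel: str        # label column (String)
    sms_contents: str     # raw SMS text (String)
    num_currency: int     # feature column (Integer)
    num_numeric: int      # feature column (Integer)
    num_exclamation: int  # feature column (Integer)


def parse_test_set(lines):
    """Parse tab-separated lines into SmsRecord rows (hypothetical helper)."""
    # QUOTE_NONE keeps quote characters in the SMS text as literal characters.
    reader = csv.reader(lines, delimiter="\t", quoting=csv.QUOTE_NONE)
    records = []
    for row in reader:
        if not row:
            continue  # skip empty lines
        label, text, currency, numeric, exclamation = row
        records.append(
            SmsRecord(label, text, int(currency), int(numeric), int(exclamation))
        )
    return records


# Hypothetical sample rows, invented for illustration only.
sample = io.StringIO(
    "ham\tSee you at 5 tonight\t0\t1\t0\n"
    "spam\tWin $1000 now!!\t1\t1\t2\n"
)
records = parse_test_set(sample)
for record in records:
    print(record)
```

A row that does not have exactly five fields, or whose feature columns are not integers, raises a ValueError, which makes schema mismatches visible before the Job is run.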