Encoding training data - Cloud - 8.0

Machine Learning

Version
Cloud
8.0
Language
English
Product
Talend Big Data
Talend Big Data Platform
Talend Data Fabric
Talend Real-Time Big Data Platform
Module
Talend Studio
Last publication date
2024-02-20

Procedure

  1. Add a tModelEncoder component after tFileInputDelimited.
  2. Connect tFileInputDelimited to tModelEncoder with a Main row.
  3. Double-click tModelEncoder and choose the Component view.
  4. Click Sync columns to the right of Schema.
  5. Click the [...] button to the right of Edit Schema.
  6. Add two new columns to the output: MyFeatures with the type Vector and MyLabels with the type Double.
  7. Click OK.
  8. In the Basic settings tab of the Component view, click the button to add a new transformation.
  9. Under Transformation, choose RFormula (Spark 1.5+).
  10. Add the following code in the Parameters field.
    featuresCol=MyFeatures;labelCol=MyLabels;formula=conversion ~ age + jobtype + maritalstatus + educationlevel + indefault + hasmortgage + haspersonalloan + numcampaigncalls + priorcampaignoutcome

    The two columns added to the schema, MyFeatures and MyLabels, are referenced here. The formula uses standard syntax from R, a programming language for statistical computing and advanced graphics. For more information, see The R Project.

    The sample data has nine features and one target. In the R formula above, the target you want to predict, conversion, is on the left of the tilde (~). All columns to the right of the tilde are the features. The two remaining parameters, featuresCol and labelCol, name the output columns that receive the encoded feature vector (MyFeatures) and the label (MyLabels).
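
To make the structure of the Parameters string from the procedure above easier to check, here is a minimal Python sketch that splits it into its key=value entries and separates the formula into target and features. It is not Talend or Spark code: the helper names parse_parameters and split_formula are hypothetical, and the formula handling assumes only the simple "target ~ a + b" form used here (no other R formula operators).

```python
# Illustrative only: hypothetical helpers, not part of Talend Studio or Spark.
# The string below is the value entered in the Parameters field in step 10.
PARAMS = (
    "featuresCol=MyFeatures;labelCol=MyLabels;"
    "formula=conversion ~ age + jobtype + maritalstatus + educationlevel"
    " + indefault + hasmortgage + haspersonalloan + numcampaigncalls"
    " + priorcampaignoutcome"
)


def parse_parameters(text):
    """Split a 'key=value;key=value' string into a dict."""
    params = {}
    for entry in text.split(";"):
        entry = entry.strip()
        if not entry:
            continue
        # Split on the first '=' only, so the value is kept whole.
        key, _, value = entry.partition("=")
        params[key.strip()] = value.strip()
    return params


def split_formula(formula):
    """Return (target, [features]) for a simple 'y ~ a + b' formula.

    Assumption: terms are joined with '+' only.
    """
    target, _, rhs = formula.partition("~")
    features = [term.strip() for term in rhs.split("+") if term.strip()]
    return target.strip(), features


params = parse_parameters(PARAMS)
target, features = split_formula(params["formula"])

print("features column:", params["featuresCol"])
print("label column:", params["labelCol"])
print("target:", target)
print("feature count:", len(features))
```

Running the sketch is a quick way to confirm that the formula lists all nine feature columns and that featuresCol and labelCol match the column names added to the schema in step 6.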