Encoding training data - 7.3

Machine Learning

Talend Big Data
Talend Big Data Platform
Talend Data Fabric
Talend Real-Time Big Data Platform
Talend Studio


  1. Add a tModelEncoder component to the right of tFileInputDelimited.
  2. Connect tFileInputDelimited to tModelEncoder with a Row > Main link.
  3. Double-click tModelEncoder and choose the Component view.
  4. Click Sync columns to the right of Schema.
  5. Click the [...] button next to Edit schema.
  6. Add two new columns to the output: MyFeatures with the type Vector and MyLabels with the type Double.
  7. Click OK.
  8. Click the green arrow in the Basic settings tab of the Component view to add a new transformation.
  9. Under Transformation, choose RFormula (Spark 1.5+).
  10. Enter the following in the Parameters field:
    featuresCol=MyFeatures;labelCol=MyLabels;formula=conversion ~ age + jobtype + maritalstatus + educationlevel + indefault + hasmortgage + haspersonalloan + numcampaigncalls + priorcampaignoutcome

    The two columns added to the schema, MyFeatures and MyLabels, are referenced here. The formula uses the standard model-formula syntax of R, a programming language for statistical computing and graphics. For more information, see The R Project.

    The sample data contains nine features and one target. In the R formula above, the target you want to predict, conversion, is on the left of the tilde (~). All columns to the right of the tilde are the features. The two remaining parameters, featuresCol and labelCol, name the output columns that receive the encoded feature vector and the label, respectively.
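
To make the structure of the Parameters value easier to see, the following Python sketch splits it into its three settings and then separates the formula into the target and the features. It is only an illustration: the helper names parse_parameters and split_formula are hypothetical, and the code is not how tModelEncoder or Spark actually parses the string. It assumes the settings are separated by semicolons and the formula contains only simple terms joined by +.

```python
# Illustrative sketch only: NOT the parser used by tModelEncoder or Spark.
# It assumes 'key=value' settings separated by ';' and a simple 'y ~ a + b' formula.

PARAMETERS = (
    "featuresCol=MyFeatures;labelCol=MyLabels;"
    "formula=conversion ~ age + jobtype + maritalstatus + educationlevel"
    " + indefault + hasmortgage + haspersonalloan + numcampaigncalls"
    " + priorcampaignoutcome"
)


def parse_parameters(text):
    """Split 'key=value;key=value' into a dict (hypothetical helper)."""
    params = {}
    for pair in text.split(";"):
        # Split on the first '=' only, so the value is kept intact.
        key, _, value = pair.partition("=")
        params[key.strip()] = value.strip()
    return params


def split_formula(formula):
    """Return (target, [features]) for a simple 'y ~ a + b' formula."""
    target, _, terms = formula.partition("~")
    features = [term.strip() for term in terms.split("+") if term.strip()]
    return target.strip(), features


params = parse_parameters(PARAMETERS)
target, features = split_formula(params["formula"])

print("target:", target)
print("number of features:", len(features))
print("features column:", params["featuresCol"])
print("label column:", params["labelCol"])
```

Running the sketch shows conversion as the target and nine feature columns, with MyFeatures and MyLabels as the two output columns added to the schema in step 6.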