Procedure
- Add a tModelEncoder component to the right of tFileInputDelimited.
-
Connect tFileInputDelimited to tModelEncoder with a Main.
- Double-click tModelEncoder and choose the Component view.
- Click Sync columns to the right of Schema.
- Click on the ellipses to Edit Schema.
-
Add two new columns to the output: MyFeatures with the
type Vector and MyLabels with the
type Double.
- Click OK.
- Click the green arrow in the Basic settings tab of the Component view to add a new transformation.
- Under Transformation, choose RFormula (Spark 1.5+).
-
Add the following code in the Parameters field.
featuresCol=MyFeatures;labelCol=MyLabels;formula=conversion ~ age + jobtype + maritalstatus + educationlevel + indefault + hasmortgage + haspersonalloan + numcampaigncalls + priorcampaignoutcome
The two columns added to the schema,
MyFeatures
andMyLabels
are referenced here. The formula is standard syntax used in the programming language R, which is used for statistical computing and advanced graphics. For more information, see The R Project.In the sampling of the data, there were nine features and one target. In the R formula above, the target you want to predict is conversion, and it is on the left of the tilde. All columns to the right of the tilde are the features. the two remaining components,
featuresCol
andlabelCol
, are placeholders for the tuples and the feature labels.