This section explains how to test your decision tree model and examine how it predicts
the target variable.
Procedure
-
Create a new Big Data Batch Job specifying Spark as the framework.
-
From the previous Job, copy-paste tHDFSConfiguration and
tFileInputDelimited.
-
In tFileInputDelimited, change the Folder/File value to point to the testing data.
The test data has the same schema as the training data. The only differences are the content details and the number of rows.
-
Add a tPredict component to the workspace. Connect
tFileInputDelimited to tPredict with
a Main row.
-
Double-click tPredict to open the Basic
settings.
-
Select the Define a storage configuration component check box and choose tHDFSConfiguration.
-
In Model Type, choose Decision Tree
Model.
-
Add the path to the model you created in the previous section.
-
Click the Sync columns button, then click the
[...] button to edit the schema.
The output panel adds a new column named label. This is the placeholder for the
predicted value produced by the decision model.
-
Add a tReplace to the workspace and connect
tPredict to it with a Main
row.
-
Configure tReplace as follows.
The tReplace is needed to convert the prediction output
from tPredict from a boolean representation (0.0,10)
to the representation of the testing data (yes/no).
-
Add a tAggregateRow and connect
tReplace to it with a Main
row.
tAggregateRow is used to create summary statistics of model
performances used in the next section.
-
Configure tAggregateRow as follows.
The Output column in the Operations
section was chosen randomly. age was not chosen for any
particular reason other than facilitating a count for the Group
by.
-
Add a tLogRow to the workspace and connect
tAggregateRow to it with a Main
row.
Here is the Job configuration.
-
Run the Job.
As for the training Job you previously created, you can run this Job either locally or on the cluster.
Results
The expected outcome of this Job is a tabular summary that demonstrates model prediction versus the actual true outcome.
count (age) |
conversion (actual outcome) |
label (predicted outcome) |
41 |
yes |
no |
12 |
no |
yes |
15 |
yes |
yes |
446 |
no |
no |
For a total of 514 test records, the output says the following:
- The model incorrectly predicted:
- (conversion = no) as true for 41 of the test cases
- (conversion = no) as false for 12 of the test cases
- The model accurately predicted:
- (conversion = no) as false for 15 of the test cases
- (conversion = no) as true for 446 of the test cases