Running the decision tree model using test data

Machine Learning

Version

7.3

Language

English

Product

Talend Big Data

Talend Big Data Platform

Talend Data Fabric

Talend Real-Time Big Data Platform

Module

Talend Studio

Last publication date

2024-02-21

This section explains how to test your decision tree model and examine how it predicts the target variable.

Procedure

Create a new Big Data batch Job specifying Spark as the framework.
Copy the tHDFSConfiguration from the previous Job and paste it in the design workspace.
Copy the tFileInputDelimited from the previous Job and paste it in the design workspace.
In tFileInputDelimited, change the Folder/File value to point to the testing data.
The test data has the same schema as the training data. Only the row contents and the number of rows differ.
Add a tPredict component to the design workspace. Connect tFileInputDelimited to tPredict with a Main link.
Double-click tPredict.
Select the Define a storage configuration component check box and choose tHDFSConfiguration.
Choose Decision Tree Model as Model Type.
Add the path to the model you created in the previous section.
Click the Sync columns button, then click the ellipsis to edit the schema.
In the output panel, a new column named label has been added. It holds the value predicted by the decision tree model.
Add a tReplace to the design workspace and connect tPredict to it with a Main link.
Configure tReplace as follows.

tReplace converts the prediction output of tPredict from its numeric boolean representation (0.0, 1.0) to the representation used in the testing data (yes/no).
Add a tAggregateRow and connect tReplace to it with a Main link.
Configure tAggregateRow as follows.
The Output column in the Operations section is arbitrary. age was chosen only because the count operation needs a column to count for each Group by combination.

tAggregateRow produces the summary statistics of model performance that are used in the next section.
Add a tLogRow to the design workspace and connect tAggregateRow to it with a Main link.

Your Job should look as follows.
Run the Job.
As with the training Job you previously created, you can run this Job either locally or on the cluster.
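
The conversion and aggregation steps of this Job can be pictured with a small plain-Python sketch. It is not Talend or Spark code and does not reproduce the actual component configuration. The sample rows are invented for illustration, and the mapping of 1.0 to yes and 0.0 to no is an assumption about how tReplace is configured.

```python
from collections import Counter

# Assumed mapping from the numeric label produced by tPredict
# to the yes/no encoding used in the testing data.
LABEL_TO_TEXT = {0.0: "no", 1.0: "yes"}


def summarize(rows):
    """Count rows per (actual conversion, predicted label) pair."""
    counts = Counter()
    for row in rows:
        predicted = LABEL_TO_TEXT[row["label"]]      # role played by tReplace
        counts[(row["conversion"], predicted)] += 1  # role played by tAggregateRow
    return counts


# Hypothetical rows, invented for illustration only.
sample_rows = [
    {"age": 34, "conversion": "no", "label": 0.0},
    {"age": 51, "conversion": "yes", "label": 0.0},
    {"age": 27, "conversion": "no", "label": 0.0},
    {"age": 45, "conversion": "yes", "label": 1.0},
]

summary = summarize(sample_rows)
for (actual, predicted), count in sorted(summary.items()):
    print(count, actual, predicted)
```

Each printed line corresponds to one row of the summary that tLogRow displays: a count, the actual outcome, and the predicted outcome.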

Results

The expected outcome of this Job is a tabular summary that compares the model's predictions with the actual outcomes.

count (age)	conversion (actual outcome)	label (predicted outcome)
41	yes	no
12	no	yes
15	yes	yes
446	no	no

For a total of 514 test records, the output shows the following:

For 41 test cases, the model predicted conversion = no but the actual outcome was yes (incorrect).
For 12 test cases, the model predicted conversion = yes but the actual outcome was no (incorrect).
For 15 test cases, the model correctly predicted conversion = yes.
For 446 test cases, the model correctly predicted conversion = no.
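
As a rough illustration of how these four counts can be turned into summary figures, the following Python sketch computes the total, the accuracy, and the precision and recall for the yes class. Treating yes as the positive class is an assumption made here for the example. The source does not state which metrics are used in the next section.

```python
# Counts copied from the results table, keyed by
# (actual conversion, predicted label).
counts = {
    ("yes", "no"): 41,
    ("no", "yes"): 12,
    ("yes", "yes"): 15,
    ("no", "no"): 446,
}

total = sum(counts.values())
correct = counts[("yes", "yes")] + counts[("no", "no")]
accuracy = correct / total

# Assumption: "yes" is treated as the positive class.
true_pos = counts[("yes", "yes")]
false_pos = counts[("no", "yes")]
false_neg = counts[("yes", "no")]
precision = true_pos / (true_pos + false_pos)
recall = true_pos / (true_pos + false_neg)

print(f"total={total} accuracy={accuracy:.3f} "
      f"precision={precision:.3f} recall={recall:.3f}")
```

With these counts, the accuracy is about 0.897 (461 of 514), while the recall for yes is about 0.268 (15 of 56), which suggests that the high accuracy comes mostly from the large number of no records.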