Machine Learning 101 - Decision Trees
This hands-on tutorial demonstrates the basics of developing a machine learning routine using Talend and Spark. Specifically, decision tree learning is used to classify real-life bank marketing data. Upon completion, you will have a working knowledge of how machine learning is integrated into a Talend workflow, along with some reusable code snippets.
The source data used in this tutorial was retrieved from the UCI Machine Learning Repository. Irvine, CA: University of California, School of Information and Computer Science. It is available in the public domain and is attributed to: [Moro et al., 2014] S. Moro, P. Cortez and P. Rita. "A Data-Driven Approach to Predict the Success of Bank Telemarketing." Decision Support Systems, Elsevier, 62:22-31, June 2014: Bank Marketing Data Set
Prerequisites
- You have Hortonworks Data Platform (HDP) 2.4 installed and configured. You can also use the Hortonworks Sandbox, a downloadable virtual machine (VM). For more information, see Hortonworks - Getting Started
- You have basic knowledge of the tools and technologies in the Hadoop ecosystem.
- You have basic knowledge of Hadoop Distributed File System (HDFS) and Spark.
- You have working knowledge of Talend Studio and Talend Big Data Platform.
- You have Talend Big Data Platform installed and configured. Any license level above this platform also works.
Creating a Spark Job for machine learning
- Open Talend Studio and expand Job Designs.
Right-click Big Data Batch and create a new job specifying Spark as the framework.
Creating a Hadoop cluster for machine learning
- Expand Metadata.
Right-click Hadoop Cluster and create a new cluster.
Specify a Linux OS user on the cluster.
Here, the user puccini was already created.
Training and test data used in this article have been slightly modified from the original source and pre-loaded into HDFS. Those data sets can be downloaded below.
Configure the HDFS connection as follows.
Sampling of Machine Learning Repository data
This section details a sample of the data used in this tutorial.
This tutorial is not intended to teach data science or to detail a formal data analysis, but it is helpful to see a sample of the data.
For more information about this dataset, see UCI Machine Learning Repository.
There are ten variables, nine independent and one dependent:
- Independent: age, jobtype, maritalstatus, educationlevel, indefault, hasmortgage, haspersonalloan, numcampaigncalls, priorcampaignoutcome
- Dependent: conversion
The independent variables, also known as feature variables, are used to predict an outcome. The dependent variable, or target variable, is what you want to predict. The sampling of data above demonstrates tuples that contain both features and a target variable, which are needed to train your decision tree model. This type of training is called supervised learning, because each training example contains both an input vector of features and a known output value.
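To make the idea concrete, a single labeled training example can be pictured as a pair of features and a known label. This is illustrative only; the values below are invented, not taken from the dataset.

```python
# One hypothetical training example: feature values paired with a known label.
# Column names follow the dataset described above; the values are illustrative.
example = {
    "features": {
        "age": 41, "jobtype": "technician", "maritalstatus": "married",
        "educationlevel": "secondary", "indefault": "no", "hasmortgage": "yes",
        "haspersonalloan": "no", "numcampaigncalls": 2,
        "priorcampaignoutcome": "failure",
    },
    "label": "no",  # the dependent variable, conversion
}

# Supervised learning: the model sees both the features and the known label.
print(len(example["features"]))  # nine independent variables
```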
The following steps use the training data to build a decision tree model using Spark's Machine Learning Library (MLlib). In simple terms, the goal is to determine how well the features can predict the target variable conversion using the training data, which comprises 1546 data points.
You also need to understand the overall shape and distribution of the data to ensure downstream assumptions are as accurate as possible. The following are summary statistics for the training dataset used in this article.
The levels (yes, no, failure, etc.) are reported for each categorical variable. For numerical data, the quartiles are reported. The target variable conversion has two levels, yes and no, and you can see that no appears far more often than yes. This imbalance presents some challenges when building a classifier model such as this decision tree. However, those challenges and their mitigations are out of scope for this tutorial. For more information, see Decision tree accuracy: effect of unbalanced data.
Note that the model you build predicts (conversion = no) as being either true or false. Interpreting (conversion = no) as false in the context of the model means that (conversion = yes) is true.
Creating a training data schema reference
- Right-click the HDFS connection you previously created and choose Retrieve Schema.
Navigate to the pre-loaded training data file located at /user/puccini/machinelearning/decisiontrees/marketing/marketing_campaign_train.csv.
Click Next, name the schema and adjust the data types as needed.
In this case, the defaults are accurate.
- Click Finish.
- Add a tHDFSConfiguration component to the design workspace.
- Set Property Type to Repository.
Select the HDFS connection you created, MarketingCampaignData.
Accessing training data
- Add a tFileInputDelimited component to the design workspace.
- Set the Property Type to Repository, then choose HDFS:MarketingCampaignData.
- Click the ellipsis to the right of Folder/File and navigate to the training dataset in HDFS; in this case, it is located at /user/puccini/machinelearning/decisiontrees/marketing/marketing_campaign_train.csv.
For Schema, choose Repository and select the schema you created earlier.
Encoding training data
- Add a tModelEncoder component to the right of tFileInputDelimited.
Connect tFileInputDelimited to tModelEncoder with a Main.
- Double-click tModelEncoder and choose the Component view.
- Click Sync columns to the right of Schema.
- Click the ellipsis to edit the schema.
Add two new columns to the output: MyFeatures with the type Vector, and MyLabels with the type Double.
- Click OK.
- Click the green arrow in the Basic settings tab of the Component view to add a new transformation.
- Under Transformation, choose RFormula (Spark 1.5+).
Add the following code in the Parameters field.
featuresCol=MyFeatures;labelCol=MyLabels;formula=conversion ~ age + jobtype + maritalstatus + educationlevel + indefault + hasmortgage + haspersonalloan + numcampaigncalls + priorcampaignoutcome
The two columns added to the schema, MyFeatures and MyLabels, are referenced here. The formula uses the standard model-formula syntax of the programming language R, which is used for statistical computing and advanced graphics. For more information, see The R Project.
In the sampling of the data, there were nine features and one target. In the R formula above, the target you want to predict, conversion, is on the left of the tilde; all columns to the right of the tilde are the features. The two remaining parameters, featuresCol and labelCol, name the columns that receive the generated feature vectors and labels.
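As a rough sketch of what this transformation produces (a simplified pure-Python illustration, not Spark's actual RFormula implementation), categorical features are one-hot encoded into a numeric vector and the yes/no target becomes a numeric label:

```python
# Simplified illustration (not Talend's or Spark's actual implementation) of
# what RFormula does: turn named columns into a numeric feature vector
# (one-hot encoding categoricals) and a numeric label.

def encode_row(row, categorical_levels, feature_order, label_col):
    """Build (features, label) the way an R-style formula would."""
    features = []
    for col in feature_order:
        if col in categorical_levels:
            # One-hot encode: one 0/1 slot per known level.
            features.extend(
                1.0 if row[col] == level else 0.0
                for level in categorical_levels[col]
            )
        else:
            features.append(float(row[col]))  # numeric columns pass through
    label = 1.0 if row[label_col] == "yes" else 0.0
    return features, label

# A reduced row with only three of the nine features, for brevity.
row = {"age": 35, "hasmortgage": "yes", "numcampaigncalls": 3, "conversion": "no"}
features, label = encode_row(
    row,
    categorical_levels={"hasmortgage": ["no", "yes"]},
    feature_order=["age", "hasmortgage", "numcampaigncalls"],
    label_col="conversion",
)
print(features, label)  # [35.0, 0.0, 1.0, 3.0] 0.0
```

Spark's RFormula additionally indexes string labels by frequency, so the exact numeric encoding can differ; the shape of the output (vector plus numeric label) is the point here.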
Training the decision tree model
- Add a tDecisionTreeModel component to the design workspace.
- Connect tModelEncoder to tDecisionTreeModel with a Main.
- Double-click tDecisionTreeModel and choose the Component view.
- Select the check box below Storage to choose HDFS storage.
- Choose the schema you created earlier.
- In Features Column, choose MyFeatures.
- In Label Column, choose MyLabels.
- Select the check box below Model location and set the HDFS path to /user/puccini/machinelearning/decisiontrees/marketing/decisiontree.model.
Leave the default value for the rest of the settings.
Your final job should look as follows.
- Click the Run tab and go to Spark Configuration.
Select the Use local mode check box.
You can also run this job directly on the Hadoop cluster, which is the most likely scenario in a production setting. For that, you need to make a few small adjustments to how the job runs, including clearing the Use local mode check box.
Configuring your Job to run on the Hadoop cluster
- Click Spark Configuration on the Run tab.
Add the following Advanced properties.
The value is specific to the distribution and version of Hadoop. This tutorial uses Hortonworks 2.4 V3, whose version string is 2.4.0.0-169. Your entry for this parameter will be different if you do not use Hortonworks 2.4 V3.
Note:
When running the code on the cluster, it is crucial to ensure that there is unfettered access between the two systems. In this example, you have to ensure that the Hortonworks cluster can communicate with your instance of Talend Studio. This is necessary because Spark, even though it is running on the cluster, still needs to reference the Spark drivers shipped with Talend. Moreover, if you deploy a Spark Job into a production environment, it will be run from a Talend Job server (edge node). You also need to ensure that there is unfettered communication between it and the cluster.
For more information on the ports needed by each service, see the Spark Security documentation.
Click the Advanced settings tab and add a new JVM argument that indicates the version of Hadoop. It is the string you added as value in the previous step.
Click the Basic Run tab, then click Run.
When the Job completes, a message indicates success.
Navigate to the HDFS directory (using Ambari, in this case) to verify that the model was created and persisted to HDFS.
Running the decision tree model using test data
- Create a new Big Data Batch Job, specifying Spark as the framework.
- Copy the tHDFSConfiguration component from the previous Job and paste it into the design workspace.
- Copy the tFileInputDelimited component from the previous Job and paste it into the design workspace.
In tFileInputDelimited, change the Folder/File value to point to the testing data.
The test data has the same schema as the training data. The only differences are the content details and the number of rows.
- Add a tPredict component to the design workspace. Connect tFileInputDelimited to tPredict with a Main.
- Double-click tPredict.
- Select the Define a storage configuration component check box and choose tHDFSConfiguration.
- Choose Decision Tree Model as Model Type.
Add the path to the model you created in the previous section.
Click the Sync columns button, then click the ellipsis to edit the schema.
The output panel adds a new column named label. This is the placeholder for the predicted value produced by the decision model.
Add a tReplace component to the design workspace and connect tPredict to it with a Main.
Configure tReplace as follows.
The tReplace component is needed to convert the prediction output of tPredict from its numeric representation (0.0/1.0) to the representation used in the testing data (yes/no).
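In plain Python terms, the replacement amounts to a simple lookup. The 0.0 -> "no" direction shown below is an assumption for illustration; the actual numeric encoding depends on how the labels were indexed during training.

```python
# What tReplace accomplishes here, sketched in plain Python: map the numeric
# prediction emitted by the model back to the yes/no encoding of the test
# data. The 0.0 -> "no" mapping is an assumption for illustration.
def decode_prediction(value):
    return {0.0: "no", 1.0: "yes"}[value]

print([decode_prediction(v) for v in [0.0, 1.0, 0.0]])  # ['no', 'yes', 'no']
```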
- Add a tAggregateRow and connect tReplace to tAggregateRow with a Main.
Configure tAggregateRow as follows.
The Output column in the Operations section was chosen arbitrarily; age serves no purpose here other than providing a column to count for the Group by.
tAggregateRow creates the summary statistics of model performance that are used in the next section.
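Conceptually, the aggregation counts rows grouped by the pair of actual and predicted outcomes. A plain-Python sketch with a few illustrative rows:

```python
# What tAggregateRow computes here, sketched in plain Python: count rows
# grouped by (actual outcome, predicted outcome). The rows are illustrative.
from collections import Counter

rows = [
    {"conversion": "no", "label": "no"},
    {"conversion": "no", "label": "no"},
    {"conversion": "yes", "label": "no"},
    {"conversion": "no", "label": "yes"},
]
summary = Counter((r["conversion"], r["label"]) for r in rows)
for (actual, predicted), count in sorted(summary.items()):
    print(count, actual, predicted)
```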
Add a tLogRow component to the design workspace and connect tAggregateRow to it.
Your Job should look as follows.
Run the Job.
As with the training Job you created earlier, you can run this Job either locally or on the cluster.
The expected outcome of this Job is a tabular summary that demonstrates model prediction versus the actual true outcome.
|count (age)||conversion (actual outcome)||label (predicted outcome)|
|41||yes||no|
|12||no||yes|
|15||yes||yes|
|446||no||no|
- The model incorrectly predicted (conversion = no) as true for 41 of the test cases
- The model incorrectly predicted (conversion = no) as false for 12 of the test cases
- The model accurately predicted (conversion = no) as false for 15 of the test cases
- The model accurately predicted (conversion = no) as true for 446 of the test cases
Understanding data science basics
The following concepts play a crucial role in machine learning and are part of the standard tools used by data scientists to evaluate classification models.
- Confusion Matrix: specialized table that makes it easy to visually observe classification model performance against test data where the outcomes are known (supervised learning)
- True Negative (TN): prediction matches the actual outcome; correct rejection
- True Positive (TP): prediction matches the actual outcome; correct hit
- False Negative (FN): prediction misses the actual outcome; erroneous rejection (Type II error)
- False Positive (FP): prediction misses the actual outcome; erroneous hit (Type I error)
- Accuracy: how often the classifier is correct on the whole. A = (TP+TN)/Total
- True Positive Rate (Sensitivity): TP/(TP+FN)
- True Negative Rate (Specificity): TN/(FP+TN)
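These three formulas can be expressed as small Python functions:

```python
# The three evaluation formulas above, expressed as functions of the four
# confusion-matrix counts.
def accuracy(tp, tn, fp, fn):
    return (tp + tn) / (tp + tn + fp + fn)

def sensitivity(tp, fn):   # true positive rate
    return tp / (tp + fn)

def specificity(tn, fp):   # true negative rate
    return tn / (fp + tn)
```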
Below is a generalized confusion matrix that demonstrates how it is laid out.
Here is a simple but concrete example of the use of the general confusion matrix. Assume you have trained a model to analyze a series of images of cats and dogs to identify which images are cats and which are not (in this case, they are dogs). If your model is perfect, it will predict with 100% accuracy. There is also the possibility that your model results in 0% accuracy. However, the most likely outcome is somewhere in between, and this is where a confusion matrix can help.
Below is a hypothetical outcome.
The hypothetical model accurately predicted 15 cat images (TP) and 10 dog, or not-cat, images (TN). However, the model also falsely identified 40 dogs as cats (FP) and falsely identified 35 cats as dogs (FN).
- Accuracy of this classifier: (15+10) / (15+35+40+10) = .25
- Sensitivity of this classifier: 15/(15+35) = .3
- Specificity of this classifier: 10/(40+10) = .2
The conclusion is that this model on the whole is correct 25% of the time (accuracy). When the image is a cat, this model accurately predicts a cat 30% of the time (sensitivity). And when the image is not a cat, this model accurately predicts that it is not a cat 20% of the time (specificity).
Evaluating your decision tree performance
Below is a confusion matrix using the data from your test Job.
The model tries to predict (conversion = no) as being either true or false.
- TN = 15
- TP = 446
- FN = 12
- FP = 41
- Accuracy = (TP+TN)/Total = (15+446)/(446+15+12+41) = .90
- Sensitivity = TP/(TP+FN) = (446)/(446+12) = .97
- Specificity = TN/(TN+FP) = (15)/(15+41) = .27
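You can verify this arithmetic with a few lines of Python:

```python
# Checking the arithmetic above for the bank-marketing model's confusion
# counts, rounded to two decimal places as in the text.
tn, tp, fn, fp = 15, 446, 12, 41
total = tp + tn + fp + fn

print(round((tp + tn) / total, 2))   # accuracy    -> 0.9
print(round(tp / (tp + fn), 2))      # sensitivity -> 0.97
print(round(tn / (tn + fp), 2))      # specificity -> 0.27
```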
When you tested the tree model:
- It was correct 90% of the time (accuracy)
- It accurately predicted 97% of those persons who did not result in a conversion (sensitivity)
- It accurately predicted 27% of those persons who did result in a conversion (specificity)