Machine Learning 101 - Decision Trees
Overview
This hands-on tutorial demonstrates the basics of developing a machine learning routine using Talend and Spark. Specifically, it uses decision tree learning to classify real-life bank marketing data. Upon completion, you will have a working knowledge of how machine learning is integrated into a Talend workflow, along with some reusable code snippets.
The source data used in this tutorial was retrieved from the UCI Machine Learning Repository (Irvine, CA: University of California, School of Information and Computer Science). It is available in the public domain and is attributed to: [Moro et al., 2014] S. Moro, P. Cortez and P. Rita. "A Data-Driven Approach to Predict the Success of Bank Telemarketing." Decision Support Systems, Elsevier, 62:22-31, June 2014. Dataset: Bank Marketing Data Set.
Prerequisites
- You have Hortonworks 2.4 (HDP) installed and configured. You can also use the Hortonworks Sandbox, a downloadable virtual machine (VM). For more information, see Hortonworks - Getting Started.
- You have basic knowledge of the Hadoop ecosystem's tools and technologies.
- You have basic knowledge of the Hadoop Distributed File System (HDFS) and Spark.
- You have working knowledge of Talend Studio and Talend Big Data Platform.
- You have Talend Big Data Platform installed and configured. Any license model above this platform also works.
Creating a Spark Job for machine learning
Procedure
Creating a Hadoop cluster for machine learning
Procedure
Sampling of Machine Learning Repository data
This section details a sample of the data used in this tutorial.
This tutorial is not intended to teach data science or to detail a formal data analysis, but it is helpful to see a sample of the data.
For more information about this dataset, see UCI Machine Learning Repository.
There are ten variables, nine independent and one dependent:
- Independent: age, jobtype, maritalstatus, educationlevel, indefault, hasmortgage, haspersonalloan, numcampaigncalls, priorcampaignoutcome
- Dependent: conversion
The independent variables, also known as feature variables, are used to predict an outcome. The dependent variable, or target variable, is what you want to predict. Each tuple in the data sample contains the features and the target variable, both of which are needed to train your decision tree model. This type of training is called supervised learning, because the data contains both an input vector of features and a known output value.
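To make the feature/target distinction concrete, here is a minimal plain-Python sketch that splits each record into an input vector and a known output value. The column names come from the list above; the two rows and the helper name split_features_target are invented for illustration and are not taken from the dataset or from Talend.

```python
# Column names follow the tutorial; the row values below are hypothetical.
FEATURES = [
    "age", "jobtype", "maritalstatus", "educationlevel", "indefault",
    "hasmortgage", "haspersonalloan", "numcampaigncalls", "priorcampaignoutcome",
]
TARGET = "conversion"

# Two made-up records, used only to show the shape of a training tuple.
rows = [
    {"age": 35, "jobtype": "technician", "maritalstatus": "married",
     "educationlevel": "secondary", "indefault": "no", "hasmortgage": "yes",
     "haspersonalloan": "no", "numcampaigncalls": 2,
     "priorcampaignoutcome": "failure", "conversion": "no"},
    {"age": 52, "jobtype": "management", "maritalstatus": "single",
     "educationlevel": "tertiary", "indefault": "no", "hasmortgage": "no",
     "haspersonalloan": "no", "numcampaigncalls": 1,
     "priorcampaignoutcome": "success", "conversion": "yes"},
]


def split_features_target(row):
    """Return (feature vector, target value) for one record."""
    features = [row[name] for name in FEATURES]
    return features, row[TARGET]


# X holds one feature vector per record, y the matching known outcomes.
X, y = zip(*(split_features_target(row) for row in rows))
```

In supervised learning, the model is fitted on pairs like (X[i], y[i]); at prediction time only the feature vector is available.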
The following steps use the training data to build a decision tree model using Spark's Machine Learning Library (MLlib). In simple terms, the goal is to determine how well the features can predict the target variable conversion using the training data, which comprises 1546 data points.
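As a rough illustration of how a decision tree judges whether a feature helps predict the target, the sketch below computes Gini impurity, one of the impurity measures that Spark MLlib supports for classification trees. It is a simplified plain-Python example, not MLlib's implementation; the labels and the candidate split are hypothetical.

```python
from collections import Counter


def gini(labels):
    """Gini impurity of a list of class labels (0.0 means perfectly pure)."""
    n = len(labels)
    if n == 0:
        return 0.0
    counts = Counter(labels)
    return 1.0 - sum((count / n) ** 2 for count in counts.values())


def weighted_gini(left, right):
    """Impurity of a split, weighting each side by its share of the rows."""
    n = len(left) + len(right)
    return (len(left) / n) * gini(left) + (len(right) / n) * gini(right)


# Hypothetical conversion labels before and after a candidate split.
parent = ["no", "no", "no", "yes", "yes", "no"]
left = ["no", "no", "no", "no"]   # rows where the split condition holds
right = ["yes", "yes"]            # rows where it does not

# A larger reduction in impurity means the split separates the classes better.
gain = gini(parent) - weighted_gini(left, right)
```

A tree learner evaluates many candidate splits in this way and keeps the one with the largest reduction in impurity at each node.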
You also need to understand the overall shape and distribution of the data to ensure downstream assumptions are as accurate as possible. The following are summary statistics for the training dataset used in this tutorial.
The levels (yes, no, failure, etc.) are reported for each categorical variable. For numerical data, the quartiles are reported. The target variable conversion has two levels, yes and no, and you can see that no appears far more often than yes. This imbalance presents some challenges when building a classifier model like the decision tree you are building. However, these challenges and the associated mitigations are out of scope for this tutorial and are not discussed. For more information, see Decision tree accuracy: effect of unbalanced data.
Note that the model you build predicts whether (conversion = no) is true or false. When the model reports (conversion = no) as false, interpret that as (conversion = yes) being true.
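One quick way to see why the imbalance matters is to compare the class proportions with the accuracy of a trivial model that always predicts the majority class. The sketch below uses invented labels with a 9:1 ratio; the real ratio in the training data is different and should be read from the summary statistics.

```python
from collections import Counter


def class_distribution(labels):
    """Return the share of each class label as a fraction of all rows."""
    total = len(labels)
    counts = Counter(labels)
    return {label: count / total for label, count in counts.items()}


def majority_baseline_accuracy(labels):
    """Accuracy obtained by always predicting the most frequent class."""
    return max(class_distribution(labels).values())


# Hypothetical target column: nine "no" outcomes for every "yes".
labels = ["no"] * 9 + ["yes"]

distribution = class_distribution(labels)
baseline = majority_baseline_accuracy(labels)
```

With these made-up labels, a model that always answers no would already be correct 90% of the time, so raw accuracy alone says little about how well the yes cases are detected.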
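The interpretation above can be written as a small mapping. The function name interpret_prediction is hypothetical and only illustrates the rule; it is not part of Talend or Spark.

```python
def interpret_prediction(conversion_is_no: bool) -> str:
    """Translate the model's (conversion = no) truth value into a label."""
    # True means the model predicts no conversion;
    # False means the model predicts a conversion.
    return "no" if conversion_is_no else "yes"


# Example: the model reports (conversion = no) as false for a customer.
predicted_label = interpret_prediction(False)
```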