Machine Learning 101 - Decision Trees
This hands-on tutorial demonstrates the basics of developing a machine learning routine using Talend and Spark. Specifically, decision tree learning is used to classify real-life bank marketing data. Upon completion, you will have a working knowledge of how machine learning is integrated into a Talend workflow, along with some reusable code snippets.
The source data used in this tutorial was retrieved from the UCI Machine Learning Repository (Irvine, CA: University of California, School of Information and Computer Science), where it is listed as the Bank Marketing Data Set. It is available in the public domain and is attributed to: [Moro et al., 2014] S. Moro, P. Cortez and P. Rita. "A Data-Driven Approach to Predict the Success of Bank Telemarketing." Decision Support Systems, Elsevier, 62:22-31, June 2014.
Prerequisites
- You have Hortonworks Data Platform (HDP) 2.4 installed and configured. You can also use the Hortonworks Sandbox, a downloadable virtual machine (VM). For more information, see Hortonworks - Getting Started.
- You have basic knowledge of the Hadoop ecosystem's tools and technologies.
- You have basic knowledge of Hadoop Distributed File System (HDFS) and Spark.
- You have working knowledge of Talend Studio and Talend Big Data Platform.
- You have Talend Big Data Platform installed and configured. Any license tier above this platform also works.
Creating a Spark Job for machine learning
- Open Talend Studio and expand Job Designs.
Right-click Big Data Batch and create a new job specifying Spark as the framework.
Creating a Hadoop cluster for machine learning
- Expand Metadata.
Right-click Hadoop Cluster and create a new cluster.
Specify a Linux OS user on the cluster.
Here, the user puccini was already created.
Training and test data used in this article have been slightly modified from the original source and pre-loaded into HDFS. Those data sets can be downloaded below.
Configure the HDFS connection as follows.
Sampling of Machine Learning Repository data
This section details a sample of the data used in this tutorial.
This tutorial is not intended to teach data science or to detail a formal data analysis, but it is helpful to see a sample of the data.
For more information about this dataset, see UCI Machine Learning Repository.
There are ten variables, nine independent and one dependent:
- Independent: age, jobtype, maritalstatus, educationlevel, indefault, hasmortgage, haspersonalloan, numcampaigncalls, priorcampaignoutcome
- Dependent: conversion
The independent variables, also known as feature variables, are used to predict an outcome. The dependent variable, or target variable, is what you want to predict. The data sample above shows tuples that contain both features and a target variable, and both are needed to train your decision tree model. This type of training is called supervised learning, because each record contains both an input vector of features and a known output value.
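As a minimal sketch of that idea, the following Python snippet separates one labeled record into its feature vector and its target value. The column names come from the list above; the record values and the helper name `split_record` are invented for illustration and are not part of the Talend Job.

```python
# Column names are taken from the tutorial; everything else is illustrative.
FEATURES = [
    "age", "jobtype", "maritalstatus", "educationlevel", "indefault",
    "hasmortgage", "haspersonalloan", "numcampaigncalls", "priorcampaignoutcome",
]
TARGET = "conversion"


def split_record(record):
    """Return (feature_values, target_value) for one labeled record."""
    feature_values = [record[name] for name in FEATURES]
    return feature_values, record[TARGET]


# Hypothetical record: the values are invented, not taken from the data set.
example = {
    "age": 41, "jobtype": "technician", "maritalstatus": "married",
    "educationlevel": "secondary", "indefault": "no", "hasmortgage": "yes",
    "haspersonalloan": "no", "numcampaigncalls": 2,
    "priorcampaignoutcome": "failure", "conversion": "no",
}

features, label = split_record(example)
print(len(features), label)
```

A supervised learner sees many such pairs during training and is later asked to predict the label when only the features are known.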
The following steps use the training data to build a decision tree model using Spark's Machine Learning Library (MLlib). In simple terms, the goal is to determine how well the features can predict the target variable conversion using the training data, which comprises 1546 data points.
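To give a feel for what a decision tree learner does with training data, the following Python sketch scores two candidate splits by Gini impurity, a common impurity measure for classification trees. It is an illustration only, not Spark MLlib's implementation and not the code Talend generates; the toy rows and the helper names `gini` and `weighted_gini_after_split` are invented.

```python
from collections import Counter


def gini(labels):
    """Gini impurity of a list of class labels: 1 - sum(p_k ** 2)."""
    total = len(labels)
    if total == 0:
        return 0.0
    counts = Counter(labels)
    return 1.0 - sum((n / total) ** 2 for n in counts.values())


def weighted_gini_after_split(rows, feature_index, value):
    """Weighted impurity of the two groups produced by feature == value."""
    left = [label for feats, label in rows if feats[feature_index] == value]
    right = [label for feats, label in rows if feats[feature_index] != value]
    total = len(rows)
    return (len(left) / total) * gini(left) + (len(right) / total) * gini(right)


# Invented toy rows: ([priorcampaignoutcome, hasmortgage], conversion)
rows = [
    (["success", "no"], "yes"),
    (["success", "yes"], "yes"),
    (["failure", "yes"], "no"),
    (["failure", "no"], "no"),
    (["unknown", "yes"], "no"),
    (["unknown", "no"], "yes"),
]

before = gini([label for _, label in rows])
after_prior = weighted_gini_after_split(rows, 0, "success")
after_mortgage = weighted_gini_after_split(rows, 1, "yes")

# A lower weighted impurity means the split separates the classes better.
print(round(before, 3), round(after_prior, 3), round(after_mortgage, 3))
```

On these toy rows, splitting on priorcampaignoutcome lowers the impurity more than splitting on hasmortgage, so a tree learner using this criterion would prefer the first split. A real learner repeats this comparison over all features and candidate values, then recurses into each resulting group.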
You also need to understand the overall shape and distribution of the data to ensure downstream assumptions are as accurate as possible. The following are summary statistics for the training dataset used in this article.
The levels (yes, no, failure, etc.) are reported for each categorical variable. For numerical data, the quartiles are reported. The target variable conversion has two levels, yes and no, and no appears far more often than yes. This imbalance presents some challenges when building a classifier such as the decision tree in this tutorial. However, these challenges and their mitigations are out of scope here. For more information, see Decision tree accuracy: effect of unbalanced data.
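A minimal Python sketch of this kind of summary, assuming the target values and one numeric column have already been loaded into lists, might look as follows. The values are invented and do not reproduce the tutorial's statistics. The sketch also computes the accuracy of always predicting the majority level, which shows why imbalance makes raw accuracy misleading.

```python
from collections import Counter
from statistics import quantiles

# Invented values, for illustration only; not the tutorial's data.
conversions = ["no", "no", "yes", "no", "no", "no", "yes", "no", "no", "no"]
ages = [25, 31, 38, 42, 47, 52, 58, 33, 29, 61]

# Level counts for a categorical variable.
level_counts = Counter(conversions)

# Accuracy of a trivial model that always predicts the most frequent level.
majority_label, majority_count = level_counts.most_common(1)[0]
baseline_accuracy = majority_count / len(conversions)

# Quartiles for a numeric variable (three cut points: Q1, median, Q3).
q1, median, q3 = quantiles(ages, n=4)

print(dict(level_counts), majority_label, baseline_accuracy)
print(q1, median, q3)
```

In this toy sample, a model that ignores every feature and always answers no is already right 80% of the time, so a trained classifier has to be judged against that baseline rather than against 50%.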
Note that the model you build predicts whether (conversion = no) is true or false. When the model predicts that (conversion = no) is false, this means that (conversion = yes) is true.
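The interpretation can be sketched in Python as a pair of small helpers. The function names are hypothetical and only illustrate the mapping between the boolean outcome and the two levels of conversion; they are not part of the Talend Job.

```python
def conversion_is_no(conversion):
    """Encode the target as the boolean the model predicts: conversion = no."""
    return conversion == "no"


def interpret_prediction(predicted_no):
    """Map the boolean prediction back to a level of conversion."""
    return "no" if predicted_no else "yes"


# A prediction of False for (conversion = no) reads as conversion = yes.
print(interpret_prediction(False))
```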
Creating a training data schema reference
- Right-click the HDFS connection you previously created and choose Retrieve Schema.
Navigate to the pre-loaded training data file located at /user/puccini/machinelearning/decisiontrees/marketing/marketing_campaign_train.csv.
Click Next, name the schema and adjust the data types as needed.
In this case, the defaults are accurate.
- Click Finish.
- Add a tHDFSConfiguration component from the Palette to the design workspace.
- Set Property Type to Repository.
Select the HDFS connection you created, MarketingCampaignData.
Accessing training data
- Add a tFileDelimitedInput component from the Palette to the design workspace.
- Set the Property Type to Repository, then choose HDFS:MarketingCampaignData.
- Click the ellipsis to the right of Folder/File and navigate to the training dataset in HDFS. In this case, it is located at /user/puccini/machinelearning/marketing/marketing_campaign_train.csv.
For Schema, choose Repository and select the schema you created earlier.
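Outside of Talend Studio, the effect of pairing a schema with a delimited input can be approximated with the following Python sketch. It is not what the tFileDelimitedInput component runs. The column names come from this tutorial, but the column types, the comma delimiter, the header row, and the sample rows are all assumptions made for illustration, and `read_delimited` is a hypothetical helper.

```python
import csv
import io

# Assumed schema: names from the tutorial, types guessed for illustration.
SCHEMA = [
    ("age", int), ("jobtype", str), ("maritalstatus", str),
    ("educationlevel", str), ("indefault", str), ("hasmortgage", str),
    ("haspersonalloan", str), ("numcampaigncalls", int),
    ("priorcampaignoutcome", str), ("conversion", str),
]


def read_delimited(text_stream, delimiter=",", has_header=True):
    """Parse delimited text into dicts, casting each field per SCHEMA."""
    reader = csv.reader(text_stream, delimiter=delimiter)
    if has_header:
        next(reader, None)  # skip the header row
    rows = []
    for raw in reader:
        if len(raw) != len(SCHEMA):
            raise ValueError(f"expected {len(SCHEMA)} fields, got {len(raw)}")
        rows.append({name: cast(value) for (name, cast), value in zip(SCHEMA, raw)})
    return rows


# Invented sample content standing in for the training file in HDFS.
sample = io.StringIO(
    "age,jobtype,maritalstatus,educationlevel,indefault,hasmortgage,"
    "haspersonalloan,numcampaigncalls,priorcampaignoutcome,conversion\n"
    "41,technician,married,secondary,no,yes,no,2,failure,no\n"
    "29,student,single,tertiary,no,no,no,1,unknown,yes\n"
)

rows = read_delimited(sample)
print(len(rows), rows[0]["age"], rows[1]["conversion"])
```

Keeping the schema in one place, as the repository schema does in Talend, means every component that reads the file agrees on the column order and types.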