This hands-on tutorial demonstrates the basics of developing a machine learning routine using Talend and Spark. Specifically, it uses decision tree learning to classify real-life bank marketing data. Upon completion, you will have a working knowledge of how machine learning is integrated into a Talend workflow, along with some reusable code snippets.
The source data used in this tutorial was retrieved from the UCI Machine Learning Repository (Irvine, CA: University of California, School of Information and Computer Science). It is available in the public domain and is attributed to: [Moro et al., 2014] S. Moro, P. Cortez and P. Rita. "A Data-Driven Approach to Predict the Success of Bank Telemarketing." Decision Support Systems, Elsevier, 62:22-31, June 2014. The data set is listed in the repository as the Bank Marketing Data Set.
This tutorial assumes the following prerequisites:
- You have Hortonworks Data Platform (HDP) 2.4 installed and configured. You can also use the Hortonworks Sandbox, a downloadable virtual machine (VM). For more information, see Create HDFS Metadata - Hortonworks.
- You have basic knowledge of the Hadoop ecosystem's tools and technologies.
- You have basic knowledge of the Hadoop Distributed File System (HDFS) and Spark.
- You have working knowledge of Talend Studio and Talend Big Data Platform.
- You have Talend Big Data Platform installed and configured. Any license level above this platform also works.