Grouping customer numerical data into clusters on HDFS (deprecated) - 7.0

Machine Learning

author
Talend Documentation Team
EnrichVersion
7.0
EnrichProdName
Talend Big Data
Talend Big Data Platform
Talend Data Fabric
Talend Real-Time Big Data Platform
task
Data Governance > Third-party systems > Machine Learning components
Data Quality and Preparation > Third-party systems > Machine Learning components
Design and Development > Third-party systems > Machine Learning components
EnrichPlatform
Talend Studio

This scenario applies only to subscription-based Talend products with Big Data.

For more technologies supported by Talend, see Talend components.

The scenario is inspired from a research paper on model-based clustering. Its data can be found at Wholesale customers Data Set. The research paper is available at Enhancing the selection of a model-based clustering with external categorical variables. This scenario is included in the Data Quality Demos project you can import into your Talend Studio . For further information, see the Talend Studio User Guide.

The Job in this scenario connects to a given Hadoop distributed file system (HDFS), groups customers of a "wholesale distributor" into two clusters using the algorithms in tMahoutClustering and outputs data on a given HDFS.

The data set has 440 samples that refer to clients of a wholesale distributor. It includes the annual spending in monetary units on diverse product categories like fresh and grocery products or milk.

The data set refers to customers from different channels - Horeca (Hotel/Restaurant/Cafe) or Retail (sale of goods in small quantities) channel, and from different regions (Lisbon/Oporto/other).

This Job uses:

  • tMahoutClustering to compute the clusters for the input data set.

  • two tAggregateRow components to count the number of clients in both clusters based on the region and channel columns.

  • three tMap components to map the channel and region input flows into two separate output flows. The components are also used to map the single clusterID column received from tMahoutClustering to two-column data flow that feed the region and the channel clusters.

  • two tHDFSOutput components to write data to HDFS in two output files.

Prerequisites: Before being able to use the tMahoutClustering component, you must have a functional Hadoop system.