Modeling the accident-prone areas in a city

Machine Learning

author: Talend Documentation Team
EnrichVersion: 6.5
EnrichProdName: Talend Big Data; Talend Big Data Platform; Talend Data Fabric; Talend Real-Time Big Data Platform
task: Data Governance > Third-party systems > Machine Learning components; Data Quality and Preparation > Third-party systems > Machine Learning components; Design and Development > Third-party systems > Machine Learning components
EnrichPlatform: Talend Studio

This scenario applies only to subscription-based Talend products with Big Data.

For more technologies supported by Talend, see Talend components.

In this scenario, the tKMeansModel component analyzes a set of sample geographical data about the destinations of ambulances in a city in order to model the accident-prone areas.

A model like this can be employed to help determine the optimal locations for building hospitals.

You can download this sample data from here. It consists of pairs of latitudes and longitudes.

The sample data was randomly and automatically generated for demonstration purposes only and does not in any way reflect the real-world situation in these areas.

Prerequisites:
  • The Spark version to be used is 1.4 or later.

  • The sample data is stored in your Hadoop file system and you have at least read permission on it.

  • Your Hadoop cluster is properly installed and is running.

If you are not sure about these requirements, ask the administrator of your Hadoop system.

The components to be used are:
  • tHDFSConfiguration: it defines the HDFS connection to be used by Spark and by the other components.

  • tFileInputDelimited: it loads the sample data into the data flow of the Job.

  • tReplicate: it replicates the sample data and caches the replicated copy.

  • tKMeansModel: it analyzes the data to train the model and writes the model to HDFS.

  • tModelEncoder: it pre-processes the data to prepare proper feature vectors to be used by tKMeansModel.

  • tPredict: it applies the K-Means model to the replicated copy of the sample data. In real-world practice, this data should be a separate set of reference data used to test the accuracy of the model.

  • tFileOutputDelimited: it writes the result of the prediction to HDFS.
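
To make the roles of these components easier to picture, the data flow can be sketched as a small stand-alone program. The following is a minimal sketch in plain Python, not the Spark code that Talend Studio generates. It makes several assumptions that do not come from this scenario: the function names are hypothetical, the field separator is assumed to be a semicolon, the six sample coordinates are invented for illustration, and clustering uses plain Euclidean distance on raw degrees. Each function is loosely matched to the component it imitates.

```python
import random

# Hypothetical sample rows: "latitude;longitude" (separator is an assumption).
SAMPLE_LINES = [
    "48.85;2.35", "48.86;2.36", "48.84;2.34",   # one dense area
    "48.95;2.55", "48.96;2.56", "48.94;2.54",   # another dense area
]

def parse_points(lines, sep=";"):
    """Roughly what tFileInputDelimited and tModelEncoder provide:
    rows turned into numeric feature vectors (lat, lon)."""
    points = []
    for line in lines:
        line = line.strip()
        if line:
            lat, lon = line.split(sep)
            points.append((float(lat), float(lon)))
    return points

def nearest(point, centers):
    """Index of the center closest to point (squared Euclidean distance)."""
    return min(
        range(len(centers)),
        key=lambda i: (point[0] - centers[i][0]) ** 2
        + (point[1] - centers[i][1]) ** 2,
    )

def train_kmeans(points, k, iterations=20, seed=0):
    """Roughly the training role of tKMeansModel: basic Lloyd iterations."""
    rng = random.Random(seed)
    centers = rng.sample(points, k)          # start from k distinct data points
    for _ in range(iterations):
        clusters = [[] for _ in range(k)]
        for p in points:
            clusters[nearest(p, centers)].append(p)
        new_centers = []
        for i, members in enumerate(clusters):
            if members:                      # move the center to the cluster mean
                new_centers.append((
                    sum(m[0] for m in members) / len(members),
                    sum(m[1] for m in members) / len(members),
                ))
            else:                            # keep the center of an empty cluster
                new_centers.append(centers[i])
        if new_centers == centers:           # assignments are stable: stop early
            break
        centers = new_centers
    return centers

def predict(points, centers):
    """Roughly the role of tPredict: label each point with its cluster index."""
    return [nearest(p, centers) for p in points]

points = parse_points(SAMPLE_LINES)
centers = train_kmeans(points, k=2)
labels = predict(points, centers)
for (lat, lon), label in zip(points, labels):
    print(f"{lat};{lon};{label}")            # what tFileOutputDelimited would write
```

In this sketch, the cluster centers returned by train_kmeans stand in for the modeled accident-prone areas, and the same points are used for both training and prediction, which mirrors the use of tReplicate in this scenario.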