tNaiveBayesModel properties in Spark Batch Jobs - 6.1

Talend Components Reference Guide


Component family

Machine Learning / Classification

 

Basic settings

Define a storage configuration component

Select the configuration component to be used to provide the configuration information for the connection to the target file system such as HDFS or S3.

If you leave this check box clear, the target file system is the local system.

Note that the configuration component to be used must be present in the same Job. For example, if you have dropped a tHDFSConfiguration component in the Job, you can select it to write the result in a given HDFS system.
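For illustration only, the Scala sketch below shows roughly what a storage configuration component contributes to the Job: it points the Spark context at a given file system. The namenode address is a hypothetical placeholder; the code Talend actually generates differs.

// Sketch only (not Talend-generated code): pointing a SparkContext at a
// given HDFS system, roughly the role a tHDFSConfiguration component plays.
// The namenode host and port are hypothetical placeholders.
import org.apache.spark.{SparkConf, SparkContext}

val sc = new SparkContext(new SparkConf().setAppName("NaiveBayesJob"))
sc.hadoopConfiguration.set("fs.defaultFS", "hdfs://namenode:8020")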

 

Column type

Complete this table to define the feature type of each input column in order to compute the classifier model.

  • Column: this column lists the input columns automatically retrieved from the input schema.

  • Usage: select the type of the feature that the records from each input column represent.

    For example, people's ages represent a continuous feature, while people's genders represent a categorical feature (also called a discrete feature).

    If you select Label for an input column, the records of this column are used as the class names (Target in terms of classification) of the elements to be classified; if you need to ignore a column in the model computation, select Unused.

  • Bin edges: this column is activated only when the input column represents a continuous feature. It allows you to discretize the continuous data into bins, that is, to partition the continuous data into half-open segments, with the boundary values enclosed in double quotation marks.

    For example, if you enter "18;35" for a column holding people's ages, the ages are grouped into three segments: ages less than or equal to 18 go into one segment, ages greater than 18 and less than or equal to 35 into a second, and ages greater than 35 into the third (see the sketch after this list).

  • Categories: this column is activated only when the input column represents a categorical feature. You need to enter the name of each category to be used and separate the names using a semicolon (;), for example, "male;female".

    Note that the categories you enter must exist in the input column.

  • Class name: this column is activated only when the Label option has been selected in the Usage column. You need to enter the names of the classes used in the classification and separate them using a semicolon (;), for example, "platinum-level customer;gold-level customer".
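For illustration, the Scala sketch below (not the code Talend generates) shows how bin edges such as "18;35" could discretize a continuous column using Spark ML's Bucketizer. The column names and the inputDf DataFrame are hypothetical, and note that Bucketizer's intervals are left-closed [a, b), while the segments described above are right-closed.

// Sketch only: discretizing a continuous "age" column with the bin edges
// "18;35". The outer infinite bounds make the splits cover all values.
import org.apache.spark.ml.feature.Bucketizer

val splits = Array(Double.NegativeInfinity, 18.0, 35.0, Double.PositiveInfinity)

val bucketizer = new Bucketizer()
  .setInputCol("age")        // hypothetical input column
  .setOutputCol("ageBucket") // 0.0, 1.0 or 2.0: one value per segment
  .setSplits(splits)

val discretized = bucketizer.transform(inputDf) // inputDf: hypothetical DataFrame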

Training percentage

Enter the percentage (expressed as a decimal, for example, 0.8 for 80%) of the input data to be used to train the classifier model. The rest of the data is used to test the model.
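For illustration, a training percentage of 0.8 corresponds to a random split such as the one in the following Spark sketch, where data is a hypothetical name for the featurized input:

// Sketch only: 80% of the records train the model, the remaining 20% test it.
val Array(training, test) = data.randomSplit(Array(0.8, 0.2), seed = 12345L)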

 

PMML model path

Enter the directory in which to store the generated classifier model in the file system to be used.

For further information about the PMML format used by the Naive Bayes model, see http://www.dmg.org/v4-2-1/NaiveBayes.html.
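Note that this path receives a PMML document. For comparison only, the following sketch shows how a trained model would be persisted in plain Spark MLlib; the path is a hypothetical placeholder and this is not the component's own mechanism.

// Sketch only: generic MLlib model persistence, shown for comparison with
// the PMML document the component writes at the configured path.
model.save(sc, "hdfs://namenode:8020/user/models/naive_bayes") // hypothetical path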

 

Parquet model name

Enter the name you need to use for the classifier model.

Usage in Spark Batch Jobs

In a Talend Spark Batch Job, this component is used as an end component and requires an input link. The other components used along with it must be Spark Batch components too. They generate native Spark code that can be executed directly in a Spark cluster.
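For illustration, the following Scala sketch outlines the kind of computation such a Job performs, expressed with the RDD-based Spark MLlib API. The labeled variable and the 0.8/0.2 split are hypothetical, and Talend's generated code differs:

// Sketch only: training and applying a Naive Bayes classifier with MLlib.
import org.apache.spark.mllib.classification.NaiveBayes
import org.apache.spark.mllib.regression.LabeledPoint

// labeled: RDD[LabeledPoint] built from the input columns according to
// the Column type table (the Label column becomes the label, the other
// used columns become the feature vector).
val Array(training, test) = labeled.randomSplit(Array(0.8, 0.2))

val model = NaiveBayes.train(training, lambda = 1.0)

// Pair each test record's predicted class with its actual class.
val predictionAndLabel = test.map(p => (model.predict(p.features), p.label))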

Model evaluation

The parameters you need to set are free parameters, so their values may come from previous experiments, empirical guesses, or the like. They do not have optimal values that apply to all datasets.

Therefore, you need to train the classifier model you are generating with different sets of parameter values until you obtain the best Accuracy (ACC) score and the optimal Precision, Recall and F1-measure scores for each class (a sketch of computing these scores follows the list):

  • The Accuracy score varies from 0 to 1 and indicates how accurate a classification is. The closer an Accuracy score is to 1, the more accurate the corresponding classification.

  • The Precision score, also varying from 0 to 1, indicates how relevant the elements selected by the classification are to a given class.

  • The Recall score, also varying from 0 to 1, indicates how many of the relevant elements are selected.

  • The F1-measure score is the harmonic mean of the Precision score and the Recall score.
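For illustration, the following sketch computes these scores with Spark MLlib's MulticlassMetrics; predictionAndLabel pairs each predicted class with the actual one, as in the sketch in the Usage section above:

// Sketch only: Accuracy plus per-class Precision, Recall and F1-measure.
import org.apache.spark.mllib.evaluation.MulticlassMetrics

// predictionAndLabel: RDD[(Double, Double)] of (predicted, actual) classes.
val metrics = new MulticlassMetrics(predictionAndLabel)

// Accuracy: fraction of correctly classified elements.
val accuracy = predictionAndLabel.filter { case (p, l) => p == l }.count().toDouble /
  predictionAndLabel.count()
println(s"accuracy=$accuracy")

// Per-class Precision, Recall and F1-measure.
metrics.labels.foreach { l =>
  println(f"class $l: precision=${metrics.precision(l)}%.3f " +
    f"recall=${metrics.recall(l)}%.3f f1=${metrics.fMeasure(l)}%.3f")
}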

Log4j

These scores can be output to the console of the Run view when you execute the Job, provided that you have added the following code to the Log4j view in the [Project Settings] dialog box.

<!-- DataScience Logger -->
<logger name="org.talend.datascience.mllib" additivity="false">
  <level value="INFO"/>
  <appender-ref ref="CONSOLE"/>
</logger>

These scores are output along with other Log4j INFO-level information. If you want to prevent the irrelevant information from being output, you can, for example, raise the Log4j level of that information to WARN, but note that you need to keep this DataScience Logger at the INFO level.

If you are using a subscription-based version of the Studio, the activity of this component can be logged using the log4j feature. For more information on this feature, see Talend Studio User Guide.

For more information on the log4j logging levels, see the Apache documentation at http://logging.apache.org/log4j/1.2/apidocs/org/apache/log4j/Level.html.