tALSModel properties in Spark Batch Jobs - 6.1

Talend Components Reference Guide

Component family

Machine Learning / Recommendation

 

Basic settings

Define a storage configuration component

Select this check box, then select the configuration component that provides the connection information for the target file system, such as HDFS or S3.

If you leave this check box clear, the target file system is the local system.

Note that the configuration component must be present in the same Job. For example, if you have dropped a tHDFSConfiguration component in the Job, you can select it to write the result to a given HDFS system.

 

Feature table

Complete this table to map the input columns to the three factors required to compute the recommender model.

  • Input column: select the input column to be used from the drop-down list.

    The selected columns must contain the user IDs, the product IDs and the ratings, and all of these values must be numerical.

  • Feature type: select the factor to which each selected input column is mapped. The three factors are User_ID, Product_ID and Rating.

This mapping allows tALSModel to read the right type of data for each required factor.

Training percentage

Enter the percentage of the input data to be used to train the recommender model, expressed in decimal form (for example, 0.8 for 80%). The rest of the data is used to test the model.
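As an illustration of what a decimal training percentage means, the following Python sketch splits a list of rows into a training part and a test part. It only illustrates the idea: how tALSModel actually samples the data is not described here, and the function name, the seed and the sample rows are hypothetical.

```python
import random


def split_by_training_percentage(rows, training_percentage, seed=42):
    """Randomly split rows into (training, test) using a decimal fraction."""
    if not 0.0 < training_percentage < 1.0:
        raise ValueError("training_percentage must be a decimal between 0 and 1")
    shuffled = list(rows)
    # A fixed seed keeps the illustration reproducible.
    random.Random(seed).shuffle(shuffled)
    cut = int(round(len(shuffled) * training_percentage))
    return shuffled[:cut], shuffled[cut:]


# Hypothetical (user ID, product ID, rating) rows, all numerical.
ratings = [(u, p, float((u + p) % 5 + 1)) for u in range(10) for p in range(10)]

train, test = split_by_training_percentage(ratings, 0.8)
print(len(train), len(test))  # prints "80 20"
```

With 0.8, 80 of the 100 rows are used for training and the remaining 20 for testing.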

 

Number of latent factors

Enter the number of latent factors used to describe each user and each product.
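In a matrix factorization model such as ALS, each user and each product is represented by a vector with one value per latent factor, and a predicted rating is the dot product of the two vectors. The following Python sketch shows this with three latent factors. The vectors and the function name are hypothetical and are not values produced by tALSModel.

```python
def predict_rating(user_factors, product_factors):
    """Predicted rating = dot product of the user and product latent vectors."""
    if len(user_factors) != len(product_factors):
        raise ValueError("both vectors must have one value per latent factor")
    return sum(u * p for u, p in zip(user_factors, product_factors))


# Hypothetical latent vectors with 3 latent factors each.
user_vec = [1.0, 0.5, 2.0]
product_vec = [2.0, 4.0, 0.5]

predicted = predict_rating(user_vec, product_vec)
print(predicted)  # prints 5.0
```

A larger number of latent factors lets the model describe users and products in more detail, at the cost of more computation.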

 

Number of iterations

Enter the number of iterations you want the Job to perform to train the model.

Keep this number below 30 to avoid stack overflow issues. In practice, the RMSE score often converges before 30 iterations are reached.

However, if you need to perform more than 30 iterations, you must increase the stack size used to run the Job. To do this, add the -Xss argument, for example -Xss2048k, to the JVM Settings table in the Advanced settings tab of the Run view. For further information about the JVM Settings table, see the Talend Studio User Guide.

 

Regularization factor

Enter the regularization number you want to use to avoid overfitting.

 

Build model for implicit feedback data set

Select this check box to enable tALSModel to handle implicit data sets.

In contrast to an explicit data set, such as the ratings of a product, an implicit data set only implies users' preferences, for example, a record showing how frequently a user buys a certain item.

If you leave this check box clear, tALSModel handles explicit data sets only.

For details about how the ALS model handles implicit data sets, see the Spark documentation at https://spark.apache.org/docs/latest/mllib-collaborative-filtering.html.

 

Confidence coefficient for implicit training

Enter the number to indicate the level of confidence you have in the observed user preferences.
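To give an intuition for this coefficient, the following Python sketch uses a common formulation from the implicit-feedback literature that the Spark documentation refers to, in which the confidence in an observation grows linearly with the observed value. This is an assumption made for illustration. The exact computation performed by the component is not described here, and the function and parameter names are hypothetical.

```python
def confidence(observation, alpha):
    """Assumed formulation: confidence = 1 + alpha * observation."""
    return 1.0 + alpha * observation


# Hypothetical purchase counts for one user, with a hypothetical coefficient.
alpha = 40.0
for count in (0, 1, 3):
    print(count, confidence(count, alpha))
```

Under this assumption, an item that was never bought keeps the baseline confidence of 1, and each additional observation raises the confidence by the value of the coefficient.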

 

Parquet model path

Enter the directory, in the file system to be used, in which to store the generated recommender model.

 

Parquet model name

Enter the name to use for the recommender model.

Usage in Spark Batch Jobs

In a Talend Spark Batch Job, this component is used as an end component and requires an input link. The other components used along with it must also be Spark Batch components. They generate native Spark code that can be executed directly in a Spark cluster.

Note that the parameters you need to set are free parameters, so their values may come from previous experiments, empirical guesses or the like. No values are optimal for all datasets. Therefore, you need to train the model with different sets of parameter values until you obtain the minimum RMSE score. This score is output to the console of the Run view each time the Job is executed.
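The RMSE (root-mean-square error) score measures how far the predicted ratings are from the actual ratings of the test data, so a lower score is better. The following Python sketch shows the standard RMSE formula and how the run with the lowest score would be picked. The ratings, the predictions and the run names are hypothetical placeholders, not output of tALSModel.

```python
import math


def rmse(actual, predicted):
    """Root-mean-square error between two equally long lists of ratings."""
    if not actual or len(actual) != len(predicted):
        raise ValueError("both lists must be non-empty and of equal length")
    squared_errors = [(a - p) ** 2 for a, p in zip(actual, predicted)]
    return math.sqrt(sum(squared_errors) / len(actual))


# Hypothetical test ratings and the predictions of two hypothetical runs,
# each run standing for a different set of parameter values.
actual = [4.0, 3.0, 5.0, 1.0]
candidates = {
    "run_a": [3.5, 3.0, 4.0, 2.0],
    "run_b": [4.0, 2.5, 5.0, 1.5],
}

scores = {name: rmse(actual, preds) for name, preds in candidates.items()}
best = min(scores, key=scores.get)
print(best)  # prints "run_b"
```

In practice, you would compare the scores reported in the console after each Job execution rather than compute them yourself.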

MLlib installation

In Apache Spark V1.3 or earlier versions, the Spark machine learning library, MLlib, uses the gfortran runtime library. You need to ensure that this library is present on every node of the Spark cluster to be used.

For further information about MLlib and this library, see the related documentation from Spark.

Log4j

The RMSE scores are output to the console of the Run view when you execute the Job, provided that you have added the following code to the Log4j view in the [Project Settings] dialog box.

<!-- DataScience Logger -->
<logger name="org.talend.datascience.mllib" additivity="false">
    <level value="INFO" />
    <appender-ref ref="CONSOLE" />
</logger>

These scores are output along with the other Log4j INFO-level information. To suppress the irrelevant information, you can, for example, change the Log4j level of that information to WARN, but you must keep the level of this DataScience Logger code at INFO.

If you are using a subscription-based version of the Studio, the activity of this component can be logged using the log4j feature. For more information on this feature, see the Talend Studio User Guide.

For more information on the log4j logging levels, see the Apache documentation at http://logging.apache.org/log4j/1.2/apidocs/org/apache/log4j/Level.html.