tMatchModel properties for Apache Spark Batch - 7.0

Matching with machine learning

author
Talend Documentation Team
EnrichVersion
7.0
EnrichProdName
Talend Big Data Platform
Talend Data Fabric
Talend Data Management Platform
Talend Data Services Platform
Talend MDM Platform
Talend Real-Time Big Data Platform
EnrichPlatform
Talend Data Stewardship
Talend Studio

These properties are used to configure tMatchModel running in the Spark Batch Job framework.

The Spark Batch tMatchModel component belongs to the Data Quality family.

The component in this framework is available in all Talend Platform products with Big Data and in Talend Data Fabric.

Basic settings

Define a storage configuration component

Select the configuration component to be used to provide the configuration information for the connection to the target file system such as HDFS.

If you leave this check box clear, the target file system is the local system.

The configuration component to be used must be present in the same Job. For example, if you have dropped a tHDFSConfiguration component in the Job, you can select it to write the result in a given HDFS system.

Schema and Edit Schema

A schema is a row description. It defines the number of fields (columns) to be processed and passed on to the next component. When you create a Spark Job, do not use the reserved word "line" as a field name.

Click Sync columns to retrieve the schema from the previous component connected in the Job.

Click Edit schema to make changes to the schema. If the current schema is of the Repository type, three options are available:

  • View schema: choose this option to view the schema only.

  • Change to built-in property: choose this option to change the schema to Built-in for local changes.

  • Update repository connection: choose this option to change the schema stored in the repository and decide whether to propagate the changes to all the Jobs upon completion. To propagate the changes to the current Job only, select No upon completion and choose this schema metadata again in the [Repository Content] window.

Built-In: You create and store the schema locally for this component only. Related topic: see Talend Studio User Guide.

Repository: You have already created the schema and stored it in the Repository. You can reuse it in various projects and Job designs. Related topic: see Talend Studio User Guide.

Matching key

Select the columns on which you want to base the match computation.

Matching label column

Select the column from the input flow which holds the label you set manually on the suspect pairs of records.

If you select the Integration with Data Stewardship check box, this list does not appear. In this case, the matching label column is the TDS_ARBITRATION_LEVEL column, which holds the label(s) you set on the suspect pairs of records using Talend Data Stewardship.

Matching model location

Select the Save the model on file system check box and in the Folder field, set the path to the local folder where you want to generate the matching files.

If you want to store the model in a specific file system, for example S3 or HDFS, you must use the corresponding component in the Job and select the Define a storage configuration component check box in the component basic settings.

The button for browsing does not work with the Spark Local mode; if you are using the Spark Yarn or the Spark Standalone mode, ensure that you have properly configured the connection in a configuration component in the same Job, such as tHDFSConfiguration.

Integration with Data Stewardship

Select this check box to set the connection parameters to the Talend Data Stewardship server.

If you select this check box, tMatchModel uses the sample suspect records labeled in a Grouping campaign defined on the Talend Data Stewardship server, which means this component can be used as a standalone component.

Data Stewardship Configuration

  • URL:

    Enter the address to access the Talend Data Stewardship server suffixed with /data-stewardship/, for example http://<server_address>:19999/data-stewardship/.

    If you are working with Talend Cloud Data Stewardship, use one of the following addresses to access the application:

    • https://tds.us.cloud.talend.com/data-stewardship for the US data center.
    • https://tds.eu.cloud.talend.com/data-stewardship for the EU data center.
  • Username and Password:

    Enter the authentication information to the Talend Data Stewardship server.

  • Campaign:

    Displays the technical name of the campaign once it is selected in the basic settings. You can, however, replace the field value with a context parameter, for example, to pass context variables to the Job at runtime. This technical name is always used to identify the campaign when the Job communicates with Talend Data Stewardship, whatever the value in the Campaign field.

    Click Find a Campaign to open a dialog box which lists the Grouping campaigns on the server that you own or to which you have access rights.

    Click the refresh button to retrieve the campaign details from the Talend Data Stewardship server.

Advanced settings

Max token number for phonetic comparison

Set the maximum number of the tokens to be used in the phonetic comparison.

When the number of tokens exceeds what has been defined in this field, no phonetic comparison is done on the string.
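The token limit can be pictured with a minimal sketch. It assumes tokens are separated by whitespace and uses a hypothetical helper name, should_compare_phonetically; the tokenizer and the phonetic algorithm actually used by the component are not described here.

```python
def should_compare_phonetically(value: str, max_tokens: int) -> bool:
    """Return True when the string stays within the token limit.

    Assumption: tokens are split on whitespace; the component's real
    tokenizer may behave differently.
    """
    return len(value.split()) <= max_tokens


# A two-token name is compared; a five-token string is skipped when the limit is 3.
print(should_compare_phonetically("John Smith", 3))
print(should_compare_phonetically("one two three four five", 3))
```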

Random Forest hyper-parameters tuning

Number of trees range: Enter a range for the decision trees you want to build. Each decision tree is trained independently using a random sample of features.

Increasing this range can improve the accuracy by decreasing the variance in predictions, but will increase the training time.

Maximum tree-depth range: Enter a range for the decision tree depth at which the training should stop adding new nodes. New nodes represent further tests on features on internal nodes and possible class labels held by leaf nodes.

Generally speaking, a deeper decision tree is more expressive and thus potentially more accurate in predictions, but it is also more resource consuming and prone to overfitting.
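As an illustration of what tuning over two ranges involves, the sketch below enumerates candidate combinations. It assumes each range is given as inclusive integer bounds and that every combination is tried; the function name candidate_grid and the example bounds are hypothetical, and the component may explore the ranges differently.

```python
from itertools import product


def candidate_grid(trees_range, depth_range):
    """List (number_of_trees, max_depth) pairs from two inclusive (low, high) ranges.

    Assumption: integer bounds, step of 1, exhaustive enumeration.
    """
    trees = range(trees_range[0], trees_range[1] + 1)
    depths = range(depth_range[0], depth_range[1] + 1)
    return list(product(trees, depths))


# Hypothetical ranges: 5 to 7 trees, depth 3 to 4.
grid = candidate_grid((5, 7), (3, 4))
print(len(grid), grid[0], grid[-1])
```

The number of candidates is the product of the two range sizes, which is why wider ranges lengthen training.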

Set Checkpoint Interval

Set the frequency of checkpoints. It is recommended to leave the default value (10).

Before setting a value for this parameter, activate checkpointing and set the checkpoint directory in the Spark Configuration tab of the Run view.

For further information about checkpointing, see Logging and checkpointing the activities of your Apache Spark Job.

Cross-validation parameters

Number of folds: Enter the number of bins (folds) into which the dataset is split. The bins are used as separate training and test datasets.

Evaluation metric type: Select a type from the list. For further information, see Precision and recall.
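The following minimal sketch shows how folds can serve in turn as test data and how precision and recall are computed from labeled pairs. The round-robin fold assignment, the function names and the "YES"/"NO" labels are assumptions made for illustration; they are not taken from the component.

```python
def k_fold_indices(n_rows, n_folds):
    """Yield (train, test) row-index lists, using each fold once as the test set.

    Assumption: rows are assigned to folds round-robin rather than randomly.
    """
    folds = [list(range(i, n_rows, n_folds)) for i in range(n_folds)]
    for i, test in enumerate(folds):
        train = sorted(idx for j, fold in enumerate(folds) if j != i for idx in fold)
        yield train, test


def precision_recall(actual, predicted, positive="YES"):
    """Compute precision and recall for the positive label."""
    pairs = list(zip(actual, predicted))
    tp = sum(a == positive and p == positive for a, p in pairs)
    fp = sum(a != positive and p == positive for a, p in pairs)
    fn = sum(a == positive and p != positive for a, p in pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


splits = list(k_fold_indices(6, 3))
print(splits[0])
print(precision_recall(["YES", "YES", "NO", "NO"], ["YES", "NO", "YES", "NO"]))
```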

Random Forest parameters

Subsampling rate: Enter the numeric value to indicate the fraction of the input dataset used for training each tree in the forest. The default value 1.0 is recommended, meaning that the whole dataset is used to train each tree.

Subset Strategy: Select the strategy about how many features should be considered on each internal node in order to appropriately split this internal node (actually the training set or subset of a feature on this node) into smaller subsets. These subsets are used to build child nodes.

Each strategy takes a different number of features into account to find the optimal split point among these features. This point could be, for example, the value 35 of the feature age.

  • auto: This strategy is based on the number of trees in the forest, which you set through the Number of trees range field in the Random Forest hyper-parameters tuning area. This is the default strategy.

    If the number of trees is 1, the strategy is actually all; if this number is greater than 1, the strategy is sqrt.

  • all: The total number of features is considered for split.

  • sqrt: The number of features to be considered is the square root of the total number of features.

  • log2: The number of features to be considered is the result of log2(M), in which M is the total number of features.
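The four strategies above can be summarized in a short sketch that returns how many features are considered at each split. Rounding fractional results up, and considering at least one feature for log2, are assumptions; the function name features_per_split is hypothetical.

```python
import math


def features_per_split(strategy, num_features, num_trees):
    """Return the number of features considered at each node split.

    Assumption: fractional results are rounded up, with a minimum of 1.
    """
    if strategy == "auto":
        # auto resolves to all for a single tree, sqrt otherwise.
        strategy = "all" if num_trees == 1 else "sqrt"
    if strategy == "all":
        return num_features
    if strategy == "sqrt":
        return math.ceil(math.sqrt(num_features))
    if strategy == "log2":
        return max(1, math.ceil(math.log2(num_features)))
    raise ValueError(f"unknown strategy: {strategy}")


for name in ("auto", "all", "sqrt", "log2"):
    print(name, features_per_split(name, num_features=16, num_trees=10))
```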

Max Bins

Enter the numeric value to indicate the maximum number of bins used for splitting features.

The continuous features are automatically transformed to ordered discrete features.

Min Info gain

Enter the minimum information gain to be expected from a parent node to its child nodes. When the information gain of a split is less than this minimum, node splitting stops.

The default minimum information gain is 0.0, the value obtained when splitting a given node yields no further information. As a result, the splitting could be stopped.

For further information about how the information gain is calculated, see Impurity and Information gain from the Spark documentation.
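As a rough illustration, information gain is the impurity of the parent node minus the size-weighted impurity of its two child nodes. The sketch below uses Gini impurity on per-class row counts; the function names and the example counts are hypothetical.

```python
def gini(counts):
    """Gini impurity of a node, given the number of rows per class."""
    total = sum(counts)
    return 1.0 - sum((c / total) ** 2 for c in counts)


def information_gain(parent, left, right, impurity=gini):
    """Parent impurity minus the size-weighted impurity of the two children."""
    n = sum(parent)
    weighted = (sum(left) / n) * impurity(left) + (sum(right) / n) * impurity(right)
    return impurity(parent) - weighted


def keeps_splitting(gain, min_info_gain):
    """A split is kept unless its gain is less than the configured minimum."""
    return gain >= min_info_gain


# A perfect split of 5 matches and 5 non-matches, then a split that teaches nothing.
print(information_gain([5, 5], [5, 0], [0, 5]))
print(information_gain([4, 4], [2, 2], [2, 2]))
```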

Min instance per Node

Enter the minimum number of training instances a node should have to make it valid for further splitting.

The default value is 1, which means when a node has only 1 row of training data, it stops splitting.

Impurity

Select the measure used to select the best split from each set of splits.

  • gini: it is about how often an element could be incorrectly labelled in a split.

  • entropy: it is about how unpredictable the information in each split is.

For further information about how each of the measures is calculated, see Impurity measures from the Spark documentation.
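The two measures can be sketched from the class proportions in a node. The base-2 logarithm used for entropy is an assumption (a different base only rescales the value), and the function names are hypothetical.

```python
import math


def gini_impurity(counts):
    """Sum over classes of p * (1 - p), written as 1 - sum(p^2)."""
    total = sum(counts)
    return 1.0 - sum((c / total) ** 2 for c in counts)


def entropy_impurity(counts):
    """Negative sum over classes of p * log2(p); empty classes contribute nothing."""
    total = sum(counts)
    return -sum((c / total) * math.log2(c / total) for c in counts if c > 0)


# An evenly mixed node is maximally impure; a node holding one class is pure.
print(gini_impurity([5, 5]), entropy_impurity([5, 5]))
print(gini_impurity([10, 0]), entropy_impurity([10, 0]))
```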

Set a random seed

Enter the random seed number to be used for bootstrapping and choosing feature subsets.

Data Stewardship Configuration

This field appears only if you select the Integration with Data Stewardship check box in the Basic settings.

Campaign ID:

Displays the technical name of the campaign once it is selected in the basic settings. You can, however, replace the field value with a context parameter, for example, to pass context variables to the Job at runtime. This technical name is always used to identify the campaign when the Job communicates with Talend Data Stewardship, whatever the value in the Campaign field.

Batch Size: Specify the number of records to be processed in each batch.

Do not change the default value unless you are facing performance issues. Increasing the batch size can improve performance, but setting too high a value could cause Job failures.
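To picture what the batch size controls, the sketch below splits a list of records into consecutive batches. It is only an illustration with a hypothetical helper name; how the component actually exchanges records with Talend Data Stewardship is not described here.

```python
def batches(records, batch_size):
    """Yield consecutive slices of at most batch_size records."""
    for start in range(0, len(records), batch_size):
        yield records[start:start + batch_size]


# Five records with a batch size of 2 give three batches, the last one partial.
print(list(batches(["r1", "r2", "r3", "r4", "r5"], 2)))
```

A larger batch size means fewer, bigger batches, which is where both the performance gain and the failure risk come from.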

Use Timestamp format for Date type

Select this check box to output the dates, hours, minutes and seconds contained in your Date-type data. If you clear this check box, only years, months and days are output.
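The difference between the two settings can be illustrated as follows. The patterns and the function name render_date are hypothetical; the exact output format produced by the component is not specified here.

```python
from datetime import datetime


def render_date(value, use_timestamp):
    """Format a date with or without its time part (patterns are assumptions)."""
    pattern = "%Y-%m-%d %H:%M:%S" if use_timestamp else "%Y-%m-%d"
    return value.strftime(pattern)


sample = datetime(2018, 3, 5, 14, 30, 15)
print(render_date(sample, use_timestamp=True))
print(render_date(sample, use_timestamp=False))
```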

Usage

Usage rule

This component is used as an intermediate step.

This component, along with the Spark Batch component Palette it belongs to, appears only when you are creating a Spark Batch Job.

Spark Batch Connection

In the Spark Configuration tab in the Run view, define the connection to a given Spark cluster for the whole Job. In addition, since the Job expects its dependent jar files for execution, you must specify the directory in the file system to which these jar files are transferred so that Spark can access these files:
  • Yarn mode (Yarn client or Yarn cluster):
    • When using Google Dataproc, specify a bucket in the Google Storage staging bucket field in the Spark configuration tab.

    • When using HDInsight, specify the blob to be used for Job deployment in the Windows Azure Storage configuration area in the Spark configuration tab.

    • When using Altus, specify the S3 bucket or the Azure Data Lake store (technical preview) for Job deployment in the Spark configuration tab.
    • When using other distributions, use the configuration component corresponding to the file system your cluster is using. Typically, this system is HDFS, in which case you use tHDFSConfiguration.

  • Standalone mode: use the configuration component corresponding to the file system your cluster is using, such as tHDFSConfiguration or tS3Configuration.

This connection is effective on a per-Job basis.