Configuration wizard

The configuration wizard enables you to create different production environments, Configurations, and their match rules.

You can also use the configuration wizard to import match rules created and tested in Talend Studio and use them in your match Jobs. For further information, see Importing match rules from the repository.

You can not open the configuration wizard unless you link the input component to the tMatchGroup component.

Opening the configuration wizard

Procedure

In the Talend Studio workspace, design your Job and link the components together, for example as below:
To configure tMatchGroup, do one of the following:
- Double-click tMatchGroup or right-click it and from the contextual menu select Configuration Wizard.
- In the Basic settings view of tMatchGroup, click Preview.
If you want to open the Configuration wizard without running the match rules defined in it, in the popup that opens, click Skip Computation.

Results

The configuration wizard opens and is composed of three areas:

The Configuration view, where you can set the match rules and the blocking columns.
The matching chart, which presents the graphic matching result.
The matching table, which presents the details of the matching result.

The Limit field at the upper-left corner indicates the maximum number of rows to be processed by the match rule(s) in the wizard. The by-default maximum row number is 1000.

Configuration view

From this view, you can edit the configuration of the tMatchGroup component or define different configurations in which to execute the Job.

About this task

You can use these different configurations for testing purposes for example, but you can only save one configuration from the wizard, the open configuration.

In each configuration, you can define the parameters to generate match rules with the VSR or the T-Swoosh algorithm. The settings of the Configuration view differ slightly depending if you select Simple VSR or T-Swoosh in the basic settings of the tMatchGroup component.

You can define survivorship rules, blocking key(s) and multiple conditions using several match rules. You can also set different match intervals for each rule. The match results on multiple conditions will list data records that meet any of the defined rules. When a configuration has multiple conditions, the Job conducts an OR match operation. It evaluates data records against the first rule and the records that match are not evaluated against the other rules.

The parameters required to edit or create a match rule are:

The Key definition parameters.
The Match Threshold field.
A blocking key in the Blocking Selection table (available only for rules with the VSR algorithm).

Defining a blocking key is not mandatory but advisable as it partitions data in blocks to reduce the number of records that need to be examined. For further information about the blocking key, see Importing match rules from the repository.
The Survivorship Rules for Columns parameters (available only for rules with the T-Swoosh algorithm).
The Default Survivorship Rules parameters for data types (available only for rules with the T-Swoosh algorithm).

For an example of a match rule with the T-Swoosh algorithm, see Using survivorship functions to merge two records and create a master record.

Procedure

In the basic settings of the tMatchGroup component, select Simple VSR from the Matching Algorithm list.
It is important to have the same type of the matching algorithm selected in the basic settings of the component and defined in the configuration wizard. Otherwise the Job runs with default values for the parameters which are not compatible between the two algorithms.
In the basic settings of the tMatchGroup component, click Preview to open the configuration wizard.
Click the [+] button on the top right corner of the Configuration view.
This creates, in a new tab, an exact copy of the last configuration.
Edit or set the parameters for the new configuration in the Key definition and Blocking Selection tables.
If needed, define several match rules for the open configuration as the following:
1. Click the [+] button on the match rule bar to create an exact copy of the last rule in a new tab.
2. Set the parameters for the new rule in the Key definition table and define its match interval.
3. Follow the steps above to create as many match rules for a configuration as needed.
  You can define a different match interval for each rule.
When a configuration has multiple conditions, the Job conducts an OR match operation. It evaluates data records against the first rule and the records that match are not evaluated against the second rule and so on.
Click the Chart button at the top right corner of the wizard to execute the Job in the open configuration.
The matching results are displayed in the matching chart and table.

Follow the steps above to create as many new configuration in the wizard as needed.
To execute the Job in a specific configuration, open the configuration in the wizard and click the Chart button.
The matching results are displayed in the matching chart and table.
At the bottom right corner of the wizard, click either:
- OK to save the open configuration.
  
  You can save only one configuration in the wizard.
- Cancel to close the wizard and keep the configuration saved initially in the wizard.

Matching chart

From the matching chart, you can have a global picture about the duplicates in the analyzed data.

About this task

Important: When you are using the Apache Spark Batch component, you can view the matching chart only when the data are loaded from a local file or database.

The Hide groups less than parameter, which is set to 2 by default, enables you to decide what groups to show in the result chart. Usually you want to hide groups of small group size.

For example, the above matching chart indicates that:

46 items are analyzed and classified into 17 groups according to a given match rule and after excluding items that are unique, by setting the Hide groups less than parameter to 2.
9 groups have 2 items each. In each group, the 2 items are duplicates of each other.
5 groups have 3 items each. In each group, these items are duplicates of one another.
2 groups have 4 items each. In each group, these items are duplicates of one another.
One single group has 5 duplicate items.

Matching table

From the matching table, you can read details about the different duplicates.

About this task

Important: When you are using the Apache Spark Batch component, you can view the matching table only when the data are loaded from a local file or database.

This table indicates the matching details of items in each group and colors the groups in accordance with their color in the matching chart.

You can decide what groups to show in this table by setting the Hide groups of less than parameter. This parameter enables you to hide groups of small group size. It is set to 2 by default.

The buttons under the table helps you to navigate back and forth through pages.

Did this page help you?

If you find any issues with this page or its content – a typo, a missing step, or a technical error – please let us know!

Leave your feedback here