tMatchGroup - 6.3

Talend Components Reference Guide

EnrichVersion
6.3
EnrichProdName
Talend Big Data
Talend Big Data Platform
Talend Data Fabric
Talend Data Integration
Talend Data Management Platform
Talend Data Services Platform
Talend ESB
Talend MDM Platform
Talend Open Studio for Big Data
Talend Open Studio for Data Integration
Talend Open Studio for Data Quality
Talend Open Studio for ESB
Talend Open Studio for MDM
Talend Real-Time Big Data Platform
task
Data Governance
Data Quality and Preparation
Design and Development
EnrichPlatform
Talend Studio

Function

tMatchGroup compares columns in both standard and Map/Reduce input data flows using matching methods, and groups the similar duplicate records it finds together.

Several tMatchGroup components can be used sequentially to match data against different blocking keys. Each pass creates data partitions that overlap the blocks of the previous pass, which refines the groups received by each successive tMatchGroup component.

Within each group, the first processed record becomes the master record of the group. Every other record is compared with the master records, and the computed distances determine which master record, and therefore which group, it is assigned to.

Purpose

This component helps you to create groups of similar data records in any source data including large volumes of data by using one or several match rules.

Depending on the Talend solution you are using, this component can be used in one, some or all of the following Job frameworks:

  • Standard: see tMatchGroup properties.

    The component in this framework is available when you have subscribed to one of the Talend Platform products or Talend Data Fabric.

  • MapReduce: see tMatchGroup in Talend Map/Reduce Jobs.

    The component in this framework is available when you have subscribed to any Talend Platform product with Big Data or Talend Data Fabric.

tMatchGroup properties

Component family

Data Quality

 

Basic settings

Schema and Edit schema

A schema is a row description; it defines the number of fields to be processed and passed on to the next component. The schema is either Built-in or stored remotely in the Repository.

Click Sync columns to retrieve the schema from the previous component in the Job.

The output schema of this component contains the following read-only fields:

- GID: provides a group identifier of the data type String.

Note

Jobs containing tMatchGroup that were migrated from older releases into your current Studio may provide a group identifier of the data type Long. If you want a group identifier of the data type String, replace the tMatchGroup component in these Jobs with tMatchGroup from the Studio Palette.

- GRP_SIZE: counts the number of records in the group, computed only on the master record.

- MASTER: identifies, by true or false, if the record used in the matching comparisons is a master record. There is only one master record per group.

Each input record is compared with the master record; if they match, the input record joins that group.

- SCORE: measures the distance between the input record and the master record according to the matching algorithm used.

When the tMatchGroup component is configured with multiple output flows, the score in this column determines which output flow a record is routed to.

- GRP_QUALITY: provides the quality of similarities in the group by taking the minimal matching value. Only the master record has a quality score.

- MERGE_INFO: this output column is available only when there is more than one tMatchGroup component in the Job. It indicates, with true or false, whether the record was a master record in the first pass.

 

 

Built-in: You create and store the schema locally for this component only. Related topic: see Talend Studio User Guide.

 

 

Repository: You have already created and stored the schema in the Repository. You can reuse it in other projects and job designs. Related topic: see Talend Studio User Guide.

 

Matching Algorithm

Select from the list the algorithm you want to use in the component: Simple VSR or T-Swoosh.

PREVIEW

This button opens a configuration wizard that enables you to define production environments and their match rules, or to import match rules from the studio repository. For further information, see Configuration wizard.

Click the import icon to import a match rule from the Studio repository.

In the Match Rule Selector wizard, import a match rule that uses the same algorithm as the matching algorithm selected in the basic settings of the component. Otherwise, the Job runs with default values for the parameters that are not compatible between the Simple VSR and T-Swoosh algorithms.

For further information about how to import rules, see Importing match rules from the studio repository.

  Key Definition

Input Key Attribute

Select the column(s) from the input flow on which you want to apply a matching algorithm.

Note

When you select a date column on which to apply an algorithm or a matching algorithm, you can decide what to compare in the date format.

For example, if you want to only compare the year in the date, in the component schema set the type of the date column to Date and then enter "yyyy" in the Date Pattern field. The component then converts the date format to a string according to the pattern defined in the schema before starting a string comparison.

 

 

Matching Function

Select a matching algorithm from the list:

Exact: matches each processed entry to all possible reference entries with exactly the same value. It returns 1 when the two strings exactly match, otherwise it returns 0.

Exact - ignore case: matches each processed entry to all possible reference entries with exactly the same value while ignoring the value case.

Soundex: matches processed entries according to a standard English phonetic algorithm. It indexes strings by sound, as pronounced in English, for example "Hello": "H400".
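
As an illustration of the kind of index Soundex produces, here is a minimal Python sketch of the classic American Soundex algorithm (an illustration only, not Talend's own implementation):

```python
def soundex(name: str) -> str:
    """Classic American Soundex: first letter plus three digits."""
    codes = {}
    for digit, letters in [("1", "bfpv"), ("2", "cgjkqsxz"), ("3", "dt"),
                           ("4", "l"), ("5", "mn"), ("6", "r")]:
        for ch in letters:
            codes[ch] = digit
    name = name.lower()
    result = [name[0].upper()]
    prev = codes.get(name[0], "")
    for ch in name[1:]:
        code = codes.get(ch, "")
        if code and code != prev:
            result.append(code)
        # h and w keep the previous code alive; vowels reset it
        if ch not in "hw":
            prev = code
    return "".join(result)[:4].ljust(4, "0")

print(soundex("Hello"))   # H400, as in the example above
print(soundex("Robert"))  # R163
```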

Levenshtein (edit distance): calculates the minimum number of edits (insertion, deletion or substitution) required to transform one string into another. Using this algorithm in the tMatchGroup component, you do not need to specify a maximum distance. The component automatically calculates a matching percentage based on the distance. This matching score will be used for the global matching calculation, based on the weight you assign in the Confidence Weight field.
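
How an edit distance becomes a matching percentage is not spelled out above. A common normalization, used in this sketch for illustration, divides the distance by the length of the longer string; the exact formula the component uses may differ:

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of single-character edits to turn a into b."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                    # deletion
                           cur[j - 1] + 1,                 # insertion
                           prev[j - 1] + (ca != cb)))      # substitution
        prev = cur
    return prev[-1]

def levenshtein_similarity(a: str, b: str) -> float:
    """Normalize the distance into a [0, 1] matching score (assumed formula)."""
    if not a and not b:
        return 1.0
    return 1.0 - levenshtein(a, b) / max(len(a), len(b))

print(levenshtein("kitten", "sitting"))      # 3
print(levenshtein_similarity("Doe", "Doe"))  # 1.0
```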

Metaphone: based on a phonetic algorithm for indexing entries by their pronunciation. It first loads the phonetics of all entries of the lookup reference, then checks all entries of the main flow against the entries of the reference flow.

Double Metaphone: a newer version of the Metaphone phonetic algorithm that produces more accurate results than the original. It can return both a primary and a secondary code for a string, which accounts for some ambiguous cases as well as for multiple variants of surnames with common ancestry.

Soundex FR: matches processed entries according to a standard French phonetic algorithm.

Jaro: matches processed entries according to spelling deviations. It counts the number of matching characters between two strings. The higher the value, the more similar the strings are.

Jaro-Winkler: a variant of Jaro, but it gives more importance to the beginning of the string.

Fingerprint key: matches entries after doing the following sequential process:

  1. remove leading and trailing whitespace,

  2. change all characters to their lowercase representation,

  3. remove all punctuation and control characters,

  4. split the string into whitespace-separated tokens,

  5. sort the tokens and remove duplicates,

  6. join the tokens back together,

    Because the string parts are sorted, the given order of tokens does not matter. So, "Cruise, Tom" and "Tom Cruise" both end up with the fingerprint "cruise tom" and therefore end up in the same cluster.

  7. normalize extended western characters to their ASCII representation, for example gödel to godel.

    This reproduces data entry mistakes made when entering extended characters with an ASCII-only keyboard. However, this step can also lead to false positives: for example, "gödel" and "godél" would both end up with "godel" as their fingerprint even though they are likely to be different names. So this might work less effectively for data sets where extended characters play a substantial differentiation role.
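
The seven steps above can be sketched in Python as follows (an illustration; the component's own implementation may differ in details, such as exactly which characters count as punctuation or control characters):

```python
import string
import unicodedata

def fingerprint(value: str) -> str:
    """Fingerprint key: the steps listed above, applied in order."""
    value = value.strip()                                               # 1. trim
    value = value.lower()                                               # 2. lowercase
    value = value.translate(str.maketrans("", "", string.punctuation))  # 3. punctuation
    tokens = value.split()                                              # 4. tokenize
    tokens = sorted(set(tokens))                                        # 5. sort, dedupe
    value = " ".join(tokens)                                            # 6. join
    value = unicodedata.normalize("NFKD", value)                        # 7. ASCII-fold
    return value.encode("ascii", "ignore").decode("ascii")

print(fingerprint("Cruise, Tom"))   # cruise tom
print(fingerprint(" Tom Cruise "))  # cruise tom
print(fingerprint("gödel"))         # godel
```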

q-grams: matches processed entries by dividing strings into blocks of q letters (q-grams). The matching result is given as the number of shared q-grams over the number of possible q-grams.
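
As an illustration, the following sketch scores two strings by their shared bigrams. The exact normalization the component uses is not documented here, so a Jaccard-style ratio over distinct grams is assumed:

```python
def qgrams(s: str, q: int = 2) -> set:
    """All distinct substrings of length q."""
    return {s[i:i + q] for i in range(len(s) - q + 1)}

def qgram_similarity(a: str, b: str, q: int = 2) -> float:
    """Shared q-grams over all possible q-grams (assumed normalization)."""
    ga, gb = qgrams(a, q), qgrams(b, q)
    if not ga and not gb:
        return 1.0
    return len(ga & gb) / len(ga | gb)

# "john" -> {jo, oh, hn}; "johnny" -> {jo, oh, hn, nn, ny}; 3 shared of 5
print(qgram_similarity("john", "johnny"))  # 0.6
```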

Hamming: calculates the minimum number of substitutions required to transform one string into another string having the same length. For example, the Hamming distance between "masking" and "pairing" is 3.
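
The Hamming distance amounts to a position-by-position comparison, as in this sketch:

```python
def hamming(a: str, b: str) -> int:
    """Number of positions at which two equal-length strings differ."""
    if len(a) != len(b):
        raise ValueError("Hamming distance requires equal-length strings")
    return sum(ca != cb for ca, cb in zip(a, b))

print(hamming("masking", "pairing"))  # 3, as in the example above
```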

custom...: enables you to load an external matching algorithm from a Java library using the custom Matcher column.

For further information about how to load an external Java library, see tLibraryLoad.

For further information about how to create a custom matching algorithm, see Creating a custom matching algorithm.

For a related scenario about how to use a custom matching algorithm, see Scenario 2: Using a custom matching algorithm to match entries.

 

Custom Matcher

When you select Custom as the matching type, enter the path pointing to the custom class (external matching algorithm) you need to use. This is the fully qualified class name inside the library file (.jar file), which you can import using the tLibraryLoad component.

For example, to use a MyDistance.class class stored in the directory org/talend/mydistance in a user-defined mydistance.jar library, the path to be entered is org.talend.mydistance.MyDistance.

 

 

Threshold

This column is displayed when you select T-Swoosh as the matching algorithm.

Set a threshold between 0 and 1. 0 means that the similarity between values in the column is not measured; 1 means that two compared values of the column must match exactly. The default value is 1.

 

 

Tokenized measure

Tokenization is the concept of splitting a string into words. Select the method to use to compute a tokenized measure for the selected algorithm:

NO: no tokenization method is used on the string. With this option, "John Doe" and "Jon Doe" should match.

Same place: splits the two strings by words into two lists, list1 and list2, and associates each element of list1 with the element at the same position in list2. Using this method, "She is red and he is pink" and "Catherine is red and he is pink" should match.

Same order: splits the two strings by words into two lists, list1 and list2, and assumes that list1 is shorter than list2. It tries to associate the elements of list1 with the elements of list2 taken in the same order. Using this method, "John Doe" and "John B. Doe" match.

This method should be used only with strings that contain few words; otherwise the number of possible combinations can be large.

Any order: splits the two strings by words into two lists, list1 and list2, and assumes that list1 is shorter than list2. It tries to assign each word of list1 to a word of list2 so as to obtain the highest global similarity (with respect to the similarity measure used).

Using this method, "John Doe" and "Doe John" match.
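
The "Same place" method can be sketched as follows. Averaging the per-word similarities over the longer word count is an assumption made for illustration, and the per-word measure here is plain equality rather than one of the algorithms above:

```python
def same_place_score(s1: str, s2: str,
                     measure=lambda a, b: float(a == b)) -> float:
    """'Same place': pair words by position and average the per-word
    similarities (averaging is an assumption for illustration)."""
    w1, w2 = s1.split(), s2.split()
    n = max(len(w1), len(w2))
    # zip truncates, so unmatched trailing words contribute 0 via n
    return sum(measure(a, b) for a, b in zip(w1, w2)) / n

# 6 of the 7 word positions agree
print(same_place_score("She is red and he is pink",
                       "Catherine is red and he is pink"))
```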

 

 

Weight

Set a numerical weight for each attribute (column) of the key definition. Any value greater than or equal to 0 is allowed; attributes with higher weights contribute more to the global match score.

 

Handle Null

To handle null values, select from the list the null operator you want to use on the column:

Null Match Null: a Null attribute only matches another Null attribute.

Null Match None: a Null attribute never matches another attribute.

Null Match All: a Null attribute matches any other value of an attribute.

For example, suppose there are two columns, name and firstname, where name is never null but firstname can be null.

If we have two records:

"Doe", "John"

"Doe", ""

Depending on the operator you choose, these two records may or may not match:

Null Match Null: they do not match.

Null Match None: they do not match.

Null Match All: they match.

And for the records:

"Doe", ""

"Doe", ""

Null Match Null: they match.

Null Match None: they do not match.

Null Match All: they match.
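
The three null operators can be summarized in code. In this sketch an empty string is treated as null, mirroring the records in the example above; the non-null path falls through to whatever attribute measure is in use (plain equality here):

```python
from enum import Enum

class NullOp(Enum):
    NULL_MATCH_NULL = 1
    NULL_MATCH_NONE = 2
    NULL_MATCH_ALL = 3

def attribute_match(a, b, op: NullOp,
                    measure=lambda x, y: float(x == y)) -> float:
    """Apply the selected null operator before the real measure."""
    a_null = a in (None, "")
    b_null = b in (None, "")
    if a_null or b_null:
        if op is NullOp.NULL_MATCH_ALL:
            return 1.0
        if op is NullOp.NULL_MATCH_NULL:
            return 1.0 if (a_null and b_null) else 0.0
        return 0.0  # NULL_MATCH_NONE
    return measure(a, b)

# The firstname comparisons from the examples above:
print(attribute_match("John", "", NullOp.NULL_MATCH_NULL))  # 0.0
print(attribute_match("", "", NullOp.NULL_MATCH_NULL))      # 1.0
print(attribute_match("John", "", NullOp.NULL_MATCH_ALL))   # 1.0
```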

Match Threshold

Enter the match probability threshold. Two data records match when their computed match score is above this value.

You can enter a different match threshold for each match rule.
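
Conceptually, the per-attribute scores are combined using the weights from the Key definition table, and the result is compared with the match threshold. The weighted average below is an assumption for illustration; the component's exact aggregation formula is not documented here:

```python
def global_score(scores, weights):
    """Weighted average of per-attribute similarity scores (assumed)."""
    return sum(s * w for s, w in zip(scores, weights)) / sum(weights)

def is_match(scores, weights, match_threshold: float) -> bool:
    return global_score(scores, weights) > match_threshold

# lname matched exactly, fname close, address weaker (illustrative values);
# lname carries double weight
print(global_score([1.0, 0.9, 0.7], [2, 1, 1]))          # 0.9
print(is_match([1.0, 0.9, 0.7], [2, 1, 1], 0.85))        # True
```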

Blocking Selection

Input Column

If required, select the column(s) from the input flow by which you want to partition the processed data into blocks; this is usually referred to as "blocking".

Blocking reduces the number of record pairs that need to be examined. The input data is partitioned into exhaustive blocks designed to increase the proportion of matches observed while decreasing the number of pairs to compare. Comparisons are restricted to record pairs within each block.

Using blocking column(s) is very useful when you are processing very large data sets.
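
The effect of blocking can be sketched as follows: records are bucketed by a blocking-key value and only pairs within the same bucket are compared (the column names are illustrative):

```python
from collections import defaultdict
from itertools import combinations

def blocked_pairs(records, key):
    """Yield only the record pairs that share a blocking-key value."""
    blocks = defaultdict(list)
    for rec in records:
        blocks[key(rec)].append(rec)
    for block in blocks.values():
        yield from combinations(block, 2)

customers = [
    {"name": "John Doe", "postal_code": "75015"},
    {"name": "Jon Doe", "postal_code": "75015"},
    {"name": "Jane Roe", "postal_code": "31000"},
]
pairs = list(blocked_pairs(customers, key=lambda r: r["postal_code"]))
print(len(pairs))  # 1 pair instead of the 3 pairs without blocking
```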

Advanced settings

Store on disk

Select the Store on disk check box if you want to store processed data blocks on the disk to maximize system performance.

Max buffer size: Type in the size of physical memory you want to allocate to processed data.

Temporary data directory path: Set the location where the temporary file should be stored.

 

Multiple output

Select the Separate output check box to have three different output flows:

-Uniques: when the group score (minimal distance computed in the record) is equal to 1, the record is listed in this flow.

-Matches: when the group score (minimal distance computed in the record) is higher than the threshold you define in the Confidence threshold field, the record is listed in this flow.

-Suspects: when the group score (minimal distance computed in the record) is below the threshold you define in the Confidence threshold field, the record is listed in this flow.

Confident match threshold: set a numerical value between the current Match threshold and 1. Above this threshold, you can be confident in the quality of the group.
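
The routing described above can be sketched as follows; how a score exactly at the threshold is handled is an assumption for illustration:

```python
def route(group_score: float, confident_threshold: float) -> str:
    """Route a record by its group score (the minimal distance
    computed in its group) into one of the three output flows."""
    if group_score == 1.0:
        return "Uniques"
    if group_score > confident_threshold:
        return "Matches"
    return "Suspects"

print(route(1.0, 0.9))   # Uniques
print(route(0.95, 0.9))  # Matches
print(route(0.87, 0.9))  # Suspects
```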

 

Multi-pass

Select this check box to enable a tMatchGroup component to receive data sets from another tMatchGroup that precedes it in the Job. This will refine the groups received by each of the tMatchGroup components through creating data partitions based on different blocking keys.

For an example Job, see Scenario 2: Matching customer data through multiple passes.

With multi-pass matching, all master records are generated but intermediate master records are removed from the output flow. Only final master and original records are kept at the end.

Propagate original values: This option is available only with the T-Swoosh algorithm. Select this check box to allow the original records from each pass (and not only the unmatched records) to also be considered in the second pass of matching, both against each other and against the survived masters. This helps to make sure that no matches are missed.

 

Sort the output data by GID

Select this check box to group the output data by the group identifier.

The output is sorted in ascending alphanumeric order by group identifier.

 

Output distance details

Select this check box to add an output column MATCHING_DISTANCES in the schema of the component. This column provides the distance between the input and master records in each group.

Note

When you use two tMatchGroup components in a Job and want to use the Output distance details option, you must select this check box in both components before linking them together. If the components are already linked, select the check box in the second component in the Job flow first and then in the first component; otherwise two columns with the same name may appear in the output schema. Selecting this option in only one tMatchGroup is not useful and may cause schema mismatch issues.

 

Display detailed labels

Select this check box to have in the output MATCHING_DISTANCES column not only the matching distance but also the names of the columns used as key attributes in the applied rule.

For example, if you try to match on first name and last name fields, lname and fname, the output would be fname:1.0|lname:0.97 when the check box is selected and 1.0|0.97 when it is not selected.

 

Deactivate matching computation when opening the wizard

Select this check box to open the Configuration Wizard without running the match rules defined in the wizard.

Skipping the computation makes the wizard open faster; otherwise, it may take some time to open while the match rules run.

 

tStatCatcher Statistics

Select this check box to collect log data at the component level. Note that this check box is not available in the Map/Reduce version of the component.

Global Variables

ERROR_MESSAGE: the error message generated by the component when an error occurs. This is an After variable and it returns a string. This variable functions only if the Die on error check box is cleared, if the component has this check box.

A Flow variable functions during the execution of a component while an After variable functions after the execution of the component.

To fill a field or expression with a variable, press Ctrl + Space to access the variable list and choose the variable to use from it.

For further information about variables, see Talend Studio User Guide.

Usage

This component is an intermediary step. It requires an input flow as well as an output flow.

Limitation/prerequisite

n/a

Configuration wizard

The configuration wizard enables you to create different production environments, Configurations, and their match rules. You can also use the configuration wizard to import match rules created and tested in the studio and use them in your match Jobs. For further information, see Importing match rules from the studio repository.

You cannot open the configuration wizard unless you link the input component to the tMatchGroup component.

To open the configuration wizard:

  1. In the studio workspace, design your Job and link the components together, for example as below:

  2. Either:

    • Double-click tMatchGroup, or

    • Right-click it and from the contextual menu select Configuration Wizard, or

    • Click Preview in the basic settings view of tMatchGroup.

  3. In the popup that opens, click Skip Computation if you want to open the Configuration Wizard without running the match rules defined in it.

The configuration wizard is composed of three areas:

  • the Configuration view, where you can set the match rules and the blocking column(s).

  • the matching chart, which presents the graphic matching result,

  • the matching table, which presents the details of the matching result.

The Limit field at the upper-left corner indicates the maximum number of rows to be processed by the match rule(s) in the wizard. The default maximum is 1000 rows.

Configuration view

From this view, you can edit the configuration of the tMatchGroup component or define different configurations in which to execute the Job. You can use these different configurations for testing purposes, for example, but you can save only one configuration from the wizard: the one currently open.

In each configuration, you can define the parameters to generate match rules with the VSR or the T-Swoosh algorithm. The settings of the Configuration view differ slightly depending on whether you select Simple VSR or T-Swoosh in the basic settings of the tMatchGroup component.

You can define survivorship rules, blocking key(s) and multiple conditions using several match rules, and you can set a different match interval for each rule. The match results on multiple conditions list the data records that meet any of the defined rules. When a configuration has multiple conditions, the Job conducts an OR match operation: it evaluates data records against the first rule, and the records that match are not evaluated against the other rules.

The parameters required to edit or create a match rule are:

  • The Key definition parameters.

  • The Match Threshold field.

  • A blocking key in the Blocking Selection table (available only for rules with the VSR algorithm).

    Defining a blocking key is not mandatory but advisable as it partitions data in blocks to reduce the number of records that need to be examined. For further information about the blocking key, see Importing match rules from the studio repository.

  • The Default Survivorship Rules parameters (available only for rules with the T-Swoosh algorithm).

To create a new configuration and new match rules with the VSR algorithm from the configuration wizard, do the following:

  1. In the basic settings of the tMatchGroup component, select Simple VSR from the Matching Algorithm list.

    It is important that the matching algorithm selected in the basic settings of the component matches the one defined in the configuration wizard. Otherwise, the Job runs with default values for the parameters that are not compatible between the two algorithms.

  2. In the basic settings of the tMatchGroup component, click Preview to open the configuration wizard.

  3. Click the [+] button on the top right corner of the Configuration view.

    This creates, in a new tab, an exact copy of the last configuration.

  4. Edit or set the parameters for the new configuration in the Key definition and Blocking Selection tables.

  5. If needed, define several match rules for the open configuration as follows:

    • Click the [+] button on the match rule bar.

      This creates, in a new tab, an exact copy of the last rule.

    • Set the parameters for the new rule in the Key definition table and define its match interval.

    • Follow the steps above to create as many match rules for a configuration as needed. You can define a different match interval for each rule.

      When a configuration has multiple conditions, the Job conducts an OR match operation. It evaluates data records against the first rule and the records that match are not evaluated against the second rule and so on.

  6. Click the Chart button at the top right corner of the wizard to execute the Job in the open configuration.

    The matching results are displayed in the matching chart and table.

    Follow the steps above to create as many new configurations in the wizard as needed.

  7. To execute the Job in a specific configuration, open the configuration in the wizard and click the Chart button.

    The matching results are displayed in the matching chart and table.

  8. At the bottom right corner of the wizard, click either:

    • OK to save the open configuration.

      You can save only one configuration in the wizard.

    • Cancel to close the wizard and keep the configuration saved initially in the wizard.

For an example of a match rule with the T-Swoosh algorithm, see Scenario 3: Using survivorship functions to merge two records and create a master record.

Matching chart

From the matching chart, you can have a global picture about the duplicates in the analyzed data.

The Hide groups less than parameter, which is set to 2 by default, enables you to decide what groups to show in the result chart. Usually you want to hide groups of small group size.

For example, the above matching chart indicates that:

  • 48 items are analyzed and classified into 18 groups according to a given match rule and after excluding items that are unique, by setting the Hide groups less than parameter to 2.

  • 11 groups have 2 items each. In each group, the 2 items are duplicates of each other.

  • 3 groups have 3 items each. In each group, these items are duplicates of one another.

  • 3 groups have 4 items each. In each group, these items are duplicates of one another.

  • One single group has 5 duplicate items.

Matching table

From the matching table, you can read details about the different duplicates.

This table indicates the matching details of items in each group and colors the groups in accordance with their color in the matching chart.

You can decide what groups to show in this table by setting the Hide groups of less than parameter. This parameter enables you to hide groups of small group size. It is set to 2 by default.

The buttons under the table help you navigate back and forth through the pages.

Importing match rules from the studio repository

From the tMatchGroup configuration wizard, you can import match keys from the match rules created and tested in the Profiling perspective of Talend Studio. You can then use these imported matching keys in your match Jobs.

The tMatchGroup component enables you to import from the Studio repository match rules based on the VSR or the T-Swoosh algorithms.

The VSR algorithm takes a set of records as input and groups similar duplicate records together according to the defined match rules. It compares pairs of records and assigns them to groups. The first processed record of each group is the master record of the group. The VSR algorithm compares each record with the master of each group and uses the computed distances from the master records to decide which group the record should go to.

The T-Swoosh algorithm enables you to find duplicates and to define how two similar records are merged to create a master record, using a survivorship function. These newly merged records are used to find new duplicates. The difference from the VSR algorithm is that the master record is in general a new record that does not exist in the list of input records.
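
The survivorship idea behind T-Swoosh can be sketched as follows: two matching records are merged into a new master record by applying a per-column survivorship function. Here the function is "longest value wins", and the column names and values are illustrative, not taken from any Talend rule:

```python
def merge_records(r1: dict, r2: dict, survivors: dict) -> dict:
    """Build a master record by applying a survivorship function per
    column; columns without a function keep the first record's value."""
    return {col: survivors.get(col, lambda a, b: a)(r1[col], r2[col])
            for col in r1}

longest = lambda a, b: a if len(a) >= len(b) else b

master = merge_records(
    {"fname": "Jon", "lname": "Doe", "address1": "10 Main St"},
    {"fname": "John", "lname": "Doe", "address1": "10 Main Street"},
    survivors={"fname": longest, "address1": longest},
)
print(master)  # {'fname': 'John', 'lname': 'Doe', 'address1': '10 Main Street'}
```

Note that the merged master ("John", "Doe", "10 Main Street") is a new record that matches neither input exactly, which is precisely how T-Swoosh differs from VSR.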

To import match rules from the studio repository:

  1. From the configuration wizard, click the icon on the top right corner.

    The [Match Rule Selector] wizard opens, listing all match rules created in the studio and saved in the repository.

  2. Select the match rule you want to import into the tMatchGroup component and use on your data.

    A warning message displays in the wizard if the match rule you want to import is defined on columns that do not exist in the input schema of tMatchGroup. You can define input columns later in the configuration wizard.

    It is important that the matching algorithm selected in the basic settings of the component matches the one imported from the configuration wizard. Otherwise, the Job runs with default values for the parameters that are not compatible between the two algorithms.

  3. Select the Overwrite current Match Rule in the analysis check box if you want to replace the rule in the configuration wizard with the rule you import.

    If you leave the box unselected, the match keys will be imported in a new match rule tab without overwriting the current match rule in the wizard.

  4. Click OK.

    The matching key is imported from the match rule and listed as a new rule in the configuration wizard.

  5. Click in the Input Key Attribute and select from the input data the column on which you want to apply the matching key.

  6. In the Match threshold field, enter the match probability threshold. Two data records match when the computed match score is above this value.

  7. In the Blocking Selection table, select the column(s) from the input flow which you want to use as a blocking key.

    Defining a blocking key is not mandatory but advisable. Using a blocking key partitions the data into blocks and so reduces the number of records that need to be examined, as comparisons are restricted to record pairs within each block. Using blocking key(s) is very useful when you are processing large data sets.

    The Blocking Selection table in the component is different from the Generation of Blocking Key table in the match rule editor in the Profiling perspective.

    The blocking column in tMatchGroup can come from a tGenKey component (in which case it is called T_GEN_KEY) or directly from the input schema (it could be a ZIP column, for instance). The Generation of Blocking Key table in the match rule editor, by contrast, defines the parameters needed to generate a blocking key; that table is equivalent to the tGenKey component and generates a blocking column named BLOCK_KEY that is used for blocking.

  8. Click the Chart button in the top right corner of the wizard to execute the Job using the imported match rule and show the matching results in the wizard.

Scenario 1: Grouping output data in separate flows according to the minimal distance computed in each record

This scenario describes a basic Job that compares columns in the input file using the Jaro-Winkler matching method on the lname and fname column and the q-grams matching method on the address1 column. It then groups the output records in three output flows:

  • Uniques: lists the records whose group score (minimal distance computed in the record) is equal to 1.

  • Matches: lists the records whose group score (minimal distance computed in the record) is higher than the threshold you define in the Confidence threshold field.

  • Suspects: lists the records whose group score (minimal distance computed in the record) is below the threshold you define in the Confidence threshold field.

For another scenario that groups the output records in one single output flow, see Scenario 2: Comparing columns and grouping in the output flow duplicate records that have the same functional key.

Setting up the Job

  1. Drop the following components from the Palette onto the design workspace: tFileInputExcel, tMatchGroup and three tLogRows.

  2. Connect tFileInputExcel to tMatchGroup using the Main link.

  3. Connect tMatchGroup to the three tLogRow components using the Unique rows, Confident groups and Uncertain groups links.

    Warning

    To be able to set three different output flows for the processed records, you must first select the Separate output check box in the Advanced settings view of the tMatchGroup component. For further information, see the section about configuring the tMatchGroup component.

Configuring the input component

The main input file contains eight columns: account_num, lname, fname, mi, address1, city, state_province and postal_code. The data in this input file has problems such as duplication, names spelled differently or incorrectly, and different information for the same customer.