Creating a match analysis - 6.5

Talend Open Studio for MDM User Guide


This analysis enables you to create match rules and test them on data to assess the number of duplicates. Currently, you can test match rules only on columns in the same table.

Prerequisite(s): At least one database or file connection is defined under the Metadata node. For further information, see Connecting to a database.

The sequence of setting up a match analysis involves the following steps:

  1. Creating the connection to a data source from inside the editor if no connection has been defined under the Metadata folder in the Studio tree view.

    For further information, see How to configure the match analysis.

  2. Defining the table or the group of columns you want to search for similar records using match processes.

    For further information, see How to define a match analysis from the Analysis folder or How to define a match analysis from the Metadata folder.

  3. Defining blocking keys to reduce the number of pairs that need to be compared.

    For further information, see How to define a match rule.

  4. Defining match keys, the match methods according to which similar records are grouped together. For further information, see How to define a match rule.

  5. Exporting the match rules from the match analysis editor and centralizing them in the studio repository.

    For further information, see How to import or export match rules.

How to define a match analysis from the Analysis folder

  1. In the DQ Repository tree view, expand Data Profiling.

  2. Right-click the Analysis folder and select New Analysis.

    The [Create New Analysis] wizard opens.

  3. Start typing match in the filter field, select Match Analysis and then click Next to open a wizard.

  4. Set the analysis name and metadata and then click Next.

    Note

    Avoid using special characters in the item names including:

    "~", "!", "`", "#", "^", "&", "*", "\\", "/", "?", ":", ";", "\"", ".", "(", ")", "'", "¥", "‘", "’", "“", "”", "«", "»", "<", ">".

    These characters are all replaced with "_" in the file system and you may end up creating duplicate items.

  5. Expand DB connections or FileDelimited connections depending on whether the columns you want to match are in a database or a delimited file.

  6. Browse to the columns you want to match, select them and then click Finish.

    The columns you select should be in the same table. Currently, the match analysis does not work on columns in different tables.

    The match analysis editor opens listing the selected columns.

    You can also define a match analysis starting from the table or columns you want to match. For further information, see How to define a match analysis from the Metadata folder.

  7. Modify the parameters in the match analysis editor according to your needs.

    For further information, see How to configure the match analysis.

How to define a match analysis from the Metadata folder

  1. In the DQ Repository tree view, expand Metadata.

  2. Either:

    • Browse the database or the file connection to the table you want to match, right-click it and select Match Analysis, or

    • Browse the database or the file connection to the columns you want to match, right-click them and select Analyze matches.

    The columns you select should be in the same table. Currently, the match analysis does not work on columns in different tables.

    The match analysis editor opens listing all columns in the table or the group of selected columns.

  3. Set the analysis name and metadata and click Next to open the analysis editor.

    Note

    Avoid using special characters in the item names including:

    "~", "!", "`", "#", "^", "&", "*", "\\", "/", "?", ":", ";", "\"", ".", "(", ")", "'", "¥", "‘", "’", "“", "”", "«", "»", "<", ">".

    These characters are all replaced with "_" in the file system and you may end up creating duplicate items.

  4. Modify the parameters in the match analysis editor according to your needs.

    For further information, see How to configure the match analysis.

How to configure the match analysis

  1. In the Limit field in the match analysis editor, set the number of data records you want to use as a data sample.

    Data is displayed in the Data Preview table.

  2. If required, click any column name in the table to sort the sample data in ascending or descending order.

  3. In the match analysis editor, select:

    Option

    To...

    locate the selected table under the Metadata node in the tree view.

    New Connection

    create a connection to a database or to a file from inside the match analysis editor. You can then expand this new connection and select the columns on which to do the match.

    For further information about how to create a connection to data sources, see Connecting to a database and Connecting to a file.

    Select Data

    update the selection of the columns listed in the table.

    If you change the data set for an analysis, the charts that display the match results of the sample data will be cleared automatically. You must click Chart to compute the match results for the new data set you have defined.

    Refresh Data

    refresh the view of the columns listed in the table.

    n first rows or n random rows

    lists in the table the first N data records from the selected columns, or N random records from the selected columns.

    Select Blocking Key

    define the column(s) from the input flow according to which you want to partition the processed data in blocks.

    For more information, see How to define a match rule.

    Select Matching Key

    define the match rules and the column(s) from the input flow on which you want to apply the match algorithm.

    For more information, see How to define a match rule.

The Data Preview table has some additional columns which show the results of matching data. These columns are described below:

Column

Description

GID

represents the group identifier.

GRP_SIZE

counts the number of records in the group, computed only on the master record.

MASTER

identifies, by true or false, if the record used in the matching comparisons is a master record. There is only one master record per group.

Each input record is compared with the master record; if they match, the input record is added to the group.

SCORE

measures the distance between the input record and the master record according to the matching algorithm used.

GRP_QUALITY

indicates the quality of the group; only the master record has this score, which is the minimal match score in the group.

ATTRIBUTE_SCORE

lists the match score and the names of the columns used as key attributes in the applied rules.
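As a rough illustration of how these columns relate to one another, here is a minimal Python sketch; the records, names and scores are hypothetical examples, not Studio output:

```python
# Illustrative sketch of the match-result columns (hypothetical data, not
# Studio output). Each record in a group shares the group's GID; exactly one
# record per group is the MASTER, and GRP_SIZE is carried by the master only.
records = [
    {"name": "William", "GID": "g1", "MASTER": True,  "SCORE": 1.0},
    {"name": "Willian", "GID": "g1", "MASTER": False, "SCORE": 0.86},
    {"name": "Bill",    "GID": "g1", "MASTER": False, "SCORE": 0.75},
]

master = next(r for r in records if r["MASTER"])
# GRP_SIZE: the number of records in the group, computed on the master record.
master["GRP_SIZE"] = sum(1 for r in records if r["GID"] == master["GID"])
# GRP_QUALITY: the minimal match score in the group, carried by the master.
master["GRP_QUALITY"] = min(r["SCORE"] for r in records)

print(master["GRP_SIZE"], master["GRP_QUALITY"])  # 3 0.75
```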

How to define a match rule

You can define match rules from the match analysis editor by defining:

  • blocking keys, the column(s) from the input flow according to which you want to partition the processed data in blocks,

  • matching keys and survivorship rules, the match algorithms you want to apply on columns from the input flow.

Defining a blocking key

Defining a blocking key is not mandatory but strongly advisable. Using a blocking key to partition data into blocks reduces the number of records that need to be examined, as comparisons are restricted to record pairs within each block. Using blocking column(s) is very useful when you are processing a large data set.

  1. In the Data section, click the Select Blocking Key tab and then click the name of the column(s) you want to use to partition the processed data in blocks.

    Blocking keys that have the exact name of the selected columns are listed in the Blocking Key table.

    You can define more than one column in the table, but only one blocking key will be generated and listed in the BLOCK_KEY column in the Data table.

    For example, if you use an algorithm on the country and lname columns to process records that have the same first character, data records that have the same first letter in the country and last names are grouped together in the same block. Comparison is restricted to records within each block.

    To remove a column from the Blocking key table, right-click it and select Delete or click on its name in the Data table.

  2. Select an algorithm for the blocking key, and set the other parameters in the Blocking Key table as needed.

    In this example, only one blocking key is used. The first character of each word in the country column is retrieved and listed in the BLOCK_KEY column.

  3. Click Chart to compute the generated key, group the sample records in the Data table and display the results in a chart.

    This chart allows you to visualize the statistics regarding the number of blocks and to adapt the blocking parameters according to the results you want to get.
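The blocking logic described above can be sketched as follows; the records and the single-character key are hypothetical examples, not the Studio's actual implementation:

```python
from collections import defaultdict

def block_key(record):
    # Hypothetical blocking key: the first character of the country column,
    # mirroring the "first character" example in the text.
    return record["country"][:1].upper()

records = [
    {"country": "France",  "lname": "Martin"},
    {"country": "Finland", "lname": "Mattila"},
    {"country": "Spain",   "lname": "Moreno"},
]

blocks = defaultdict(list)
for r in records:
    blocks[block_key(r)].append(r)

# Pairs are only compared inside each block: "France" and "Finland" fall
# into block "F" and are compared; "Spain" is never compared against them.
pair_count = sum(len(b) * (len(b) - 1) // 2 for b in blocks.values())
print(sorted(blocks), pair_count)  # ['F', 'S'] 1
```

Without blocking, 3 records would produce 3 pairwise comparisons; here the block restriction brings it down to 1.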

Defining a matching key with the VSR algorithm

  1. In the Record linkage algorithm section, select Simple VSR Matcher if it is not selected by default.

  2. In the Data section, click the Select Matching Key tab and then click the name of the column(s) on which you want to apply the match algorithm.

    Matching keys that have the exact names of the selected input columns are listed in the Matching Key table.

    To remove a column from this table, right-click it and select Delete or click on its name in the Data table.

  3. Select the match algorithms you want to use from the Matching Function column and the null operator from the Handle Null column.

    In this example, two match keys are defined: you use the Levenshtein and Jaro-Winkler match methods on first names and last names respectively to get the duplicate records.

    If you want to use an external user-defined matching algorithm, select Custom and use the Custom Matcher column to load the Jar file of the user-defined algorithm.
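As a rough illustration of how a Levenshtein-based match key turns an edit distance into a 0-to-1 match score, here is a minimal self-contained sketch; the normalisation shown is a common convention, assumed rather than taken from the Studio:

```python
def levenshtein(a, b):
    # Classic dynamic-programming edit distance (insert, delete, substitute).
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,          # deletion
                           cur[j - 1] + 1,       # insertion
                           prev[j - 1] + (ca != cb)))  # substitution
        prev = cur
    return prev[-1]

def similarity(a, b):
    # Normalise the distance into a 0..1 match score, comparable to the
    # thresholds used by match keys (a common convention, assumed here).
    longest = max(len(a), len(b)) or 1
    return 1 - levenshtein(a, b) / longest

print(round(similarity("Steven", "Stephen"), 2))  # → 0.71
```

Two first names whose score exceeds the match threshold would be grouped as duplicates.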

Defining a matching key with the T-Swoosh algorithm

  • Make sure first to select the column(s) on which to apply the match algorithm either from the Data section by using the Select Matching Key tab, or directly from the Matching Key table.

To create a match key using the T-Swoosh algorithm:

  1. In the Record linkage algorithm section, select T-Swoosh. The Simple VSR Matcher is for use with Talend Data Quality only.

  2. In the Match and Survivor section, you define the criteria to use when matching data records. Click the [+] button to add a new rule, and then set the following criteria.

    • Match Key Name: Enter the name of your choice for the match key.

    • Matching Function: Select the type of matching you want to perform from the drop-down list. Select Custom if you want to use an external user-defined matching algorithm.

    • Custom Matcher: This item is only used with the Custom matching function. Browse and select the Jar file of the user-defined algorithm.

    • Threshold: Specify the match score (between 0 and 1) above which two values should be considered a match.

    • Confidence Weight: Set a numerical weight (between 1 and 10) to the column you want to use as a match key. This value is used to give greater or lesser importance to certain columns when performing the match.

    • Handle Null: Specify how to deal with data records which contain null values.

      • nullMatchNull: If both records contain null values, consider this a match.

      • nullMatchNone: If one record contains a null, do not consider this a match.

      • nullMatchAll: If one record contains a null, consider this a match.

    • Survivorship Function: Select how two similar records will be merged from the drop-down list.

      • Concatenate: It adds the content of the first record and the content of the second record together - for example, Bill and William will be merged into BillWilliam. In the Parameter field, you can specify a separator to be used to separate values.

      • Prefer True (for booleans): It always sets booleans to True in the merged record, unless all booleans in the source records are False.

      • Prefer False (for booleans): It always sets booleans to False in the merged record, unless all booleans in the source records are True.

      • Most common: It validates the most frequently-occurring field value in each duplicates group.

      • Most recent or Most ancient: The former validates the most recent date value and the latter the earliest date value in each duplicates group. The relevant reference column must be of the Date type.

      • Longest or Shortest: The former validates the longest field value and the latter the shortest field value in each duplicates group.

      • Largest or Smallest: The former validates the largest numerical value and the latter the smallest numerical value in a duplicates group.

        Warning

        Make sure you select Largest or Smallest as the survivorship function when the match key is of numeric type.

      • Most trusted source: It takes the data coming from the source which has been defined as being most trustworthy. The most trusted data source is set in the Parameter field.

    • Parameter: For the Most trusted source survivorship function, this item is used to set the name of the data source you want to use as a base for the master record. For the Concatenate survivorship function, this item is used to specify a separator you want to use for concatenating data.

  3. In the Match threshold field, enter the match probability threshold.

    Two data records match when the probability is above this value.

    In the Confident match threshold field, set a numerical value between the current Match threshold and 1.

  4. In the Default Survivorship Rules section, you define how to survive matches for certain data types: Boolean, Date, Number and String. If you do not specify the behavior for any or all data types, the default behavior is applied.

    • Click the [+] button to add a new row for each data type.

    • In the Data Type column, select the relevant data type from the drop-down list.

    • In the Survivorship Function column, select how two similar records will be merged from the drop-down list. Note that, depending on the data type, only certain choices may be relevant.

      Warning

      Make sure you select Largest or Smallest as the survivorship function when the match key is of numeric type.

    • Parameter: For the Most trusted source survivorship function, this item is used to set the name of the data source you want to use as a base for the master record. For the Concatenate survivorship function, this item is used to specify a separator you want to use for concatenating data.

  5. Save your changes.
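The survivorship functions above can be illustrated with a minimal sketch; the survive helper and the sample group are hypothetical, not the Studio's merge engine:

```python
# Hypothetical helper illustrating a few of the survivorship functions
# described above, applied per column to a group of duplicate records.
def survive(values, function, parameter=""):
    if function == "Concatenate":
        # Joins all values, optionally with the separator from Parameter.
        return parameter.join(values)
    if function == "Most common":
        # Keeps the most frequently occurring value in the group.
        return max(set(values), key=values.count)
    if function == "Longest":
        return max(values, key=len)
    if function == "Largest":
        return max(values, key=float)
    raise ValueError(f"unknown survivorship function: {function}")

group = [
    {"first": "Bill",    "city": "Paris", "amount": "120"},
    {"first": "William", "city": "Paris", "amount": "90"},
]

master = {
    "first":  survive([r["first"] for r in group], "Concatenate"),
    "city":   survive([r["city"] for r in group], "Most common"),
    "amount": survive([r["amount"] for r in group], "Largest"),
}
print(master)  # {'first': 'BillWilliam', 'city': 'Paris', 'amount': '120'}
```

The Concatenate result matches the Bill/William example in the text: with no separator set in Parameter, the merged value is BillWilliam.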

Editing rules and displaying sample results

  1. To define a second match rule, put your cursor on the top right corner of the Matching Key table and click the [+] button to create a new rule.

    Follow the steps outlined in How to define a match rule to define matching keys.

    When you define multiple conditions in the match rule editor, an OR match operation is conducted on the analyzed data: records are evaluated against the first rule, and the records that match it are not evaluated against the subsequent rules.

  2. Click the button at the top right corner of the Matching Key or Match and Survivor section and replace the default name of the rule with a name of your choice.

    If you define more than one rule in the match analysis, you can use the up and down arrows in the dialog box to change the rule order and thus decide what rule to execute first.

  3. Click OK.

    The rules are named and ordered accordingly in the section.

  4. In the Match threshold field, enter the match probability threshold.

    Two data records match when the probability is above this value.

    In the Confident match threshold field, set a numerical value between the current Match threshold and 1.

    If the GRP_QUALITY calculated by the match analysis is equal to or greater than the Confident match threshold, you can be confident about the quality of the group.

  5. Click Chart to compute the groups according to the blocking key and match rule you defined in the editor and display the results of the sample data in a chart.

    This chart shows a global picture of the duplicates in the analyzed data. The Hide groups less than parameter is set to 2 by default. This parameter enables you to decide which groups to show in the chart; you usually want to hide the smallest groups.

    The chart in the above image indicates that out of the 1000 sample records you examined and after excluding items that are unique, by having the Hide groups less than parameter set to 2:

    • 49 groups have 2 items each. In each group, the 2 items are duplicates of each other.

    • 7 groups have 3 duplicate items and the last group has 4 duplicate items.

    Also, the Data table indicates the match details of items in each group and colors the groups in accordance with their colors in the match chart.
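The chart is essentially a histogram of group sizes with the smallest groups hidden; here is a minimal sketch with hypothetical group identifiers:

```python
from collections import Counter

# Hypothetical group assignments: each value is a record's GID. The chart
# described in the text is a histogram of group sizes, with groups smaller
# than the "Hide groups less than" value (2 by default) excluded.
gids = ["g1", "g1", "g2", "g2", "g2", "g3", "g4", "g4"]
hide_groups_less_than = 2

group_sizes = Counter(gids).values()
chart = Counter(s for s in group_sizes if s >= hide_groups_less_than)

# chart maps "group size" -> "number of groups of that size";
# the singleton group g3 is hidden.
print(sorted(chart.items()))  # [(2, 2), (3, 1)]
```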

How to show the match results

To collect duplicates from the input flow according to the match types you define, Levenshtein and Jaro-Winkler in this example, do the following:

  • Save the settings in the match analysis editor and press F6.

    The analysis is executed. The match rule and blocking key are computed against the whole data set and the Analysis Results view is open in the editor.

    In this view, the charts give a global picture of the duplicates in the analyzed data. In the first tables, you can read statistics about the count of processed records, distinct records with only one occurrence, duplicate records (matched records), and suspect records that did not match the rule. Duplicate records are records that matched with a good score, above the confidence threshold. One record of the matched pair is a duplicate that should be discarded, and the other is the survivor record.

    In the second table, you can read statistics about the number of groups and the number of records in each group. You can click any column header in the table to sort the results accordingly.

How to import or export match rules

You can import match rules from the studio repository and use them in the match editor to test them on your data. You can also export match rules from the match editor and save them in the studio repository.


Importing match rules from the repository

  1. In the match editor, click the icon on top of the editor.

  2. In the [Match Rule Selector] wizard, select the match rule you want to import into the match analysis editor and use on the analyzed data.

    A warning message displays in the wizard if the match rule you want to import is defined on columns that do not exist in the analyzed data. Ignore the message as you can define input columns later in the match analysis editor.

  3. Select the Overwrite current Match Rule in the analysis check box if you want to replace the rule in the editor with the rule you import, otherwise, leave the box unselected.

  4. Click OK.

    The match rule is imported, and the matching keys, blocking keys and/or survivorship rules are listed in the Matching Key and Blocking Key tables respectively.

  5. Click in the Input column and select from the list the column on which you want to apply the imported blocking and matching keys.

    If the analyzed data has a column that matches the input column in the imported keys, it is automatically defined in the Input column; you do not need to define it yourself.

    When you analyze data with multiple conditions, the match results list data records that meet any of the defined rules. When you execute the match analysis, an OR match operation is conducted on the data: records are evaluated against the first rule, and the records that match it are not evaluated against the other rules.

Exporting match rules to the repository

  1. In the match editor, click the icon on top of the editor.

  2. In the open wizard, enter a name for the rule and set other metadata, if needed.

  3. Click Finish.

    The rule editor opens on the rule settings and the rule is saved and listed under Libraries > Rules > Match in the DQ Repository tree view.