Configuring the match analysis - Cloud - 8.0

Talend Studio User Guide

Last publication date: 2024-02-29
Available in...
  • Big Data Platform
  • Cloud API Services Platform
  • Cloud Big Data Platform
  • Cloud Data Fabric
  • Cloud Data Management Platform
  • Data Fabric
  • Data Management Platform
  • Data Services Platform
  • MDM Platform
  • Real-Time Big Data Platform

Procedure

  1. In the Limit field, set the number of data records you want to use as a data sample.
    Screenshot of the Match Analysis view.
  2. Optional: Click any column name in the table to sort the sample data in ascending or descending order.
  3. In the match analysis editor, configure the options.
    Option: Purpose
    • Show in Data Quality repository icon: Locate the selected table under the Metadata node in the tree view.
    • New Connection: Create a connection to a database or to a file from inside the match analysis editor. You can then expand the new connection and select the columns on which to run the match.
      For further information about how to create a connection to data sources, see Creating connections to data sources.
    • Select Data: Update the selection of the columns listed in the table.
      If you change the dataset for an analysis, the charts that display the match results of the sample data are cleared automatically. Click Chart to compute the match results for the new dataset you have defined.
    • Refresh Data: Refresh the view of the columns listed in the table.
    • n first rows or n random rows: List in the table either the first N data records or N random records from the selected columns.
    • Select Blocking Key: Define the columns from the input flow by which you want to partition the processed data into blocks.
      For more information, see Defining a match rule.
    • Select Matching Key: Define the match rules and the columns from the input flow on which you want to apply the match algorithm.
      For more information, see Defining a match rule.
    • Store on disk: Store processed data blocks on disk to maximize system performance.
      Max buffer size: Enter the amount of physical memory you want to allocate to processed data.
      Temporary data directory path: Set the path to the directory where the temporary file must be stored.
      Allow drill down: Select to enable the View rows feature from the Analysis Results tab. It displays a list of duplicate rows or groups of the same size. For more information, see Viewing and exporting the analyzed data.
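
The sampling and blocking options above can be pictured with a small sketch. It is an illustration only, written in plain Python rather than with any Talend API: the function names (first_rows, random_rows, partition_by_blocking_key) and the sample records are hypothetical, and the Studio's own sampling and blocking logic may differ.

```python
import random
from collections import defaultdict


def first_rows(records, n):
    """Return the first n records (mirrors the 'n first rows' option)."""
    return records[:n]


def random_rows(records, n, seed=None):
    """Return up to n records picked at random, without replacement
    (mirrors the 'n random rows' option)."""
    rng = random.Random(seed)
    return rng.sample(records, min(n, len(records)))


def partition_by_blocking_key(records, key_columns):
    """Group records into blocks that share the same blocking-key values.
    Only records inside the same block would later be compared."""
    blocks = defaultdict(list)
    for record in records:
        key = tuple(record[col] for col in key_columns)
        blocks[key].append(record)
    return dict(blocks)


# Hypothetical sample data, not taken from the guide.
records = [
    {"id": 1, "name": "Ann Lee", "city": "Paris"},
    {"id": 2, "name": "Anne Lee", "city": "Paris"},
    {"id": 3, "name": "Bob Ray", "city": "Lyon"},
    {"id": 4, "name": "Rob Ray", "city": "Lyon"},
    {"id": 5, "name": "Cy Dunn", "city": "Nice"},
]

sample = first_rows(records, 4)                          # Limit = 4, first rows
blocks = partition_by_blocking_key(sample, ["city"])     # blocking key = city
```

Partitioning on a blocking key reduces the number of pairwise comparisons, because the match algorithm only has to compare records that fall into the same block.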

Results

The Data Preview table has additional columns that show the results of matching data:
  • GID: represents the group identifier.
  • GRP_SIZE: counts the number of records in the group, computed only on the master record.
  • MASTER: indicates, by true or false, whether the record used in the matching comparisons is a master record. There is only one master record per group.

    Each input record will be compared to the master record, and if they match, the input record will be in the group.

  • SCORE: measures the distance between the input record and the master record according to the matching algorithm used.
  • GRP_QUALITY: gives the quality score of the group, which is the minimal value in the group. Only the master record has a quality score.
  • ATTRIBUTE_SCORE: lists the match score and the names of the columns used as key attributes in the applied rules.
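
To make the relationship between these columns concrete, here is a minimal sketch of how GID, GRP_SIZE, MASTER, SCORE and GRP_QUALITY could be filled for the records of one block. It rests on several assumptions: the similarity measure is Python's difflib.SequenceMatcher, used as a stand-in for whatever matching algorithm the rule applies; SCORE is expressed as a similarity between 0 and 1; the 0.8 threshold and the group_records function are hypothetical; and ATTRIBUTE_SCORE is left out. It is not the Studio's implementation.

```python
from difflib import SequenceMatcher


def similarity(a, b):
    """Stand-in similarity in [0, 1]; 1.0 means the strings are identical."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def group_records(names, threshold=0.8):
    """Compare each record to the existing master records. A record that
    matches a master joins its group; otherwise it starts a new group
    and becomes that group's master."""
    rows = []
    masters = []  # positions in `rows` of the master records
    for name in names:
        row = {"name": name, "GID": None, "GRP_SIZE": 0,
               "MASTER": False, "SCORE": 1.0, "GRP_QUALITY": None}
        for m in masters:
            score = similarity(name, rows[m]["name"])
            if score >= threshold:
                row["GID"] = rows[m]["GID"]
                row["SCORE"] = score
                # Size and quality are kept on the master record only.
                rows[m]["GRP_SIZE"] += 1
                rows[m]["GRP_QUALITY"] = min(rows[m]["GRP_QUALITY"], score)
                break
        else:
            # No master matched: this record is the master of a new group.
            row.update(GID=len(masters) + 1, GRP_SIZE=1,
                       MASTER=True, GRP_QUALITY=1.0)
            masters.append(len(rows))
        rows.append(row)
    return rows


rows = group_records(["Ann Lee", "Anne Lee", "Bob Ray"])
for r in rows:
    print(r)
```

In this sketch the first two names end up in the same group with "Ann Lee" as master, while "Bob Ray" forms a group of its own. The master's GRP_QUALITY equals the lowest SCORE found in its group.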