Configuring the match analysis - 7.1

Talend Data Management Platform Studio User Guide

author: Talend Documentation Team
EnrichVersion: 7.1
EnrichProdName: Talend Data Management Platform
task: Design and Development
EnrichPlatform: Talend Studio

Procedure

  1. In the Limit field of the match analysis editor, set the number of data records you want to use as a data sample.
    Data is displayed in the Data Preview table.
  2. If required, click any column name in the table to sort the sample data in ascending or descending order.
  3. In the match analysis editor, select:

    Option

    To...

    locate the selected table under the Metadata node in the tree view.

    New Connection

    create a connection to a database or to a file from within the match analysis editor. You can then expand this new connection and select the columns on which to do the match.

    For further information about how to create a connection to data sources, see Connecting to a database and Connecting to a file.

    Select Data

    update the selection of the columns listed in the table.

    If you change the data set for an analysis, the charts that display the match results of the sample data are cleared automatically. You must click Chart to compute the match results for the new data set.

    Refresh Data

    refresh the view of the columns listed in the table.

    n first rows or n random rows

    list in the table the first N data records from the selected columns, or N random records from the selected columns.

    Select Blocking Key

    define the column(s) from the input flow used to partition the processed data into blocks.

    For more information, see Defining a match rule.

    Select Matching Key

    define the match rules and the column(s) from the input flow on which you want to apply the match algorithm.

    For more information, see Defining a match rule.

    Store on disk

    store processed data blocks on the disk to maximize system performance.

    Max buffer size: Type in the size of physical memory you want to allocate to processed data.

    Temporary data directory path: Set the path to the directory where the temporary file should be stored.
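To make the sampling and blocking options above more concrete, here is a minimal Python sketch of what they do conceptually. It is not Talend code: the records, the column names (`id`, `name`, `zip`) and the helper functions `first_rows`, `random_rows` and `partition_by_blocking_key` are all hypothetical and serve only as an illustration.

```python
import random
from collections import defaultdict

# Hypothetical sample records; the column names are illustrative only.
records = [
    {"id": 1, "name": "John Smith", "zip": "75001"},
    {"id": 2, "name": "Jon Smith", "zip": "75001"},
    {"id": 3, "name": "Mary Jones", "zip": "69002"},
    {"id": 4, "name": "Marie Jones", "zip": "69002"},
    {"id": 5, "name": "Paul Martin", "zip": "13001"},
]


def first_rows(rows, n):
    """Return the first n records (comparable to the 'n first rows' option)."""
    return rows[:n]


def random_rows(rows, n, seed=None):
    """Return n records picked at random (comparable to 'n random rows')."""
    rng = random.Random(seed)
    return rng.sample(rows, min(n, len(rows)))


def partition_by_blocking_key(rows, key_column):
    """Group records into blocks that share the same blocking-key value."""
    blocks = defaultdict(list)
    for row in rows:
        blocks[row[key_column]].append(row)
    return dict(blocks)


# Take a sample limited to 4 records, then partition it on the "zip" column.
sample = first_rows(records, 4)
blocks = partition_by_blocking_key(sample, "zip")
for key, block in blocks.items():
    print(key, [row["id"] for row in block])
```

In this sketch, records are only compared with other records of the same block, which is the reason for partitioning: it reduces the number of pairs the match algorithm has to evaluate.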

Results

The Data Preview table has some additional columns that show the results of matching the data. These columns indicate the following:

Column

Description

GID

represents the group identifier.

GRP_SIZE

counts the number of records in the group, computed only on the master record.

MASTER

identifies, by true or false, if the record used in the matching comparisons is a master record. There is only one master record per group.

Each input record is compared to the master record; if they match, the input record is added to the group.

SCORE

measures the distance between the input record and the master record according to the matching algorithm used.

GRP_QUALITY

indicates the quality of the group. Only the master record has a quality score, which is the minimal value in the group.

ATTRIBUTE_SCORE

lists the match score and the names of the columns used as key attributes in the applied rules.

These are the same columns that you find in the output schema of the tMatchGroup component. For further information, see the tMatchGroup documentation in the Talend Components Reference Guide.
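As a rough illustration of how the result columns relate to each other, the following Python sketch groups a few names and fills in GID, GRP_SIZE, MASTER, SCORE and GRP_QUALITY. It rests on several assumptions that do not come from the product: the similarity measure is `difflib.SequenceMatcher` from the Python standard library (a stand-in, not the algorithm used by the match analysis), the 0.8 threshold and the `G1`, `G2` identifiers are arbitrary, the first record of a group is taken as its master, and non-master records get a GRP_SIZE of 0 and no GRP_QUALITY.

```python
from difflib import SequenceMatcher


def similarity(a, b):
    """Stand-in similarity in [0, 1]; not the algorithm the product uses."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def group_records(names, threshold=0.8):
    """Compare each record to the existing master records and build result rows."""
    groups = []  # each group: {"gid": str, "members": [(name, score), ...]}
    for name in names:
        for group in groups:
            master_name = group["members"][0][0]
            score = similarity(name, master_name)
            if score >= threshold:
                group["members"].append((name, score))
                break
        else:
            # No master matched: this record starts a new group as its master.
            groups.append({"gid": f"G{len(groups) + 1}", "members": [(name, 1.0)]})

    rows = []
    for group in groups:
        scores = [score for _, score in group["members"]]
        for index, (name, score) in enumerate(group["members"]):
            is_master = index == 0
            rows.append({
                "name": name,
                "GID": group["gid"],
                # Assumption of this sketch: size and quality only on the master.
                "GRP_SIZE": len(group["members"]) if is_master else 0,
                "MASTER": is_master,
                "SCORE": round(score, 3),
                "GRP_QUALITY": round(min(scores), 3) if is_master else None,
            })
    return rows


rows = group_records(["John Smith", "Jon Smith", "Mary Jones", "John Smyth"])
for row in rows:
    print(row)
```

With this input, the three "Smith"-like names end up in one group and "Mary Jones" forms a group of its own. Each group has exactly one row where MASTER is true, and that row carries the group size and the minimal score of the group.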