Configuring the components - 7.0

Data matching

author
Talend Documentation Team
EnrichVersion
7.0
EnrichProdName
Talend Big Data Platform
Talend Data Fabric
Talend Data Management Platform
Talend Data Services Platform
Talend MDM Platform
Talend Real-Time Big Data Platform
EnrichPlatform
Talend Studio

Procedure

  1. Double-click tRecordMatching to display its Basic settings view and define its properties.
  2. Click the Edit schema button to open a dialog box. Here you can define the data you want to pass to the output components.
    In this example, we want to pass to the tRecordMatching component the name and email columns from the first tMysqlInput component, and the ref_name and ref_email columns from the second tMysqlInput component.
    The MATCHING_DISTANCE and the MATCHING_WEIGHT columns in the output schema are defined by default.
    The MATCHING_WEIGHT column always contains a value between 0 and 1. It is a global distance between the sets of columns to be matched.
    The MATCHING_DISTANCE column contains one distance for each column on which an algorithm is applied. The values are separated by a vertical bar (pipe).
    Click OK to close the dialog box and proceed to the next step.
  3. In the Key Definition area of the Basic settings view of tRecordMatching, click the plus button to add two columns to the list.
  4. From the Input key attribute and Lookup key attribute lists, select the input columns and the lookup columns, respectively, on which you want to do the fuzzy matching.
    In this example, select name and email as input attributes and ref_name and ref_email as lookup attributes.
    Note: When you apply a matching algorithm to a date column, you can decide which part of the date to compare.

    For example, to compare only the year in the date, set the type of the date column to Date in the component schema and then enter "yyyy" in the Date Pattern field. The component then converts the date to a string according to the pattern defined in the schema before starting a string comparison.

  5. Click in the Matching type column and select q-gram from the list. This is the method used on the first column to check the incoming data against the reference data.
  6. Set the matching type for the second column, Levenshtein in this example.
    The minimum and maximum possible match values are defined in the Advanced settings view. You can change the default values.
  7. From the Tokenized measure list, choose not to use a tokenized distance for the selected algorithms.
  8. In the Weight column, set a numerical weight for each of the columns used as key attributes.
  9. Click in the cell of the Handle Null column and select the null operator you want to use to handle null attributes in the columns.
  10. If required, click the plus button below the Blocking Selection table to add one or more lines in the table and then click in the line and select from the list the column you want to use as a blocking value.
    Using a blocking value reduces the number of record pairs that need to be examined. The input data is partitioned into exhaustive blocks based on the blocking value, and comparison is restricted to record pairs within each block. For a use case of the blocking value, see Comparing columns and grouping in the output flow duplicate records that have the same functional key.
  11. Click the Advanced settings tab to open the corresponding view and make sure to select the Simple VSR algorithm.
  12. Double-click the first tLogRow component to display its Basic settings view, and select Table in the Mode area so that the source file and the tRecordMatching results are displayed together and can be compared.
  13. Do the same for the other two tLogRow components.
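
To make the key definition settings above more concrete, the following Python sketch shows one plausible way that per-column scores and weights could combine into a single value between 0 and 1, with the per-column scores joined by a vertical bar. It is an illustration only and not the tRecordMatching implementation: the function names, the q-gram formula (padded trigrams with a Dice-style ratio), the normalization of the Levenshtein distance, the weighted average and the sample rows are all assumptions made for this example.

```python
from collections import Counter


def levenshtein_similarity(a: str, b: str) -> float:
    """Normalized Levenshtein similarity in [0, 1]; 1.0 means identical."""
    if not a and not b:
        return 1.0
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return 1.0 - prev[-1] / max(len(a), len(b))


def qgram_similarity(a: str, b: str, q: int = 3) -> float:
    """Dice-style overlap of padded q-grams (an assumed definition)."""
    def grams(s: str) -> Counter:
        padded = "#" * (q - 1) + s + "#" * (q - 1)
        return Counter(padded[i:i + q] for i in range(len(padded) - q + 1))

    ga, gb = grams(a), grams(b)
    total = sum(ga.values()) + sum(gb.values())
    if total == 0:
        return 1.0
    shared = sum((ga & gb).values())  # multiset intersection
    return 2.0 * shared / total


def match_pair(input_row: dict, lookup_row: dict, keys: list) -> tuple:
    """Return (global score, pipe-separated per-column scores).

    Each entry of `keys` is (input column, lookup column, measure, weight).
    """
    total_weight = sum(weight for _, _, _, weight in keys)
    if total_weight <= 0:
        raise ValueError("at least one key needs a positive weight")
    scores = [measure(input_row[i], lookup_row[r]) for i, r, measure, _ in keys]
    weighted = sum(s * w for s, (_, _, _, w) in zip(scores, keys))
    return weighted / total_weight, "|".join(f"{s:.2f}" for s in scores)


# Key definition mirroring the example: q-gram on name, Levenshtein on email.
keys = [
    ("name", "ref_name", qgram_similarity, 1.0),
    ("email", "ref_email", levenshtein_similarity, 1.0),
]
# Made-up sample rows, for illustration only.
row = {"name": "John Smith", "email": "jsmith@example.com"}
ref = {"ref_name": "Jon Smith", "ref_email": "jsmith@example.com"}
weight, distance = match_pair(row, ref, keys)
print(weight, distance)
```

Raising the weight of one key in `keys` makes that column count more in the global score, which is the effect the Weight column is meant to give in the Key Definition table.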
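
The effect of a blocking value can also be sketched in a few lines of Python. The example below groups the reference rows by a blocking column and pairs each input row only with reference rows from the same block. The `country` column, the helper name `blocked_pairs` and the sample rows are hypothetical and chosen only for illustration; this is not how the component is implemented internally.

```python
from collections import defaultdict


def blocked_pairs(input_rows, lookup_rows, input_col, lookup_col):
    """Yield (input row, lookup row) pairs that share the same blocking value."""
    blocks = defaultdict(list)
    for ref in lookup_rows:
        blocks[ref[lookup_col]].append(ref)
    for row in input_rows:
        # Only rows in the same block are candidates for comparison.
        for ref in blocks.get(row[input_col], []):
            yield row, ref


# Made-up sample data with a hypothetical "country" blocking column.
inputs = [
    {"name": "Ann", "country": "FR"},
    {"name": "Bob", "country": "US"},
    {"name": "Cid", "country": "FR"},
]
lookups = [
    {"ref_name": "Anne", "country": "FR"},
    {"ref_name": "Robert", "country": "US"},
    {"ref_name": "Sid", "country": "DE"},
]

pairs = list(blocked_pairs(inputs, lookups, "country", "country"))
all_pairs = len(inputs) * len(lookups)
print(f"pairs without blocking: {all_pairs}, with blocking: {len(pairs)}")
```

The trade-off is that records placed in different blocks are never compared, so the blocking column should hold values that are reliable in both the input and the reference data.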