Computing suspect pairs and unique rows

Continuous matching

author
Talend Documentation Team
EnrichVersion
6.5
EnrichProdName
Talend Data Fabric
Talend Real-Time Big Data Platform
Talend Big Data Platform
task
Design and Development > Third-party systems > Data Quality components > Matching components > Continuous matching components
Data Governance > Third-party systems > Data Quality components > Matching components > Continuous matching components
Data Quality and Preparation > Third-party systems > Data Quality components > Matching components > Continuous matching components
EnrichPlatform
Talend Data Stewardship
Talend Studio

Procedure

  1. Double-click the first tFileOutputDelimited component to display the Basic settings view and define the component properties.
    You already agreed to propagate the schema to the output components when you defined the input component.
  2. Clear the Define a storage configuration component check box to use the local system as your target file system.
  3. Click the [...] button next to Edit schema and use the [+] button in the dialog box to add the columns from the reference data set to the schema.

    Append _ref to the name of each reference column that you add to the suspect duplicates output. In this example, the columns are Original_id_ref, Source_ref, Site_name_ref and Address_ref.

  4. In the Folder field, set the path to the folder which will hold the output data.
  5. From the Action list, select the operation for writing data:
    • Select Create when you run the Job for the first time.
    • Select Overwrite to replace the file every time you run the Job.
  6. Set the row and field separators in the corresponding fields.
  7. Select the Merge results to single file check box and, in the Merge file path field, enter the path of the file to which the suspect record pairs are written.
  8. Double-click the second tFileOutputDelimited component and define its properties in the Basic settings view, as you did for the first component.
    This component creates the file which holds the unique rows generated from the input data.
  9. Press F6 to save and execute the Job.
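
The _ref naming rule from step 3 can be sketched outside the Studio as follows. This is only an illustration: with_ref_suffix is a hypothetical helper, not a Talend function, and the unsuffixed column names are assumed by dropping _ref from the names listed in this example.

```python
# Hypothetical helper for illustration only; it is not part of Talend Studio.
REF_SUFFIX = "_ref"


def with_ref_suffix(columns):
    """Return the reference-side column names, each ending in _ref."""
    return [name + REF_SUFFIX for name in columns]


# Assumed names: derived from this example by dropping the _ref suffix.
reference_columns = ["Original_id", "Source", "Site_name", "Address"]

ref_schema_columns = with_ref_suffix(reference_columns)
print(ref_schema_columns)
```

The resulting names are the ones to add with the [+] button in the schema dialog box of the first tFileOutputDelimited component.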

Results

tMatchIndexPredict pairs each record from the input data with its matching records from the reference data set indexed in Elasticsearch, and labels these suspect pairs.
It sets the unique records aside and writes them to a separate file.
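
The routing of records into the two output files can be pictured with the minimal sketch below. It does not reproduce how tMatchIndexPredict decides that two records match. The is_same_site rule is a deliberately simple stand-in, route_records is a hypothetical name, and the sample records are made up for the illustration.

```python
def route_records(input_records, reference_records, is_suspect):
    """Split input records into suspect pairs and unique rows.

    is_suspect is a caller-supplied stand-in for the real matching logic.
    """
    suspect_pairs = []
    unique_rows = []
    for record in input_records:
        matches = [ref for ref in reference_records if is_suspect(record, ref)]
        if not matches:
            # No reference record matched: the row goes to the unique rows output.
            unique_rows.append(record)
            continue
        for ref in matches:
            # Reference-side columns carry the _ref suffix, as in the output schema.
            pair = dict(record)
            pair.update({key + "_ref": value for key, value in ref.items()})
            suspect_pairs.append(pair)
    return suspect_pairs, unique_rows


def is_same_site(record, ref):
    """Assumed toy rule: same site name, ignoring case."""
    return record["Site_name"].lower() == ref["Site_name"].lower()


# Made-up sample records, for illustration only.
input_records = [
    {"Original_id": "1", "Site_name": "Alpha"},
    {"Original_id": "2", "Site_name": "Beta"},
]
reference_records = [{"Original_id": "9", "Site_name": "alpha"}]

suspect_pairs, unique_rows = route_records(
    input_records, reference_records, is_same_site
)
print(suspect_pairs)
print(unique_rows)
```

In this sketch the first input record ends up in the suspect pairs together with its reference columns, while the second one, which has no counterpart in the reference records, ends up in the unique rows.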

You can now clean and deduplicate the unique rows and use tMatchIndex to add them to the reference data set stored in Elasticsearch.