Configuring the components - 7.0

Fuzzy matching

author
Talend Documentation Team
EnrichVersion
7.0
EnrichProdName
Talend Big Data Platform
Talend Data Fabric
Talend Data Management Platform
Talend Data Services Platform
Talend MDM Platform
Talend Real-Time Big Data Platform
EnrichPlatform
Talend Studio

Procedure

  1. Double-click tFileInputDelimited to open its Basic settings view and define its properties.
  2. Click the three-dot button next to the File Name field to browse to the file holding the input data.
  3. If needed, set Header, Footer, and Limit.
    For this scenario, set Header to 1. Footer and Limit are left unset.
  4. Click Edit schema to open a dialog box where you can describe the data structure of the source delimited file.
    In this scenario, the source schema is made of the following columns: ID, Status, FirstName, Email, City, Initial, and ZipCode.
  5. Double-click tFuzzyUniqRow to display its Basic settings view and define its properties.
  6. In the Key Attribute column, select the check boxes next to the columns you want to check with the defined matching method: FirstName, Email, City, and ZipCode in this example.
  7. In the Matching Type column, set the matching methods you want to use on each of the selected columns.
    In this example, Levenshtein is used as the matching method for the FirstName, Email, and ZipCode columns, and Double Metaphone is used for the City column.
    Then set the minimum and maximum distances for the Levenshtein method. With this method, the distance is the number of character changes (insertions, deletions, or substitutions) needed for the entry to fully match the reference. In this example, set the minimum distance to 0 and the maximum distance to 2. This outputs all entries in the FirstName, Email, and ZipCode columns that match the reference exactly or differ from it by at most two character changes. There is no minimum or maximum distance to set for Double Metaphone because this matching method is based on phonetic discrepancies in the input data.
  8. Double-click the first tFileOutputExcel to display its Basic settings view and define its properties.
  9. Set the destination file name as well as the Sheet name and select the Include header check box.
  10. Do the same for the second tFileOutputExcel.
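
The distance rule from step 7 can be illustrated with a short Java sketch. This is not the tFuzzyUniqRow implementation, only a textbook Levenshtein computation under the settings used in this scenario (minimum distance 0, maximum distance 2). The class name LevenshteinSketch, the method names, and the sample first names are hypothetical and chosen for illustration.

```java
// Hypothetical helper, not part of Talend: illustrates the min/max distance rule only.
class LevenshteinSketch {

    // Textbook dynamic-programming edit distance counting
    // insertions, deletions, and substitutions (case-sensitive).
    static int distance(String a, String b) {
        int[] prev = new int[b.length() + 1];
        int[] curr = new int[b.length() + 1];
        for (int j = 0; j <= b.length(); j++) {
            prev[j] = j; // building the first j characters of b from "" takes j insertions
        }
        for (int i = 1; i <= a.length(); i++) {
            curr[0] = i; // reducing the first i characters of a to "" takes i deletions
            for (int j = 1; j <= b.length(); j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                int insertion = curr[j - 1] + 1;
                int deletion = prev[j] + 1;
                int substitution = prev[j - 1] + cost;
                curr[j] = Math.min(Math.min(insertion, deletion), substitution);
            }
            int[] tmp = prev; // reuse the two rows instead of allocating a full matrix
            prev = curr;
            curr = tmp;
        }
        return prev[b.length()];
    }

    // True when the entry is within [min, max] character changes of the reference.
    static boolean withinRange(String entry, String reference, int min, int max) {
        int d = distance(entry, reference);
        return d >= min && d <= max;
    }

    public static void main(String[] args) {
        String reference = "Jonathan"; // sample value, assumed for illustration
        String[] entries = {"Jonathan", "Jonathon", "Jonatan", "Nathan"};
        for (String entry : entries) {
            System.out.println(entry + ": distance=" + distance(entry, reference)
                    + ", withinRange(0, 2)=" + withinRange(entry, reference, 0, 2));
        }
    }
}
```

With a minimum of 0, exact matches are included in the range. Raising the minimum to 1 in this sketch would keep only near matches and exclude identical values.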