Data matching with Talend tools - 7.2

author
Talend Documentation Team
EnrichVersion
Cloud
7.2
EnrichProdName
Talend Big Data Platform
Talend Data Fabric
Talend Data Management Platform
Talend Data Services Platform
Talend MDM Platform
Talend Real-Time Big Data Platform
EnrichPlatform
Talend Studio

What is data matching?

Data matching is the process that enables you to find records representing the same entity in a data set.

General definition

Data matching enables you to:
  • find duplicates, potential duplicates and non-duplicates in a data source;
  • analyze data and return weighted probabilities of matching;
  • merge identical or similar entries into a single entry; and
  • reduce disparity across different data sources.

Record linkage

Record linkage consists of identifying records that refer to the same entity in a data set.

Two types of data record linkage exist:
  • deterministic record linkage, which is based on identifiers that match; and
  • probabilistic record linkage, which is based on the probability that identifiers match.
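
The difference between the two types can be illustrated with a minimal Python sketch. The records, the field names and the agreement weights below are made up for the example, and the score is a simplified weighted agreement, not the computation performed by any Talend component.

```python
# Hypothetical records; the field names are illustrative only.
a = {"ssn": "123-45-6789", "last_name": "Smith", "birth_year": 1980, "zip": "75001"}
b = {"ssn": "123-45-6789", "last_name": "Smyth", "birth_year": 1980, "zip": "75001"}

def deterministic_link(r1, r2, key="ssn"):
    """Link two records only when the chosen identifier is present and identical."""
    return r1.get(key) is not None and r1.get(key) == r2.get(key)

# Assumed per-field agreement weights (invented for this example).
WEIGHTS = {"last_name": 0.5, "birth_year": 0.3, "zip": 0.2}

def probabilistic_score(r1, r2, weights=WEIGHTS):
    """Sum the weights of the fields on which both records agree."""
    return sum(w for field, w in weights.items() if r1.get(field) == r2.get(field))

print(deterministic_link(a, b))   # the identifiers are identical
print(probabilistic_score(a, b))  # agreement on birth_year and zip only
```

With deterministic linkage the answer is yes or no. With probabilistic linkage you get a score that you then compare to a threshold of your choice.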

What to do before matching?

Profiling data

Data profiling is the process of examining the data available in different data sources and collecting statistics and information about this data.

Data profiling helps you assess the quality level of the data against defined goals.

Data quality issues can stem from many different sources, including legacy systems, data migrations, database modifications, inconsistencies in human communication and countless other potential anomalies. Regardless of the source, data quality issues can impair a business's ability to use its data to make insightful decisions.

If data is of a poor quality, or managed in structures that cannot be integrated to meet the needs of the enterprise, business processes and decision-making suffer.

Compared to manual analysis techniques, data profiling technology improves the enterprise's ability to manage data quality and to address the data quality challenges faced during data migrations and data integrations.

Standardizing data

Standardizing data before trying to perform matching tasks is an essential step to improve matching accuracy.
Talend provides different ways to standardize data:
  • You can standardize data against indices. Synonyms are standardized or converted to the "master" words.

    For more information on available data synonym dictionaries, see the Talend Data Fabric Studio User Guide.

  • You can use address validation components to standardize address data against Experian QAS, Loqate and MelissaData validation tools. The addresses returned by these tools are consistent and variations in address representations are eliminated. As addresses are standardized, matching gets easier.

    For more information on the tQASBatchAddressRow, tLoqateAddressRow and tMelissaDataAddress components, see Address standardization.

  • You can use the tStandardizePhoneNumber component to standardize a phone number, based on the formatting convention of the country of origin.

    For more information on phone number standardization, see Phone number standardization.

  • You can use other more generic components to transform your data and get more standardized records, such as tReplace, tReplaceList, tVerifyEmail, tExtractRegexFields or tMap.
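
To see why standardization makes matching easier, consider the following minimal Python sketch. It is not how the Talend components work internally: the synonym index and the cleaning rules are assumptions chosen for the example.

```python
import re

# Hypothetical synonym index: each variant maps to its "master" word.
SYNONYMS = {"st": "street", "str": "street", "ave": "avenue", "av": "avenue", "rd": "road"}

def standardize(text, synonyms=SYNONYMS):
    """Lowercase, replace punctuation with spaces, collapse spaces, apply synonyms."""
    tokens = re.sub(r"[^\w\s]", " ", text.lower()).split()
    return " ".join(synonyms.get(token, token) for token in tokens)

first = standardize("12 Main St.")
second = standardize("12  main Street")
print(first)
print(second)
```

Once both values are standardized, they are strictly identical, so even an exact comparison finds the match.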

How do you match?

The classical matching approach

The classical approach consists of sorting data into similar-sized partitions whose records share the same attribute value, choosing metrics and defining matching rules.

Blocking by partitions

Record linkage is a demanding task because each record must be compared to all the other records in the data set. To make this task more efficient, blocking is usually required.

Blocking consists of sorting data into similar-sized partitions whose records share the same attribute value. The objective is to restrict comparisons to the records grouped within the same partition.

To create efficient partitions, you need to find attributes which are unlikely to change, such as a person's first name or last name. By doing this, you improve the reliability of the blocking step and the computation speed of the task.

It is recommended to use the tGenKey component to generate blocking keys and to view the distribution of the blocks.

For more information on generating blocking keys, see Identification.
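
The effect of blocking on the number of comparisons can be sketched in a few lines of Python. The records and the blocking key (the first three letters of the last name) are assumptions made for the example. tGenKey offers its own key-generation algorithms, which are not reproduced here.

```python
from collections import defaultdict
from itertools import combinations

# Hypothetical records: (id, last_name, first_name)
records = [
    (1, "Smith", "John"),
    (2, "Smyth", "Jon"),
    (3, "Smith", "Jane"),
    (4, "Dupont", "Marie"),
    (5, "Dupond", "Marie"),
    (6, "Martin", "Paul"),
]

def blocking_key(record):
    """Assumed key: first three letters of the last name, in upper case."""
    return record[1][:3].upper()

# Group the records by blocking key.
blocks = defaultdict(list)
for record in records:
    blocks[blocking_key(record)].append(record)

# Count the comparisons with and without blocking.
all_pairs = len(list(combinations(records, 2)))
blocked_pairs = sum(len(list(combinations(block, 2))) for block in blocks.values())

print({key: len(block) for key, block in blocks.items()})
print(all_pairs, blocked_pairs)  # 15 comparisons without blocking, 2 with blocking
```

Note that "Smith" and "Smyth" end up in different blocks and are therefore never compared. This is why the choice of the blocking key matters and why several passes with different keys can be needed.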

Choosing metrics and defining matching rules

After blocking data into similar-sized groups, you can create match rules and test them before using them in the tMatchGroup component.

For more information about creating a match analysis, see Talend Data Fabric Studio User Guide.

Matching functions in the tMatchGroup component

tMatchGroup helps you create groups of similar data records in any source of data, including large volumes of data, by using one or several match rules.

Each created group is made up of a master record and records similar to this master record. The matching functions used to compute similarity measures between similar records and the master record include the following:
  • Phonetic algorithms, such as Soundex or Metaphone, are used to match names.
  • The Levenshtein distance calculates the minimum number of edits required to transform one string into another.
  • The Jaro distance matches processed entries according to spelling deviations.
  • The Jaro-Winkler distance is a variant of Jaro giving more importance to the beginning of the string.
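
As an illustration of the distances listed above, here is a minimal Python sketch of the textbook Levenshtein, Jaro and Jaro-Winkler definitions. It is not the code used by tMatchGroup. The prefix scale of 0.1 and the prefix length of 4 are the usual defaults and are assumptions here. Phonetic algorithms are left out to keep the example short.

```python
def levenshtein(s, t):
    """Minimum number of insertions, deletions and substitutions turning s into t."""
    prev = list(range(len(t) + 1))
    for i, cs in enumerate(s, start=1):
        curr = [i]
        for j, ct in enumerate(t, start=1):
            cost = 0 if cs == ct else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]

def jaro(s, t):
    """Jaro similarity between 0.0 (nothing in common) and 1.0 (identical)."""
    if s == t:
        return 1.0
    if not s or not t:
        return 0.0
    window = max(max(len(s), len(t)) // 2 - 1, 0)
    s_flags, t_flags = [False] * len(s), [False] * len(t)
    matches = 0
    for i, cs in enumerate(s):
        for j in range(max(0, i - window), min(len(t), i + window + 1)):
            if not t_flags[j] and t[j] == cs:
                s_flags[i] = t_flags[j] = True
                matches += 1
                break
    if matches == 0:
        return 0.0
    s_matched = [c for c, flag in zip(s, s_flags) if flag]
    t_matched = [c for c, flag in zip(t, t_flags) if flag]
    transpositions = sum(a != b for a, b in zip(s_matched, t_matched)) // 2
    return (matches / len(s) + matches / len(t)
            + (matches - transpositions) / matches) / 3

def jaro_winkler(s, t, prefix_scale=0.1):
    """Jaro similarity boosted for strings sharing a common prefix (up to 4 characters)."""
    score = jaro(s, t)
    prefix = 0
    for a, b in zip(s[:4], t[:4]):
        if a != b:
            break
        prefix += 1
    return score + prefix * prefix_scale * (1 - score)

print(levenshtein("kitten", "sitting"))              # 3
print(round(jaro("MARTHA", "MARHTA"), 4))            # 0.9444
print(round(jaro_winkler("MARTHA", "MARHTA"), 4))    # 0.9611
```

The last two lines show the effect of the Winkler adjustment: the two strings share the prefix "MAR", so the Jaro-Winkler score is higher than the Jaro score.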

For more information on how to use the tMatchGroup component in standard and Map/Reduce Jobs, see Data matching.

The Simple VSR Matcher and the T-Swoosh algorithms

You can choose between two algorithms when using the tMatchGroup component:
  • Simple VSR Matcher
  • T-Swoosh

For more information about match analyses, see "Create a match rule" on Talend Help Center.

When do records match?

Two records match when the following conditions are met:
  • When using the T-Swoosh algorithm, the score returned for each matching function must be higher than the threshold you set.
  • The global score, computed as a weighted score of the different matching functions, must be higher than the match threshold.
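
The two conditions can be sketched as follows in Python. The function names, the weights and the thresholds are hypothetical, and the global score is assumed to be a weighted average of the scores returned by the matching functions. The exact formula used by tMatchGroup may differ.

```python
def global_score(scores, weights):
    """Weighted average of the per-function scores (assumed formula)."""
    total_weight = sum(weights[name] for name in scores)
    return sum(scores[name] * weights[name] for name in scores) / total_weight

def is_match(scores, weights, match_threshold, function_thresholds=None):
    """Apply the per-function check (when thresholds are given), then the global check."""
    if function_thresholds is not None:
        if any(scores[name] <= function_thresholds[name] for name in scores):
            return False
    return global_score(scores, weights) > match_threshold

# Hypothetical scores returned by three matching functions for one pair of records.
scores = {"first_name": 0.9, "last_name": 0.95, "city": 0.6}
weights = {"first_name": 1, "last_name": 2, "city": 1}

# Global check only.
print(is_match(scores, weights, match_threshold=0.8))
# Per-function check as well: the "city" score is below its threshold.
thresholds = {"first_name": 0.7, "last_name": 0.7, "city": 0.7}
print(is_match(scores, weights, match_threshold=0.8, function_thresholds=thresholds))
```

In this example the pair passes the global check but fails the per-function check, because one attribute scores too low even though the weighted score is high.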

Multiple passes

In general, a single partitioning scheme is not enough and different ones are necessary. This requires using several tMatchGroup components sequentially to match data against different blocking keys.

For an example of how to match data through multiple passes, see Data matching.
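
The benefit of several passes can be illustrated with a simplified Python sketch. It runs two independent passes with two assumed blocking keys (name prefix, then postal code) and merges the resulting pairs into groups. The records and the similarity rule are toys, and chained tMatchGroup components do not necessarily combine their results this way.

```python
from collections import defaultdict
from itertools import combinations

# Hypothetical records: id -> (last_name, postal_code)
records = {
    1: ("Smith", "75001"),
    2: ("Smyth", "75001"),
    3: ("Smith", "69001"),
    4: ("Dupont", "75001"),
}

def similar(name_a, name_b):
    """Toy rule: same length and at most one differing character."""
    return len(name_a) == len(name_b) and sum(x != y for x, y in zip(name_a, name_b)) <= 1

def matched_pairs(key_func):
    """One pass: block on key_func and compare records only within each block."""
    blocks = defaultdict(list)
    for rid, record in records.items():
        blocks[key_func(record)].append(rid)
    pairs = set()
    for ids in blocks.values():
        for a, b in combinations(ids, 2):
            if similar(records[a][0], records[b][0]):
                pairs.add((a, b))
    return pairs

pass1 = matched_pairs(lambda record: record[0][:3].upper())  # block on name prefix
pass2 = matched_pairs(lambda record: record[1])              # block on postal code

# Merge the pairs found by both passes into groups (union-find).
parent = {rid: rid for rid in records}

def find(x):
    while parent[x] != x:
        parent[x] = parent[parent[x]]
        x = parent[x]
    return x

for a, b in pass1 | pass2:
    parent[find(a)] = find(b)

groups = defaultdict(set)
for rid in records:
    groups[find(rid)].add(rid)
result = sorted(sorted(group) for group in groups.values())
print(result)  # [[1, 2, 3], [4]]
```

The first pass links records 1 and 3 but misses record 2, which has a different name prefix. The second pass links records 1 and 2 through the postal code.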

Working with the tRecordMatching component

tRecordMatching joins compared columns from the main flow with reference columns from the lookup flow. According to the matching strategy you define, tRecordMatching outputs the match data, the possible match data and the rejected data. When you define your matching strategy, the matching scores you set are critical in determining the match level of the data of interest.

For more information about the tRecordMatching component, see Data matching.

The machine learning approach

The machine learning approach is useful when you want to match a very high volume of data.

The data matching process can be automated by making a model learn and predict matches.

The data matching process

The advantages of the machine learning approach over the classical approach are the following:

  • The different blocking mechanism permits faster and more scalable computation. In the machine learning approach, blocking is not partitioning: a record can belong to different blocks and the size of the block is clearly delimited, which may not be the case with the tGenKey component.
  • The rules learnt and stored by the machine learning model can be much more complex and less arbitrary than human-designed matching rules.
  • Configuring components is simpler. The machine learning model automatically learns matching distances and similarity thresholds, among other things.

The data matching process is made up of the following steps:

  1. The first step consists of pre-analyzing a data set using the tMatchPairing component. The tMatchPairing component outputs unique records, exact match records, suspect match pairs and a sample of the suspect match pairs.

    For examples of how to compute suspect pairs and write a sample in Talend Data Stewardship and how to compute suspect pairs and a suspect sample from source data, see Matching with machine learning.

  2. The second step consists of labeling the suspect match pairs from the sample as "match" or "no-match" manually. You can leverage Talend Data Stewardship to make the labeling task easier.

    For more information about how to add a Grouping campaign to identify duplicates in a data sample in Talend Data Stewardship, see Matching with machine learning.

    In Talend Data Stewardship, grouping tasks allow authorized data stewards to validate a relationship between pairs or groups of records. The outcome of a grouping task is the list of records associated with each other.

    You can use more than two classes, for example “match”, “potential match” and “different”.

    For more information on handling grouping tasks to decide on relationship among pairs of records in Talend Data Stewardship, see Talend Data Stewardship Examples.

  3. The third step consists of submitting the suspect match pairs you labeled to the tMatchModel component for learning and outputting a classifier model.

    For examples of how to generate a matching model, see Matching with machine learning.

  4. The fourth step consists of labeling suspect pairs for large data sets automatically using the model computed by tMatchModel with the tMatchPredict component.

    For an example of labeling suspect pairs with assigned labels, see Matching with machine learning.

What is a good sample?

The sample should be well-balanced: the number of records in each class, "match" and "no-match", should be approximately the same. An imbalanced data sample yields an unsatisfactory model.

The sample should be diverse: the more diverse the examples in the sample are, the more effective the rules learnt by the model will be.

The sample should be the right size: if you have a large data set with millions of records, then a few hundred or a few thousand examples may be enough. If your data set contains fewer than 10,000 records, then the sample size should be between 1% and 10% of the full data set.
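
The balance and size guidelines can be turned into a quick check, sketched below in Python. The function names and the 20% balance tolerance are assumptions made for the example. The guidelines above only give a percentage for data sets with fewer than 10,000 records, so the sketch returns no range for larger ones.

```python
from collections import Counter

def suggested_sample_size(n_records):
    """Return a (low, high) range for small data sets, or None when no percentage applies."""
    if n_records < 10_000:
        return (max(1, round(n_records * 0.01)), round(n_records * 0.10))
    return None

def is_balanced(labels, tolerance=0.2):
    """True when the smallest class is within `tolerance` of the largest one (assumed rule)."""
    counts = Counter(labels)
    if len(counts) < 2:
        return False
    smallest, largest = min(counts.values()), max(counts.values())
    return (largest - smallest) / largest <= tolerance

print(suggested_sample_size(5000))                          # (50, 500)
print(is_balanced(["match"] * 48 + ["no-match"] * 52))      # True
print(is_balanced(["match"] * 10 + ["no-match"] * 90))      # False
```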

How does tMatchModel generate a model?

The machine learning algorithm computes different measures, which are called features, to get as much information as possible on the defined columns.

To generate the model, tMatchModel analyzes the data using the Random Forest algorithm. A random forest is a collection of decision trees used to solve a classification problem. In a decision tree, each node corresponds to a question about the features associated with the input data. A random forest grows many decision trees to improve the accuracy of the classification and to generate a model.

For more information on data matching on Apache Spark, see Matching with machine learning.
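
The idea of a forest voting on a label can be shown with a toy Python sketch. It is deliberately simplified and is not what tMatchModel runs: each "tree" asks a single question about one randomly chosen feature, each tree is trained on a bootstrap sample, and the forest takes a majority vote. The two features (a name similarity and a city similarity) and the labeled pairs are made up.

```python
import random

def train_stump(samples, feature):
    """Pick the threshold on one feature that misclassifies the fewest samples."""
    best_threshold, best_errors = None, None
    for threshold in sorted({x[feature] for x, _ in samples}):
        errors = sum((x[feature] >= threshold) != bool(y) for x, y in samples)
        if best_errors is None or errors < best_errors:
            best_threshold, best_errors = threshold, errors
    return feature, best_threshold

def train_forest(samples, n_trees=25, seed=0):
    """Train each one-question tree on a bootstrap sample and a random feature."""
    rng = random.Random(seed)
    n_features = len(samples[0][0])
    forest = []
    for _ in range(n_trees):
        bootstrap = [rng.choice(samples) for _ in samples]
        forest.append(train_stump(bootstrap, rng.randrange(n_features)))
    return forest

def predict(forest, x):
    """Majority vote: 1 means "match", 0 means "no-match"."""
    votes = sum(x[feature] >= threshold for feature, threshold in forest)
    return 1 if votes * 2 > len(forest) else 0

# Hypothetical labeled pairs: ([name_similarity, city_similarity], label)
labeled_pairs = [
    ([0.95, 0.90], 1), ([0.90, 1.00], 1), ([0.85, 0.80], 1), ([0.92, 0.95], 1),
    ([0.30, 0.20], 0), ([0.40, 0.10], 0), ([0.20, 0.35], 0), ([0.35, 0.30], 0),
]

forest = train_forest(labeled_pairs)
print(predict(forest, [0.97, 1.00]))  # very similar pair
print(predict(forest, [0.10, 0.05]))  # very different pair
```

A real random forest grows deeper trees that ask several questions in sequence, but the principle is the same: many trees trained on different views of the labeled sample vote on the label of each pair.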

Surviving master records

You can use the tRuleSurvivorship component or Talend Data Stewardship to survive master records.

Merging records using tRuleSurvivorship

Once you have identified duplicates and possible duplicates and grouped them together, you can use the tRuleSurvivorship component to create a single representation for each group of duplicates using the best-of-breed data. This representation is called a survivor.

For an example of how to create a clean data set from the suspect pairs labeled by tMatchPredict and the unique rows computed by tMatchPairing, see Deduplication.
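
To make the notion of a survivor concrete, here is a minimal Python sketch that builds one from a group of duplicates. The records and the three rules (longest name, most recent email, most common phone number) are assumptions chosen for the example. In tRuleSurvivorship, you define your own survivorship rules.

```python
from collections import Counter

# Hypothetical group of duplicates.
group = [
    {"name": "J. Smith", "email": None, "phone": "0102030405", "updated": "2019-01-10"},
    {"name": "John Smith", "email": "john@example.com", "phone": "0102030405", "updated": "2018-06-01"},
    {"name": "John Smith", "email": "jsmith@example.com", "phone": None, "updated": "2019-03-22"},
]

def survivor(records):
    """Build one record from the best value of each field."""
    def non_empty(field):
        return [r[field] for r in records if r[field]]

    # ISO dates sort correctly as plain strings.
    newest_first = sorted(records, key=lambda r: r["updated"], reverse=True)
    return {
        "name": max(non_empty("name"), key=len),                         # longest value
        "email": next(r["email"] for r in newest_first if r["email"]),  # most recent value
        "phone": Counter(non_empty("phone")).most_common(1)[0][0],      # most common value
    }

golden = survivor(group)
print(golden)
```

The survivor does not have to be one of the source records: here the email comes from the most recent record and the phone number from the two others.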

Using Talend Data Stewardship for clerical review and merging records

You can add merging campaigns in Talend Data Stewardship to review and modify survivorship rules, create master records and merge data.

For further information on merging campaigns in Talend Data Stewardship, see Talend Data Stewardship Examples.

In Talend Data Stewardship, data stewards are business users in charge of resolving data stewardship tasks:
  • Classifying data by assigning a label chosen among a predefined list of arbitration choices.
  • Merging several potential duplicate records into one single record.

    Merging tasks allow authorized data stewards to merge several potential duplicate source records into one single record (golden record). The outcome of a merging task is the golden record produced by data stewards.

    For further information on merging tasks in Talend Data Stewardship, see Talend Data Stewardship Examples.

    Source records can come from the same source (database deduplication) or different sources (database reconciliation).

How do you rematch using machine learning components?

Doing continuous matching

If you want to match new records against a clean data set, you do not need to restart the matching process from scratch.

You can index the clean data set and reuse it to do continuous matching.

To be able to perform continuous matching tasks, Elasticsearch version 5.1.2+ must be running.

The continuous matching process is made up of the following steps:

  1. The first step consists of computing suffixes to separate clean and deduplicated records from a data set and indexing them in Elasticsearch using tMatchIndex.

    For an example of how to index data in Elasticsearch using tMatchIndex, see Continuous matching.

  2. The second step consists of comparing the indexed records with new records having the same schema and outputting matching and non-matching records using tMatchIndexPredict. This component uses the pairing and matching models generated by tMatchPairing and tMatchModel.

    For an example of how to match new records against records from a reference data set, see Continuous matching.

You can then clean and deduplicate the non-matching records using tRuleSurvivorship and populate the clean data set indexed in Elasticsearch using tMatchIndex.
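
The overall loop can be sketched in Python as follows. This is only an illustration of the principle: a plain dictionary stands in for Elasticsearch, the suffix-based lookup and the similarity rule are assumptions, no trained pairing or matching model is involved, and the deduplication of the non-matching records is skipped.

```python
from collections import defaultdict

def suffixes(value, min_length=3):
    """All suffixes of the lower-cased value that have at least min_length characters."""
    value = value.lower()
    return {value[i:] for i in range(len(value) - min_length + 1)}

class CleanSetIndex:
    """In-memory stand-in for the indexed clean data set."""
    def __init__(self):
        self.records = {}
        self.by_suffix = defaultdict(set)

    def add(self, rid, name):
        self.records[rid] = name
        for suffix in suffixes(name):
            self.by_suffix[suffix].add(rid)

    def candidates(self, name):
        """Indexed records sharing at least one suffix with the new value."""
        ids = set()
        for suffix in suffixes(name):
            ids |= self.by_suffix.get(suffix, set())
        return ids

def similar(a, b):
    """Toy rule: same length and at most one differing character."""
    a, b = a.lower(), b.lower()
    return len(a) == len(b) and sum(x != y for x, y in zip(a, b)) <= 1

# Step 1: index the clean, deduplicated records.
index = CleanSetIndex()
index.add(1, "Smith")
index.add(2, "Dupont")

# Step 2: compare new records with the indexed ones.
new_records = {10: "Snith", 11: "Martin"}
matching, non_matching = {}, []
for rid, name in new_records.items():
    hits = sorted(c for c in index.candidates(name) if similar(index.records[c], name))
    if hits:
        matching[rid] = hits
    else:
        non_matching.append(rid)

# Then: populate the index with the non-matching records.
for rid in non_matching:
    index.add(rid, new_records[rid])

print(matching)       # {10: [1]}
print(non_matching)   # [11]
```

Only the new records are compared, and only against the indexed records that share a suffix with them, so the clean data set never has to be matched against itself again.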