What is data matching?
General definition
Data matching enables you to:
- Find duplicates, potential duplicates and non-duplicates in a data source;
- Analyze data and return weighted probabilities of matching;
- Merge identical or similar entries into a single entry; and
- Reduce disparity across different data sources.
Record linkage
Record linkage consists of identifying records that refer to the same entity in a dataset. There are two types of record linkage:
- Deterministic record linkage, which is based on identifiers that match; and
- Probabilistic record linkage, which is based on the probability that identifiers match.
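To make the distinction concrete, here is a minimal Java sketch (illustrative only, not Talend-generated code) that compares two hypothetical customer records both ways; the field names, weights and threshold are assumptions:

// Minimal sketch: deterministic vs. probabilistic linkage on two hypothetical records.
import java.util.Map;

public class RecordLinkageSketch {

    // Deterministic linkage: the records match only if a unique identifier is identical.
    static boolean deterministicMatch(Map<String, String> a, Map<String, String> b) {
        return a.get("customerId") != null && a.get("customerId").equals(b.get("customerId"));
    }

    // Probabilistic linkage: combine per-field similarities into a weighted score
    // and compare it to a threshold.
    static boolean probabilisticMatch(Map<String, String> a, Map<String, String> b, double threshold) {
        double score = 0.6 * similarity(a.get("lastName"), b.get("lastName"))
                     + 0.3 * similarity(a.get("firstName"), b.get("firstName"))
                     + 0.1 * similarity(a.get("city"), b.get("city"));
        return score >= threshold;
    }

    // Crude similarity used only to keep the sketch self-contained:
    // 1.0 for equal strings (ignoring case), 0.0 otherwise.
    static double similarity(String x, String y) {
        return x != null && x.equalsIgnoreCase(y) ? 1.0 : 0.0;
    }

    public static void main(String[] args) {
        Map<String, String> a = Map.of("customerId", "42", "firstName", "Jon", "lastName", "Doe", "city", "Paris");
        Map<String, String> b = Map.of("customerId", "57", "firstName", "John", "lastName", "Doe", "city", "Paris");
        System.out.println("Deterministic: " + deterministicMatch(a, b));      // false, identifiers differ
        System.out.println("Probabilistic: " + probabilisticMatch(a, b, 0.7)); // true, score 0.7 >= 0.7
    }
}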
What to do before matching?
Profiling data
Data profiling helps assess the quality level of the data against a defined set of goals.
Data quality issues can stem from many different sources, including legacy systems, data migrations, database modifications, human communication inconsistencies and countless other potential anomalies. Regardless of the source, data quality issues can impact the ability of a business to use its data to make insightful decisions.
If data are of poor quality, or managed in structures that cannot be integrated to meet the needs of the enterprise, business processes and decision-making suffer.
Compared to manual analysis techniques, data profiling technology improves the enterprise's ability to meet the challenge of managing data quality and to address the data quality challenges faced during data migrations and data integrations.
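As an illustration of the kind of indicators a profiling step produces, here is a minimal Java sketch (not a Talend component) that computes a null count, a distinct count and a rough validity check for a single column; the sample values and the email pattern are assumptions:

// Minimal, illustrative profiling sketch for one column of a data set.
import java.util.Arrays;
import java.util.List;
import java.util.Objects;
import java.util.regex.Pattern;

public class ColumnProfileSketch {
    public static void main(String[] args) {
        List<String> emails = Arrays.asList(
                "alice@example.com", "bob@example.com", null, "not-an-email", "alice@example.com");

        long total = emails.size();
        long nulls = emails.stream().filter(Objects::isNull).count();
        long distinct = emails.stream().filter(Objects::nonNull).distinct().count();

        // Very rough pattern check used as a stand-in for a real validation rule.
        Pattern emailPattern = Pattern.compile("^[^@\\s]+@[^@\\s]+\\.[^@\\s]+$");
        long invalid = emails.stream()
                .filter(Objects::nonNull)
                .filter(v -> !emailPattern.matcher(v).matches())
                .count();

        // Indicators like these help decide whether a column is reliable enough
        // to be used as a matching attribute or needs standardization first.
        System.out.printf("rows=%d, nulls=%d, distinct=%d, invalid=%d%n", total, nulls, distinct, invalid);
    }
}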
Standardizing data
- You can standardize data against indices. Synonyms are standardized or converted to the "master" words.
For more information on available data synonym dictionaries, see the Talend Data Fabric Studio User Guide.
- You can use address validation components to standardize address data against
Experian QAS, Loqate and MelissaData validation tools. The addresses returned by
these tools are consistent and variations in address representations are
eliminated. As addresses are standardized, matching gets easier.
For more information on the tQASBatchAddressRow, tLoqateAddressRow and tMelissaDataAddress components, see Address standardization.
- You can use the tStandardizePhoneNumber component to standardize a phone number, based on the formatting convention of the country of origin.
For more information on phone number standardization, see Phone number standardization.
- You can use other, more generic components to transform your data and get more standardized records, such as tReplace, tReplaceList, tVerifyEmail, tExtractRegexFields or tMap. A minimal sketch of this kind of dictionary-based standardization follows this list.
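The sketch below is illustrative only (it is not how the Talend components are implemented): it replaces known synonyms with a "master" form before matching, and the dictionary content is an assumption:

// Minimal, illustrative standardization sketch: map synonyms to a "master" form.
import java.util.Arrays;
import java.util.List;
import java.util.Map;

public class SynonymStandardizationSketch {

    private static final Map<String, String> STREET_SYNONYMS = Map.of(
            "st.", "street",
            "st", "street",
            "ave", "avenue",
            "ave.", "avenue",
            "blvd", "boulevard");

    static String standardize(String value) {
        String[] tokens = value.toLowerCase().trim().split("\\s+");
        StringBuilder out = new StringBuilder();
        for (String token : tokens) {
            // Replace a token by its "master" word when it is a known synonym.
            out.append(STREET_SYNONYMS.getOrDefault(token, token)).append(' ');
        }
        return out.toString().trim();
    }

    public static void main(String[] args) {
        List<String> addresses = Arrays.asList("12 Main St.", "12 main street", "12 MAIN ST");
        // All three variants collapse to "12 main street",
        // which makes the subsequent matching step much easier.
        addresses.forEach(a -> System.out.println(standardize(a)));
    }
}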
How do you match?
The classical matching approach
Blocking by partitions
Blocking consists of sorting data into similarly sized partitions that share the same attribute value. The objective is to restrict comparisons to the records grouped within the same partition.
To create efficient partitions, you need to find attributes that are unlikely to change, such as a person's first name or last name. By doing this, you improve the reliability of the blocking step and the computation speed of the task.
It is recommended to use the tGenKey component to generate blocking keys and to view the distribution of the blocks.
For more information on generating blocking keys, see Identification.
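The following minimal Java sketch illustrates the blocking principle, independently of the algorithm used by tGenKey; the blocking key recipe (initials plus postal code prefix) is an assumption chosen for the example:

// Minimal, illustrative blocking sketch: group records by a simple blocking key so that
// only records sharing the same key are compared with one another.
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class BlockingSketch {

    record Person(String firstName, String lastName, String postalCode) {}

    // Blocking key: first letter of the last name + first letter of the first name
    // + first two characters of the postal code.
    static String blockingKey(Person p) {
        return ("" + p.lastName().charAt(0) + p.firstName().charAt(0)
                + p.postalCode().substring(0, 2)).toUpperCase();
    }

    public static void main(String[] args) {
        List<Person> people = List.of(
                new Person("John", "Doe", "75011"),
                new Person("Jon", "Doe", "75012"),
                new Person("Jane", "Smith", "69001"));

        Map<String, List<Person>> blocks = new HashMap<>();
        for (Person p : people) {
            blocks.computeIfAbsent(blockingKey(p), k -> new ArrayList<>()).add(p);
        }

        // Only records inside the same block are candidates for pairwise comparison,
        // which drastically reduces the number of comparisons on large data sets.
        blocks.forEach((key, members) -> System.out.println(key + " -> " + members));
    }
}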
Choosing metrics and defining matching rules
After blocking data into similarly sized groups, you can create match rules and test them before using them in the tMatchGroup component.
For more information about creating a match analysis, see Talend Data Fabric Studio User Guide.
Matching functions in the tMatchGroup component
tMatchGroup helps you create groups of similar data records in any source of data, including large volumes of data, by using one or several match rules.
- Phonetic algorithms, such as Soundex or Metaphone, are used to match names.
- The Levenshtein distance calculates the minimum number of edits required to transform one string into another.
- The Jaro distance matches processed entries according to spelling deviations.
- The Jaro-Winkler distance is a variant of Jaro giving more importance to the beginning of the string.
For more information on how to use the tMatchGroup component in standard and Map/Reduce Jobs, see Classical matching.
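As an illustration of one of these metrics, here is a minimal Java sketch of the Levenshtein distance, the classical dynamic-programming algorithm, together with a common way of normalizing the distance into a 0..1 similarity score; it is not necessarily the exact implementation or normalization used by tMatchGroup:

// Minimal Levenshtein distance sketch plus a common similarity normalization.
public class LevenshteinSketch {

    static int levenshtein(String a, String b) {
        int[][] d = new int[a.length() + 1][b.length() + 1];
        for (int i = 0; i <= a.length(); i++) d[i][0] = i;
        for (int j = 0; j <= b.length(); j++) d[0][j] = j;
        for (int i = 1; i <= a.length(); i++) {
            for (int j = 1; j <= b.length(); j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                d[i][j] = Math.min(Math.min(d[i - 1][j] + 1,      // deletion
                                            d[i][j - 1] + 1),     // insertion
                                   d[i - 1][j - 1] + cost);       // substitution
            }
        }
        return d[a.length()][b.length()];
    }

    // One common way to turn an edit distance into a similarity score between 0 and 1.
    static double similarity(String a, String b) {
        int maxLen = Math.max(a.length(), b.length());
        return maxLen == 0 ? 1.0 : 1.0 - (double) levenshtein(a, b) / maxLen;
    }

    public static void main(String[] args) {
        System.out.println(levenshtein("Doe", "Dow"));   // 1 edit
        System.out.println(similarity("Jon", "John"));   // 0.75
    }
}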
The Simple VSR Matcher and the T-Swoosh algorithms
You can choose between two matching algorithms:
- Simple VSR Matcher
- T-Swoosh
For more information about match analyses, see "Create a match rule" on Talend Help Center.
When do records match?
Records are considered to match when the following conditions are met:
- When using the T-Swoosh algorithm, the score returned for each matching function must be higher than the threshold you set.
- The global score, computed as a weighted score of the different matching functions, must be higher than the match threshold.
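The following minimal Java sketch shows how such a weighted global score can be computed and compared to a match threshold; the per-function scores, weights and threshold are assumptions, and the exact scoring used by the matching algorithms is the one you configure in the Studio:

// Minimal sketch of a weighted global score for one candidate pair.
import java.util.Map;

public class GlobalScoreSketch {
    public static void main(String[] args) {
        // Per-function scores for one candidate pair, for example Jaro-Winkler on the name
        // and a Levenshtein-based similarity on the city.
        Map<String, Double> scores  = Map.of("name", 0.92, "city", 0.70);
        Map<String, Double> weights = Map.of("name", 3.0,  "city", 1.0);

        double weightedSum = 0.0;
        double totalWeight = 0.0;
        for (String attribute : scores.keySet()) {
            weightedSum += scores.get(attribute) * weights.get(attribute);
            totalWeight += weights.get(attribute);
        }
        double globalScore = weightedSum / totalWeight;   // (0.92*3 + 0.70*1) / 4 = 0.865

        double matchThreshold = 0.85;
        System.out.println("Global score: " + globalScore
                + " -> " + (globalScore >= matchThreshold ? "MATCH" : "NO MATCH"));
    }
}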
Multiple passes
In general, different partitioning schemes are necessary. This requires using tMatchGroup components sequentially to match data against different blocking keys.
For an example of how to match data through multiple passes, see Classical matching.
Working with the tRecordMatching component
tRecordMatching joins compared columns from the main flow with reference columns from the lookup flow. According to the matching strategy you define, tRecordMatching outputs the match data, the possible match data and the rejected data. When you define your matching strategy, the user-defined matching scores are critical for determining the match level of the data of interest.
For more information about the tRecordMatching component, see Classical matching.
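The following minimal Java sketch illustrates the spirit of such a two-threshold strategy, classifying candidate pairs as match, possible match or reject; it is not the component's actual code, and the scores and thresholds are assumptions:

// Minimal sketch of a two-threshold matching strategy (match / possible match / reject).
public class MatchClassificationSketch {

    enum Outcome { MATCH, POSSIBLE_MATCH, REJECT }

    static Outcome classify(double score, double matchThreshold, double possibleMatchThreshold) {
        if (score >= matchThreshold) return Outcome.MATCH;
        if (score >= possibleMatchThreshold) return Outcome.POSSIBLE_MATCH;
        return Outcome.REJECT;
    }

    public static void main(String[] args) {
        double matchThreshold = 0.85;
        double possibleMatchThreshold = 0.60;
        // Scores of main-flow records compared against lookup records.
        double[] pairScores = {0.95, 0.72, 0.30};
        for (double score : pairScores) {
            System.out.println(score + " -> " + classify(score, matchThreshold, possibleMatchThreshold));
        }
    }
}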
The machine learning approach
The machine learning approach is useful when you want to match very high volumes of data.
The data matching process can be automated by making a model learn and predict matches.
The data matching process
The advantages of the machine learning approach over the classical approach are the following:
- The different blocking mechanism permits faster and more scalable computation. In the machine learning approach, blocking is not partitioning: a record can belong to different blocks and the size of the block is clearly delimited, which may not be the case with the tGenKey component.
- The rules learnt and stored by the machine learning model can be much more complex and less arbitrary than human-designed matching rules.
- Configuring components is simpler. The machine learning model automatically learns matching distances and similarity thresholds, among other things.
- The first step consists of pre-analyzing a data set using the tMatchPairing component. Unique records, exact match records, suspect match pairs and a sample of the suspect match pairs are output by the tMatchPairing component.
For more examples, see Computing suspect pairs and writing a sample in Talend Data Stewardship and Computing suspect pairs and suspect sample from source data.
- The second step consists of manually labeling the suspect match pairs from the sample as "match" or "no-match". You can leverage Talend Data Stewardship to make the labeling task easier.
You can use more than two classes, for example “match”, “potential match” and “different”.
For more information on handling grouping tasks to decide on relationship among pairs of records in Talend Data Stewardship, see Talend Data Stewardship Examples.
- The third step consists of submitting the suspect match pairs you labeled to the tMatchModel component for learning and outputting a classifier model.
For examples of how to generate a matching model, see the scenarios on Talend Help Center (https://help.talend.com).
- The fourth step consists of labeling suspect pairs for large data sets automatically, using the model computed by tMatchModel with the tMatchPredict component.
For an example of how to label suspect pairs with assigned labels, see the scenario on Talend Help Center (https://help.talend.com).
What is a good sample?
The sample should be well-balanced: the number of records in each class - "match" and "no match" - should be approximately the same. An imbalanced data sample yields an unsatisfactory model.
The sample should be diverse: the more diverse the examples in the sample are, the more effective the rules learnt by the model will be.
The sample should be the right size: if you have a large data set with millions of records, then a few hundred or thousand examples may be enough. If your data set contains fewer than 10,000 records, then the sample size should be between 1 and 10% of the full data set. For example, for a data set of 6,000 records, a sample of 60 to 600 labeled examples falls within that range.
How does tMatchModel generate a model?
The machine learning algorithm computes different measures, which are called features, to get as much information as possible on the defined columns.
To generate the model, tMatchModel analyzes the data using the Random Forest algorithm. A random forest is a collection of decision trees used to solve a classification problem. In a decision tree, each node corresponds to a question about the features associated with the input data. A random forest grows many decision trees to improve the accuracy of the classification and to generate a model.
For more information on data matching on Apache Spark, see the properties of tMatchModel.
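As an illustration, here is a minimal Apache Spark ML sketch that trains a Random Forest classifier on labeled suspect pairs; it is not the code generated by tMatchModel, and the feature columns (per-attribute similarity scores) and the tiny hard-coded training set are assumptions:

// Minimal, illustrative Spark ML sketch: train a Random Forest on labeled suspect pairs.
import java.util.Arrays;
import java.util.List;

import org.apache.spark.ml.classification.RandomForestClassificationModel;
import org.apache.spark.ml.classification.RandomForestClassifier;
import org.apache.spark.ml.feature.VectorAssembler;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.RowFactory;
import org.apache.spark.sql.SparkSession;
import org.apache.spark.sql.types.DataTypes;
import org.apache.spark.sql.types.Metadata;
import org.apache.spark.sql.types.StructField;
import org.apache.spark.sql.types.StructType;

public class RandomForestMatchSketch {
    public static void main(String[] args) {
        SparkSession spark = SparkSession.builder().master("local[*]").appName("rf-match").getOrCreate();

        // label: 1.0 = match, 0.0 = no match; nameSim / citySim are similarity features.
        StructType schema = new StructType(new StructField[]{
                new StructField("label",   DataTypes.DoubleType, false, Metadata.empty()),
                new StructField("nameSim", DataTypes.DoubleType, false, Metadata.empty()),
                new StructField("citySim", DataTypes.DoubleType, false, Metadata.empty())
        });
        List<Row> labeledPairs = Arrays.asList(
                RowFactory.create(1.0, 0.95, 0.90),
                RowFactory.create(1.0, 0.88, 0.60),
                RowFactory.create(0.0, 0.40, 0.80),
                RowFactory.create(0.0, 0.20, 0.10));
        Dataset<Row> training = spark.createDataFrame(labeledPairs, schema);

        // Assemble the similarity measures into the feature vector expected by the classifier.
        VectorAssembler assembler = new VectorAssembler()
                .setInputCols(new String[]{"nameSim", "citySim"})
                .setOutputCol("features");

        RandomForestClassifier rf = new RandomForestClassifier()
                .setLabelCol("label")
                .setFeaturesCol("features")
                .setNumTrees(20);               // the forest grows 20 decision trees

        RandomForestClassificationModel model = rf.fit(assembler.transform(training));

        // The trained model can now predict "match" (1.0) or "no match" (0.0) for new pairs.
        model.transform(assembler.transform(training))
             .select("label", "prediction")
             .show();

        spark.stop();
    }
}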
Surviving master records
Merging records using tRuleSurvivorship
Once you have estimated duplicates and possible duplicates and grouped them together, you can use the tRuleSurvivorship component to create a single representation of each group of duplicates using the best-of-breed data. This representation is called a survivor.
For an example of how to create a clean data set from the suspect pairs labeled by tMatchPredict and the unique rows computed by tMatchPairing, see Matching with machine learning.
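The following minimal Java sketch illustrates the survivorship idea, building a single survivor from a group of duplicates by applying one simple rule per column; it is not tRuleSurvivorship's implementation, and the rules chosen (most frequent value, longest value) are assumptions:

// Minimal, illustrative survivorship sketch: build a survivor record from a duplicate group.
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;
import java.util.Map;
import java.util.function.Function;
import java.util.stream.Collectors;

public class SurvivorshipSketch {

    record Customer(String name, String address) {}

    // Rule 1: keep the most frequent non-empty value in the group.
    static String mostFrequent(List<Customer> group, Function<Customer, String> getter) {
        return group.stream().map(getter).filter(v -> v != null && !v.isEmpty())
                .collect(Collectors.groupingBy(v -> v, Collectors.counting()))
                .entrySet().stream().max(Map.Entry.comparingByValue())
                .map(Map.Entry::getKey).orElse(null);
    }

    // Rule 2: keep the longest non-null value in the group (often the most complete one).
    static String longest(List<Customer> group, Function<Customer, String> getter) {
        return group.stream().map(getter).filter(v -> v != null)
                .max(Comparator.comparingInt(String::length)).orElse(null);
    }

    public static void main(String[] args) {
        List<Customer> duplicates = Arrays.asList(
                new Customer("John Doe", "12 Main Street"),
                new Customer("John Doe", "12 Main St"),
                new Customer("J. Doe",   "12 Main Street, Springfield"));

        Customer survivor = new Customer(
                mostFrequent(duplicates, Customer::name),   // "John Doe"
                longest(duplicates, Customer::address));    // "12 Main Street, Springfield"
        System.out.println(survivor);
    }
}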
Using Talend Data Stewardship for clerical review and merging records
You can add merging campaigns in Talend Data Stewardship to review and modify survivorship rules, create master records and merge data.
For further information on merging campaigns in Talend Data Stewardship, see Talend Data Stewardship Examples.
In Talend Data Stewardship, data stewards are business users in charge of resolving data stewardship tasks:
- Classifying data by assigning a label chosen from a predefined list of arbitration choices.
- Merging several potential duplicate records into one single record.
Merging tasks allow authorized data stewards to merge several potential duplicate source records into one single record (golden record). The outcome of a merging task is the golden record produced by data stewards.
For further information on merging tasks in Talend Data Stewardship, see Talend Data Stewardship Examples.
Source records can come from the same source (database deduplication) or different sources (databases reconciliation).
How do you rematch using machine learning components?
Doing continuous matching
If you want to match new records against a clean data set, you do not need to restart the matching process from scratch.
You can reuse and index the clean data set to do continuous matching. To be able to perform continuous matching tasks, Elasticsearch version 5.1.2 or later must be running.
The continuous matching process is made up of the following steps:
- The first step consists of computing suffixes to separate clean and deduplicated records from a data set and indexing them in Elasticsearch using tMatchIndex.
For an example of how to index data in Elasticsearch using tMatchIndex, see the scenario on Talend Help Center (https://help.talend.com).
- The second step consists of comparing the indexed records with new records having the same schema and outputting matching and non-matching records using tMatchIndexPredict. This component uses the pairing and matching models generated by tMatchPairing and tMatchModel.
For an example of how to match new records against records from a reference dataset, see the continuous matching scenario on Talend Help Center (https://help.talend.com).
You can then clean and deduplicate the non-matching records using tRuleSurvivorship and populate the clean data set indexed in Elasticsearch using tMatchIndex.
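As an illustration of the indexing step, here is a minimal Java sketch that indexes one clean record in Elasticsearch through its REST API using plain HTTP; it is not what tMatchIndex does internally, and the index name, document id, fields and local Elasticsearch instance are assumptions:

// Minimal, illustrative sketch: index one clean record in Elasticsearch via its REST API.
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

public class IndexCleanRecordSketch {
    public static void main(String[] args) throws Exception {
        // One clean, deduplicated record as a JSON document.
        String document = "{\"firstName\": \"John\", \"lastName\": \"Doe\", \"city\": \"Paris\"}";

        // PUT the document into the assumed "clean_customers" index with id 1.
        HttpRequest request = HttpRequest.newBuilder()
                .uri(URI.create("http://localhost:9200/clean_customers/_doc/1"))
                .header("Content-Type", "application/json")
                .PUT(HttpRequest.BodyPublishers.ofString(document))
                .build();

        HttpResponse<String> response = HttpClient.newHttpClient()
                .send(request, HttpResponse.BodyHandlers.ofString());

        // A 200/201 response means the record is indexed and can later be searched
        // when new records are matched against the clean data set.
        System.out.println(response.statusCode() + " " + response.body());
    }
}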
Exact matching
Exact matching considers two records an exact match when a subset of their attributes is identical.
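The following minimal Java sketch illustrates exact matching on a subset of attributes, splitting an input flow into unique and duplicate records, which is the principle behind keyed deduplication components such as tUniqRow; the choice of key columns and the case-insensitive normalization are assumptions:

// Minimal sketch of exact matching on a subset of attributes (keyed deduplication).
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

public class ExactMatchSketch {

    record Person(String firstName, String lastName, String city) {}

    // Composite key built only from the attributes chosen for exact matching.
    static String key(Person p, boolean caseSensitive) {
        String k = p.firstName() + "|" + p.lastName();
        return caseSensitive ? k : k.toLowerCase();
    }

    public static void main(String[] args) {
        List<Person> input = List.of(
                new Person("John", "Doe", "Paris"),
                new Person("JOHN", "DOE", "Lyon"),
                new Person("Jane", "Smith", "Nantes"));

        Set<String> seen = new HashSet<>();
        List<Person> uniques = new ArrayList<>();
        List<Person> duplicates = new ArrayList<>();
        for (Person p : input) {
            // add() returns false when the key was already seen: the record is a duplicate.
            (seen.add(key(p, false)) ? uniques : duplicates).add(p);
        }
        System.out.println("Uniques:    " + uniques);
        System.out.println("Duplicates: " + duplicates);
    }
}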
Component
tUniqRow
Ensures data quality of input or output flow in a Job.
tUniqRow compares entries and sorts out duplicate entries from the input flow.
Depending on the Talend product you are using, this component can be used in one, some or all of the following Job frameworks:
- Standard: see tUniqRow Standard properties.
  The component in this framework is available in all Talend products.
- MapReduce: see tUniqRow MapReduce properties (deprecated).
  The component in this framework is available in all subscription-based Talend products with Big Data and Talend Data Fabric.
- Spark Batch: see tUniqRow properties for Apache Spark Batch.
  The component in this framework is available in all subscription-based Talend products with Big Data and Talend Data Fabric.
- Spark Streaming: see tUniqRow properties for Apache Spark Streaming.
  This component is available in Talend Real Time Big Data Platform and Talend Data Fabric.
tUniqRow Standard properties
These properties are used to configure tUniqRow running in the Standard Job framework.
The Standard tUniqRow component belongs to the Data Quality family.
The component in this framework is available in all Talend products.
Basic settings
Schema and Edit schema
A schema is a row description. It defines the number of fields (columns) to be processed and passed on to the next component. When you create a Spark Job, avoid the reserved word line when naming the fields. Click Edit schema to make changes to the schema. If the current schema is of the Repository type, three options are available.
This component offers the advantage of the dynamic schema feature. This allows you to retrieve unknown columns from source files or to copy batches of columns from a source without mapping each column individually. For further information about dynamic schemas, see Talend Studio User Guide. This dynamic schema feature is designed for retrieving unknown columns of a table and is recommended to be used for this purpose only; it is not recommended for creating tables.
- Built-In: You create and store the schema locally for this component only.
- Repository: You have already created the schema and stored it in the Repository. You can reuse it in various projects and Job designs.
Unique key
In this area, select one or more columns to carry out deduplication on the particular column(s):
- Select the Key attribute check box to carry out deduplication on all the columns.
- Select the Case sensitive check box to differentiate upper case and lower case.
Advanced settings
Only once each duplicated key
Select this check box if you want to have only the first duplicated entry in the column(s) defined as key(s) sent to the output flow for duplicates.
Use of disk (suitable for processing large row set)
Select this check box to enable generating temporary files on the hard disk when processing a large amount of data. This helps to prevent Job execution failure caused by memory overflow. With this check box selected, you also need to define:
- Buffer size in memory: Select the number of rows that can be buffered in the memory before a temporary file is generated on the hard disk.
- Directory for temp files: Set the location where the temporary files should be stored.
Warning: Make sure that you specify an existing directory for temporary files; otherwise your Job execution will fail.
Ignore trailing zeros for BigDecimal
Select this check box to ignore trailing zeros for BigDecimal data.
tStatCatcher Statistics
Select this check box to gather the Job processing metadata at a Job level as well as at each component level.
Global Variables
Global Variables
- NB_UNIQUES: the number of unique rows. This is an After variable and it returns an integer.
- NB_DUPLICATES: the number of duplicate rows. This is an After variable and it returns an integer.
- ERROR_MESSAGE: the error message generated by the component when an error occurs. This is an After variable and it returns a string. This variable functions only if the Die on error check box is cleared, if the component has this check box.
A Flow variable functions during the execution of a component while an After variable functions after the execution of the component. To fill up a field or expression with a variable, press Ctrl + Space to access the variable list and choose the variable to use from it.
For further information about variables, see Talend Studio User Guide.
Usage
Usage rule
This component handles a flow of data; therefore it requires input and output components and is defined as an intermediary step.
tUniqRow properties for Apache Spark Batch
These properties are used to configure tUniqRow running in the Spark Batch Job framework.
The Spark Batch tUniqRow component belongs to the Processing family.
The component in this framework is available in all subscription-based Talend products with Big Data and Talend Data Fabric.
Basic settings
Schema and Edit schema
A schema is a row description. It defines the number of fields (columns) to be processed and passed on to the next component. When you create a Spark Job, avoid the reserved word line when naming the fields. Click Edit schema to make changes to the schema. If the current schema is of the Repository type, three options are available.
- Built-In: You create and store the schema locally for this component only.
- Repository: You have already created the schema and stored it in the Repository. You can reuse it in various projects and Job designs.
Unique key
In this area, select one or more columns to carry out deduplication on the particular column(s):
- Select the Key attribute check box to carry out deduplication on all the columns.
- Select the Case sensitive check box to differentiate upper case and lower case.
Advanced settings
Only once each duplicated key
Select this check box if you want to have only the first duplicated entry in the column(s) defined as key(s) sent to the output flow for duplicates.
Usage
Usage rule
This component is used as an intermediate step. This component, along with the Spark Batch component Palette it belongs to, appears only when you are creating a Spark Batch Job. Note that in this documentation, unless otherwise explicitly stated, a scenario presents only Standard Jobs, that is to say traditional Talend data integration Jobs.
Spark Connection
In the Spark Configuration tab in the Run view, define the connection to a given Spark cluster for the whole Job. In addition, since the Job expects its dependent jar files for execution, you must specify the directory in the file system to which these jar files are transferred so that Spark can access these files.
This connection is effective on a per-Job basis.
tUniqRow properties for Apache Spark Streaming
These properties are used to configure tUniqRow running in the Spark Streaming Job framework.
The Spark Streaming tUniqRow component belongs to the Processing family.
This component is available in Talend Real Time Big Data Platform and Talend Data Fabric.
Basic settings
Schema and Edit schema
A schema is a row description. It defines the number of fields (columns) to be processed and passed on to the next component. When you create a Spark Job, avoid the reserved word line when naming the fields. Click Edit schema to make changes to the schema. If the current schema is of the Repository type, three options are available.
- Built-In: You create and store the schema locally for this component only.
- Repository: You have already created the schema and stored it in the Repository. You can reuse it in various projects and Job designs.
Unique key
In this area, select one or more columns to carry out deduplication on the particular column(s):
- Select the Key attribute check box to carry out deduplication on all the columns.
- Select the Case sensitive check box to differentiate upper case and lower case.
Advanced settings
Only once each duplicated key
Select this check box if you want to have only the first duplicated entry in the column(s) defined as key(s) sent to the output flow for duplicates.
Usage
Usage rule
This component is used as an intermediate step. This component, along with the Spark Streaming component Palette it belongs to, appears only when you are creating a Spark Streaming Job. Note that in this documentation, unless otherwise explicitly stated, a scenario presents only Standard Jobs, that is to say traditional Talend data integration Jobs.
Spark Connection
In the Spark Configuration tab in the Run view, define the connection to a given Spark cluster for the whole Job. In addition, since the Job expects its dependent jar files for execution, you must specify the directory in the file system to which these jar files are transferred so that Spark can access these files.
This connection is effective on a per-Job basis.
tUniqRow MapReduce properties (deprecated)
These properties are used to configure tUniqRow running in the MapReduce Job framework.
The MapReduce tUniqRow component belongs to the Data Quality family.
The component in this framework is available in all subscription-based Talend products with Big Data and Talend Data Fabric.
The MapReduce framework is deprecated from Talend 7.3 onwards. Use Talend Jobs for Apache Spark to accomplish your integration tasks.
Basic settings
Schema and Edit schema
A schema is a row description. It defines the number of fields (columns) to be processed and passed on to the next component. When you create a Spark Job, avoid the reserved word line when naming the fields. Click Edit schema to make changes to the schema. If the current schema is of the Repository type, three options are available.
This component offers the advantage of the dynamic schema feature. This allows you to retrieve unknown columns from source files or to copy batches of columns from a source without mapping each column individually. For further information about dynamic schemas, see Talend Studio User Guide. This dynamic schema feature is designed for retrieving unknown columns of a table and is recommended to be used for this purpose only; it is not recommended for creating tables.
- Built-In: You create and store the schema locally for this component only.
- Repository: You have already created the schema and stored it in the Repository. You can reuse it in various projects and Job designs.
Unique key
In this area, select one or more columns to carry out deduplication on the particular column(s):
- Select the Key attribute check box to carry out deduplication on all the columns.
- Select the Case sensitive check box to differentiate upper case and lower case.
Advanced settings
Only once each duplicated key
Select this check box if you want to have only the first duplicated entry in the column(s) defined as key(s) sent to the output flow for duplicates.
Ignore trailing zeros for BigDecimal
Select this check box to ignore trailing zeros for BigDecimal data.
Global Variables
Global Variables
- ERROR_MESSAGE: the error message generated by the component when an error occurs. This is an After variable and it returns a string. This variable functions only if the Die on error check box is cleared, if the component has this check box.
A Flow variable functions during the execution of a component while an After variable functions after the execution of the component. To fill up a field or expression with a variable, press Ctrl + Space to access the variable list and choose the variable to use from it.
For further information about variables, see Talend Studio User Guide.
Usage
Usage rule
In a Talend Map/Reduce Job, this component is used as an intermediate step and other components used along with it must be Map/Reduce components, too. They generate native Map/Reduce code that can be executed directly in Hadoop. For further information about a Talend Map/Reduce Job, see the sections describing how to create, convert and configure a Talend Map/Reduce Job in the Talend Open Studio for Big Data Getting Started Guide. For a scenario demonstrating a Map/Reduce Job using this component, see Deduplicating entries using Map/Reduce components. Note that in this documentation, unless otherwise explicitly stated, a scenario presents only Standard Jobs, that is to say traditional Talend data integration Jobs, and not Map/Reduce Jobs.
Scenarios
Deduplicating entries
In this five-component Job, we will sort entries on an input name list, find out duplicated names, and display the unique names and the duplicated names on the Run console.
Setting up the Job
Procedure
- Drop a tFileInputDelimited, a tSortRow, a tUniqRow, and two tLogRow components from the Palette to the design workspace, and name the components as shown above.
- Connect the tFileInputDelimited component, the tSortRow component, and the tUniqRow component using Row > Main connections.
- Connect the tUniqRow component and the first tLogRow component using a Main > Uniques connection.
- Connect the tUniqRow component and the second tLogRow component using a Main > Duplicates connection.
Configuring the components
Procedure
Saving and executing the Job
Procedure
Deduplicating entries based on dynamic schema
This scenario applies only to Talend Data Management Platform, Talend Big Data Platform, Talend Real Time Big Data Platform, Talend Data Services Platform, Talend MDM Platform and Talend Data Fabric.
In this use case, we will use a Job similar to the one in the scenario described earlier to deduplicate the input entries about several families, so that only one person per family stays on the name list. As all the components in this Job support the dynamic schema feature, we will leverage this feature to save the time of configuring individual columns of the schemas.
Setting up the Job
Procedure
- Drop these components from the Palette to the design workspace: tFileInputDelimited, tExtractDynamicFields, tUniqRow, tFileOutputDelimited, and tLogRow, and name the components as shown above to better identify their roles in the Job.
- Connect the component labelled People, the component labelled Split_Column, and the component labelled Deduplicate using Row > Main connections.
- Connect the component labelled Deduplicate and the component labelled Unique_Families using a Main > Uniques connection.
- Connect the component labelled Deduplicate and the component labelled Duplicated_Families using a Main > Duplicates connection.
Configuring the components
Procedure
Saving and executing the Job
Procedure
Deduplicating entries using Map/Reduce components
The MapReduce framework is deprecated from Talend 7.3 onwards. Use Talend Jobs for Apache Spark to accomplish your integration tasks.
This scenario applies only to subscription-based Talend Platform products with Big Data and Talend Data Fabric.
This scenario illustrates how to create a Talend Map/Reduce Job to deduplicate entries, that is to say, to use Map/Reduce components to generate Map/Reduce code and run the Job right in Hadoop.
Note that the Talend Map/Reduce components are available to subscription-based Big Data users only and this scenario can be replicated only with Map/Reduce components.
The sample input data used in this scenario reads as follows:
1;Harry;Ford;68;Albany
2;Franklin;Wilson;79;Juneau
3;Ulysses;Roosevelt;25;Harrisburg
4;Harry;Ford;48;Olympia
5;Martin;Reagan;75;Columbia
6;Woodrow;Roosevelt;63;Harrisburg
7;Grover;McKinley;98;Atlanta
8;John;Taft;93;Montpelier
9;Herbert;Johnson;85;Lincoln
10;Grover;McKinley;33;Lansing
Since Talend Studio allows you to convert a Job between its Map/Reduce and Standard (Non Map/Reduce) versions, you can convert the scenario explained earlier to create this Map/Reduce Job. This way, many components used can keep their original settings so as to reduce your workload in designing this Job.
Before starting to replicate this scenario, ensure that you have appropriate rights and permissions to access the Hadoop distribution to be used. Then proceed as follows: