Data matching with Talend tools - Cloud - 8.0

What is data matching?

Data matching is the process of finding records that represent the same entity in a dataset.

General definition

Data matching enables you to:
  • Find duplicates, potential duplicates and non-duplicates in a data source
  • Analyze data and return weighted matching probabilities
  • Merge identical or similar entries into a single entry
  • Reduce disparity across different data sources

Record linkage

Record linkage consists of identifying records that refer to the same entity in a dataset.

Two types of record linkage exist:
  • Deterministic record linkage, which is based on identifiers that match exactly
  • Probabilistic record linkage, which is based on the probability that identifiers match
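
To make the distinction concrete, here is a minimal Java sketch; the identifier, field similarities, weights and threshold are illustrative assumptions, not values taken from Talend:

// Hedged sketch: deterministic vs. probabilistic record linkage.
// The field names, weights and the 0.8 threshold are illustrative only.
public class LinkageSketch {

    // Deterministic: the records link only if an identifier matches exactly.
    static boolean deterministicLink(String idA, String idB) {
        return idA != null && idA.equals(idB);
    }

    // Probabilistic: each identifier contributes a similarity in [0, 1];
    // the records link when the combined score exceeds a threshold.
    static boolean probabilisticLink(double nameSim, double addressSim) {
        double score = 0.6 * nameSim + 0.4 * addressSim;
        return score > 0.8;
    }
}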

What to do before matching?

Profiling data

Data profiling is the process of examining the data available in different data sources and collecting statistics and information about this data.

Data profiling helps assess the quality level of the data against a defined set of goals.

Data quality issues can stem from many different sources, including legacy systems, data migrations, database modifications, human communication inconsistencies and countless other potential anomalies. Regardless of the source, data quality issues can impact the ability of a business to use its data to make insightful decisions.

If data is of poor quality, or is managed in structures that cannot be integrated to meet the needs of the enterprise, business processes and decision-making suffer.

Compared to manual analysis techniques, data profiling technology improves an enterprise's ability to meet the challenge of managing data quality and to address the data quality challenges faced during data migrations and data integrations.

Standardizing data

Standardizing data before trying to perform matching tasks is an essential step to improve matching accuracy.
Talend provides different ways to standardize data:
  • You can standardize data against indices. Synonyms are standardized or converted to the "master" words.

    For more information on available data synonym dictionaries, see the Talend Data Fabric Studio User Guide.

  • You can use address validation components to standardize address data against Experian QAS, Loqate and MelissaData validation tools. The addresses returned by these tools are consistent and variations in address representations are eliminated. As addresses are standardized, matching gets easier.

    For more information on the tQASBatchAddressRow, tLoqateAddressRow and tMelissaDataAddress components, see Address standardization.

  • You can use the tStandardizePhoneNumber component to standardize a phone number, based on the formatting convention of the country of origin (a schematic sketch follows this list).

    For more information on phone number standardization, see Phone number standardization.

  • You can use other more generic components to transform your data and get more standardized records, such as tReplace, tReplaceList, tVerifyEmail, tExtractRegexFields or tMap.
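
As a concrete illustration of the phone-number bullet above: a library commonly used for this kind of normalization is Google's libphonenumber. The following minimal sketch is independent of the component's actual implementation and assumes the library is on the classpath; the sample number and country code are made up:

import com.google.i18n.phonenumbers.NumberParseException;
import com.google.i18n.phonenumbers.PhoneNumberUtil;
import com.google.i18n.phonenumbers.PhoneNumberUtil.PhoneNumberFormat;
import com.google.i18n.phonenumbers.Phonenumber.PhoneNumber;

public class PhoneStandardizer {
    public static void main(String[] args) throws NumberParseException {
        PhoneNumberUtil util = PhoneNumberUtil.getInstance();
        // Parse a raw entry against its country of origin (here, France).
        PhoneNumber number = util.parse("01 23 45 67 89", "FR");
        // Reformat to the international E.164 convention: +33123456789
        System.out.println(util.format(number, PhoneNumberFormat.E164));
    }
}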

How do you match?

The classical matching approach

The classical approach consists of sorting data into similar-sized partitions that share the same attribute value, choosing metrics and defining matching rules.

Blocking by partitions

Record linkage is a demanding task because each record must be compared to all the other records in the data set. To keep this task efficient, blocking is a required step most of the time.

Blocking consists of sorting data into similar-sized partitions whose records share the same attribute value. The objective is to restrict comparisons to the records grouped within the same partition.

To create efficient partitions, you need to find attributes that are unlikely to change, such as a person's first name or last name. By doing this, you improve the reliability of the blocking step and the computation speed of the task.

It is recommended to use the tGenKey component to generate blocking keys and to view the distribution of the blocks.

For more information on generating blocking keys, see Identification.
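
To make the idea concrete, here is a minimal Java sketch of blocking; the key (first letter of the last name plus first letter of the first name) is deliberately simplistic compared to the key algorithms tGenKey offers:

import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class BlockingSketch {
    record Person(String firstName, String lastName) {}

    // Illustrative blocking key: first letter of each name, upper-cased.
    static String blockingKey(Person p) {
        return ("" + p.lastName().charAt(0) + p.firstName().charAt(0)).toUpperCase();
    }

    public static void main(String[] args) {
        List<Person> records = List.of(
                new Person("John", "Doe"), new Person("Jon", "Doe"),
                new Person("Mary", "Smith"));
        Map<String, List<Person>> blocks = new HashMap<>();
        for (Person p : records) {
            blocks.computeIfAbsent(blockingKey(p), k -> new ArrayList<>()).add(p);
        }
        // Pairwise comparisons are now restricted to records sharing a key,
        // e.g. the two Does end up in block "DJ" and Mary Smith in block "SM".
        System.out.println(blocks);
    }
}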

Choosing metrics and defining matching rules

After blocking data into similar-sized groups, you can create match rules and test them before using them in the tMatchGroup component.

For more information about creating a match analysis, see Talend Data Fabric Studio User Guide.

Matching functions in the tMatchGroup component

tMatchGroup helps you create groups of similar data records in any source of data including large volumes of data by using one or several match rules.

Each created group is made up of a master record and records similar to this master record. The matching functions used to compute similarity measures between similar records and the master record include the following:
  • Phonetic algorithms, such as Soundex or Metaphone, are used to match names.
  • The Levenshtein distance calculates the minimum number of edits required to transform one string into another.
  • The Jaro distance matches processed entries according to spelling deviations.
  • The Jaro-Winkler distance is a variant of Jaro giving more importance to the beginning of the string.

For more information on how to use the tMatchGroup component in standard and Map/Reduce Jobs, see Classical matching.
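
As an illustration of the Levenshtein distance listed above, here is a standard dynamic-programming implementation in Java (a sketch, not Talend's internal code):

public class Levenshtein {
    // Minimum number of single-character insertions, deletions and
    // substitutions needed to turn s into t.
    static int distance(String s, String t) {
        int[] prev = new int[t.length() + 1];
        int[] curr = new int[t.length() + 1];
        for (int j = 0; j <= t.length(); j++) prev[j] = j;
        for (int i = 1; i <= s.length(); i++) {
            curr[0] = i;
            for (int j = 1; j <= t.length(); j++) {
                int cost = s.charAt(i - 1) == t.charAt(j - 1) ? 0 : 1;
                curr[j] = Math.min(Math.min(curr[j - 1] + 1, prev[j] + 1),
                        prev[j - 1] + cost);
            }
            int[] tmp = prev; prev = curr; curr = tmp;
        }
        return prev[t.length()];
    }

    public static void main(String[] args) {
        System.out.println(distance("Kourtney", "Courtney")); // 1
        System.out.println(distance("John", "Jon"));          // 1
    }
}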

The Simple VSR Matcher and the T-Swoosh algorithms

You can choose between two algorithms when using the tMatchGroup component:
  • Simple VSR Matcher
  • T-Swoosh

For more information about match analyses, see "Create a match rule" on Talend Help Center.

When do records match?

Two records match when the following conditions are met:
  • When using the T-Swoosh algorithm, the score returned for each matching function must be higher than the threshold you set.
  • The global score, computed as a weighted score of the different matching functions, must be higher than the match threshold.
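
A minimal Java sketch of how such a weighted global score can be combined from individual matching-function scores; the scores, weights and threshold below are illustrative assumptions:

public class MatchScore {
    // Combine per-function scores in [0, 1] into a weighted global score.
    static double globalScore(double[] scores, double[] weights) {
        double weighted = 0, total = 0;
        for (int i = 0; i < scores.length; i++) {
            weighted += scores[i] * weights[i];
            total += weights[i];
        }
        return weighted / total;
    }

    public static void main(String[] args) {
        double[] scores  = {0.95, 0.70}; // e.g. Jaro-Winkler on name, exact match on city
        double[] weights = {2.0, 1.0};   // name counts twice as much as city
        double matchThreshold = 0.85;
        // (0.95 * 2 + 0.70 * 1) / 3 = 0.87, so this pair is a match here.
        System.out.println(globalScore(scores, weights) > matchThreshold); // true
    }
}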

Multiple passes

In general, several different partitioning schemes are necessary. This requires chaining tMatchGroup components sequentially to match data against different blocking keys.

For an example of how to match data through multiple passes, see Classical matching.

Working with the tRecordMatching component

tRecordMatching joins compared columns from the main flow with reference columns from the lookup flow. According to the matching strategy you define, tRecordMatching outputs the match data, the possible match data and the rejected data. When designing your matching strategy, the user-defined matching scores are critical in determining the match level of the data of interest.

For more information about the tRecordMatching component, see Classical matching.

The machine learning approach

The machine learning approach is useful when you want to match very high volumes of data.

The data matching process can be automated by making a model learn and predict matches.

The data matching process

The advantages of the machine learning approach over the classical approach are the following:

  • The different blocking mechanism permits faster and more scalable computation. In the machine learning approach, blocking is not partitioning: a record can belong to several blocks, and the size of each block is clearly delimited, which may not be the case with the tGenKey component.
  • The rules learnt and stored by the machine learning model can be much more complex and less arbitrary than human-designed matching rules.
  • Configuring components is simpler. The machine learning model automatically learns matching distances and similarity thresholds, among other things.

The machine learning matching process consists of the following steps:

  1. The first step consists of pre-analyzing a data set using the tMatchPairing component, which outputs unique records, exact match records, suspect match pairs and a sample of the suspect match pairs.

    For more examples, see Computing suspect pairs and writing a sample in Talend Data Stewardship and Computing suspect pairs and suspect sample from source data.

  2. The second step consists of labeling the suspect match pairs from the sample as "match" or "no-match" manually. You can leverage Talend Data Stewardship to make the labeling task easier.

    You can use more than two classes, for example "match", "potential match" and "different".

    For more information on handling grouping tasks to decide on relationship among pairs of records in Talend Data Stewardship, see Talend Data Stewardship Examples.

  3. The third step consists of submitting the suspect match pairs you labeled to the tMatchModel component for learning and outputting a classifier model.

    For examples of how to generate a matching model, see the scenarios on Talend Help Center (https://help.talend.com).

  4. The fourth step consists of labeling suspect pairs for large data sets automatically using the model computed by tMatchModel with the tMatchPredict component.

    For an example of how to label suspect pairs with assigned labels, see the scenario on Talend Help Center (https://help.talend.com).

What is a good sample?

The sample should be well-balanced: the number of records in each class ("match" and "no match") should be approximately the same. An imbalanced data sample yields an unsatisfactory model.

The sample should be diverse: the more diverse the examples in the sample are, the more effective the rules learnt by the model will be.

The sample should be the right size: if you have a large data set with millions of records, a few hundred or a few thousand examples may be enough. If your data set contains fewer than 10,000 records, the sample size should be between 1 and 10% of the full data set.

How does tMatchModel generate a model?

The machine learning algorithm computes different measures, called features, to get as much information as possible about the defined columns.

To generate the model, tMatchModel analyzes the data using the Random Forest algorithm. A random forest is a collection of decision trees used to solve a classification problem. In a decision tree, each node corresponds to a question about the features associated with the input data. A random forest grows many decision trees to improve the accuracy of the classification and to generate a model.

For more information on data matching on Apache Spark, see the properties of tMatchModel.
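
As a loose illustration of the underlying idea, the following Java sketch trains a random-forest classifier on labeled pair features with Spark ML. The file name, feature columns and parameters are assumptions made for the example; they are not the features tMatchModel actually computes:

import org.apache.spark.ml.classification.RandomForestClassificationModel;
import org.apache.spark.ml.classification.RandomForestClassifier;
import org.apache.spark.ml.feature.VectorAssembler;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

public class PairClassifierSketch {
    public static void main(String[] args) {
        SparkSession spark = SparkSession.builder()
                .appName("pair-classifier").master("local[*]").getOrCreate();

        // Labeled suspect pairs: one row per pair, one similarity measure per
        // column, label 1.0 = "match", 0.0 = "no-match" (illustrative schema).
        Dataset<Row> pairs = spark.read().option("header", "true")
                .option("inferSchema", "true").csv("labeled_pairs.csv");

        // Assemble the per-column similarities into a single feature vector.
        VectorAssembler assembler = new VectorAssembler()
                .setInputCols(new String[]{"nameSim", "addressSim", "dobSim"})
                .setOutputCol("features");

        // A random forest grows many decision trees and votes on the class.
        RandomForestClassifier rf = new RandomForestClassifier()
                .setLabelCol("label").setFeaturesCol("features").setNumTrees(50);

        RandomForestClassificationModel model = rf.fit(assembler.transform(pairs));
        model.write().overwrite().save("matching_model"); // reusable for prediction
        spark.stop();
    }
}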

Surviving master records

You can use the tRuleSurvivorship component or Talend Data Stewardship to survive master records.

Merging records using tRuleSurvivorship

Once duplicates and possible duplicates have been grouped together, you can use the tRuleSurvivorship component to create a single representation of each group of duplicates using the best-of-breed data. This representation is called a survivor.

For an example of how to create a clean data set from the suspect pairs labeled by tMatchPredict and the unique rows computed by tMatchPairing, see Matching with machine learning.

Using Talend Data Stewardship for clerical review and merging records

You can add merging campaigns in Talend Data Stewardship to review and modify survivorship rules, create master records and merge data.

For further information on merging campaigns in Talend Data Stewardship, see Talend Data Stewardship Examples.

In Talend Data Stewardship, data stewards are business users in charge of resolving data stewardship tasks:
  • Classifying data by assigning a label chosen among a predefined list of arbitration choices.
  • Merging several potential duplicate records into one single record.

    Merging tasks allow authorized data stewards to merge several potential duplicate source records into one single record, called the golden record.

    For further information on merging tasks in Talend Data Stewardship, see Talend Data Stewardship Examples.

    Source records can come from the same source (database deduplication) or from different sources (database reconciliation).

How do you rematch using machine learning components?

Doing continuous matching

If you want to match new records against a clean data set, you do not need to restart the matching process from scratch.

You can index the clean data set in Elasticsearch and reuse it to do continuous matching.

To be able to perform continuous matching tasks, Elasticsearch version 5.1.2 or later must be running.

The continuous matching process is made up of the following steps:

  1. The first step consists of indexing the clean and deduplicated records from a data set in Elasticsearch, using tMatchIndex.

    For an example of how to index data in Elasticsearch using tMatchIndex, see the scenario on Talend Help Center (https://help.talend.com).

  2. The second step consists of comparing the indexed records with new records having the same schema and outputting matching and non-matching records using tMatchIndexPredict. This component uses the pairing and matching models generated by tMatchPairing and tMatchModel.

    For an example of how to match new records against records from a reference data set, see the scenario on Talend Help Center (https://help.talend.com).

You can then clean and deduplicate the non-matching records using tRuleSurvivorship and populate the clean data set indexed in Elasticsearch using tMatchIndex.
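
For orientation only: indexing a clean record in Elasticsearch comes down to a document PUT against its REST API, which tMatchIndex handles for you. Below is a minimal Java sketch using the JDK HTTP client; the index name, document and 5.x-style URL are made up for the example:

import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

public class IndexRecordSketch {
    public static void main(String[] args) throws Exception {
        String doc = "{\"FirstName\":\"Brad\",\"City\":\"Los Angeles\"}";
        HttpRequest request = HttpRequest.newBuilder()
                // Elasticsearch 5.x-style URL: /index/type/id
                .uri(URI.create("http://localhost:9200/cleanset/record/1"))
                .header("Content-Type", "application/json")
                .PUT(HttpRequest.BodyPublishers.ofString(doc))
                .build();
        HttpResponse<String> response = HttpClient.newHttpClient()
                .send(request, HttpResponse.BodyHandlers.ofString());
        System.out.println(response.statusCode()); // 201 when the document is created
    }
}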

Exact matching

Exact matching considers two records to be an exact match when a subset of their attributes is identical.

Component

tUniqRow

Ensures data quality of input or output flow in a Job.

tUniqRow compares entries and sorts out duplicate entries from the input flow.

This component is not shipped with your Talend Studio by default. You need to install it using the Feature Manager. For more information, see Installing features using the Feature Manager.

Depending on the Talend product you are using, this component can be used in one, some or all of the following Job frameworks: Standard, Spark Batch and Spark Streaming.
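
In essence, tUniqRow performs a key-based deduplication that keeps the first occurrence of each key and routes the rest to a duplicates flow. A simplified Java sketch of that behavior (not the component's actual implementation; the key choice is illustrative):

import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

public class UniqRowSketch {
    public static void main(String[] args) {
        // Input rows; the unique key here is the first field.
        List<String[]> rows = List.of(
                new String[]{"Paul", "New York"},
                new String[]{"Paul", "Chicago"},
                new String[]{"Mary", "Chicago"});

        Set<String> seen = new HashSet<>();
        List<String[]> uniques = new ArrayList<>();    // Uniques output flow
        List<String[]> duplicates = new ArrayList<>(); // Duplicates output flow
        for (String[] row : rows) {
            // Comparison is case sensitive; lower-casing the key here would
            // mimic clearing the Case sensitive check box.
            if (seen.add(row[0])) {
                uniques.add(row);
            } else {
                duplicates.add(row);
            }
        }
        System.out.println(uniques.size() + " unique, " + duplicates.size() + " duplicate");
    }
}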

tUniqRow Standard properties

These properties are used to configure tUniqRow running in the Standard Job framework.

The Standard tUniqRow component belongs to the Data Quality family.

The component in this framework is available in all Talend products.

Basic settings

Schema and Edit schema

A schema is a row description. It defines the number of fields (columns) to be processed and passed on to the next component. When you create a Spark Job, avoid the reserved word line when naming the fields.

Click Edit schema to make changes to the schema. If the current schema is of the Repository type, three options are available:

  • View schema: choose this option to view the schema only.

  • Change to built-in property: choose this option to change the schema to Built-in for local changes.

  • Update repository connection: choose this option to change the schema stored in the repository and decide whether to propagate the changes to all the Jobs upon completion.

    If you just want to propagate the changes to the current Job, you can select No upon completion and choose this schema metadata again in the Repository Content window.

This component offers the advantage of the dynamic schema feature. This allows you to retrieve unknown columns from source files or to copy batches of columns from a source without mapping each column individually. For further information about dynamic schemas, see Talend Studio User Guide.

This dynamic schema feature is designed for the purpose of retrieving unknown columns of a table and is recommended to be used for this purpose only; it is not recommended for creating tables.

 

Built-In: You create and store the schema locally for this component only.

 

Repository: You have already created the schema and stored it in the Repository. You can reuse it in various projects and Job designs.

Unique key

In this area, select one or more columns to carry out deduplication on the particular column(s):

- Select the Key attribute check box to carry out deduplication on all the columns.

- Select the Case sensitive check box to differentiate upper case and lower case.

Advanced settings

Only once each duplicated key

Select this check box to send only the first duplicate entry in the column(s) defined as key(s) to the output flow for duplicates.

Use of disk (suitable for processing large row set)

Select this check box to enable generating temporary files on the hard disk when processing a large amount of data. This helps to prevent Job execution failure caused by memory overflow. With this check box selected, you also need to define:

- Buffer size in memory: Select the number of rows that can be buffered in the memory before a temporary file is to be generated on the hard disk.

- Directory for temp files: Set the location where the temporary files should be stored.

Warning:

Make sure that you specify an existing directory for temporary files; otherwise your Job execution will fail.

Ignore trailing zeros for BigDecimal

Select this check box to ignore trailing zeros for BigDecimal data.

tStatCatcher Statistics

Select this check box to gather the Job processing metadata at a Job level as well as at each component level.

Global Variables

NB_UNIQUES: the number of unique rows. This is an After variable and it returns an integer.

NB_DUPLICATES: the number of duplicate rows. This is an After variable and it returns an integer.

ERROR_MESSAGE: the error message generated by the component when an error occurs. This is an After variable and it returns a string. This variable functions only if the Die on error check box is cleared, if the component has this check box.

A Flow variable functions during the execution of a component while an After variable functions after the execution of the component.

To fill in a field or expression with a variable, press Ctrl+Space to access the variable list and choose the variable to use from it.

For more information about variables, see Talend Studio User Guide.

Usage

Usage rule

This component handles a flow of data; it requires an input and an output and is therefore defined as an intermediary step.

tUniqRow properties for Apache Spark Batch

These properties are used to configure tUniqRow running in the Spark Batch Job framework.

The Spark Batch tUniqRow component belongs to the Processing family.

The component in this framework is available in all subscription-based Talend products with Big Data and Talend Data Fabric.

Basic settings

Schema and Edit schema

A schema is a row description. It defines the number of fields (columns) to be processed and passed on to the next component. When you create a Spark Job, avoid the reserved word line when naming the fields.

Click Edit schema to make changes to the schema. If the current schema is of the Repository type, three options are available:

  • View schema: choose this option to view the schema only.

  • Change to built-in property: choose this option to change the schema to Built-in for local changes.

  • Update repository connection: choose this option to change the schema stored in the repository and decide whether to propagate the changes to all the Jobs upon completion.

    If you just want to propagate the changes to the current Job, you can select No upon completion and choose this schema metadata again in the Repository Content window.

 

Built-In: You create and store the schema locally for this component only.

 

Repository: You have already created the schema and stored it in the Repository. You can reuse it in various projects and Job designs.

Unique key

In this area, select one or more columns to carry out deduplication on the particular column(s):

- Select the Key attribute check box to carry out deduplication on all the columns.

- Select the Case sensitive check box to differentiate upper case and lower case.

Advanced settings

Only once each duplicated key

Select this check box to send only the first duplicate entry in the column(s) defined as key(s) to the output flow for duplicates.

Usage

Usage rule

This component is used as an intermediate step.

This component, along with the Spark Batch component Palette it belongs to, appears only when you are creating a Spark Batch Job.

Note that in this documentation, unless otherwise explicitly stated, a scenario presents only Standard Jobs, that is to say traditional Talend data integration Jobs.

Spark Connection

In the Spark Configuration tab in the Run view, define the connection to a given Spark cluster for the whole Job. In addition, since the Job expects its dependent jar files for execution, you must specify the directory in the file system to which these jar files are transferred so that Spark can access these files:
  • Yarn mode (Yarn client or Yarn cluster):
    • When using Google Dataproc, specify a bucket in the Google Storage staging bucket field in the Spark configuration tab.

    • When using HDInsight, specify the blob to be used for Job deployment in the Windows Azure Storage configuration area in the Spark configuration tab.

    • When using Altus, specify the S3 bucket or the Azure Data Lake Storage for Job deployment in the Spark configuration tab.
    • When using on-premises distributions, use the configuration component corresponding to the file system your cluster is using. Typically, this system is HDFS and so use tHDFSConfiguration.

  • Standalone mode: use the configuration component corresponding to the file system your cluster is using, such as tHDFSConfiguration or tS3Configuration.

    If you are using Databricks without any configuration component present in your Job, your business data is written directly in DBFS (Databricks Filesystem).

This connection is effective on a per-Job basis.

tUniqRow properties for Apache Spark Streaming

These properties are used to configure tUniqRow running in the Spark Streaming Job framework.

The Spark Streaming tUniqRow component belongs to the Processing family.

This component is available in Talend Real Time Big Data Platform and Talend Data Fabric.

Basic settings

Schema and Edit schema

A schema is a row description. It defines the number of fields (columns) to be processed and passed on to the next component. When you create a Spark Job, avoid the reserved word line when naming the fields.

Click Edit schema to make changes to the schema. If the current schema is of the Repository type, three options are available:

  • View schema: choose this option to view the schema only.

  • Change to built-in property: choose this option to change the schema to Built-in for local changes.

  • Update repository connection: choose this option to change the schema stored in the repository and decide whether to propagate the changes to all the Jobs upon completion.

    If you just want to propagate the changes to the current Job, you can select No upon completion and choose this schema metadata again in the Repository Content window.

 

Built-In: You create and store the schema locally for this component only.

 

Repository: You have already created the schema and stored it in the Repository. You can reuse it in various projects and Job designs.

Unique key

In this area, select one or more columns to carry out deduplication on the particular column(s):

- Select the Key attribute check box to carry out deduplication on all the columns.

- Select the Case sensitive check box to differentiate upper case and lower case.

Advanced settings

Only once each duplicated key

Select this check box to send only the first duplicate entry in the column(s) defined as key(s) to the output flow for duplicates.

Usage

Usage rule

This component is used as an intermediate step.

This component, along with the Spark Streaming component Palette it belongs to, appears only when you are creating a Spark Streaming Job.

Note that in this documentation, unless otherwise explicitly stated, a scenario presents only Standard Jobs, that is to say traditional Talend data integration Jobs.

Spark Connection

In the Spark Configuration tab in the Run view, define the connection to a given Spark cluster for the whole Job. In addition, since the Job expects its dependent jar files for execution, you must specify the directory in the file system to which these jar files are transferred so that Spark can access these files:
  • Yarn mode (Yarn client or Yarn cluster):
    • When using Google Dataproc, specify a bucket in the Google Storage staging bucket field in the Spark configuration tab.

    • When using HDInsight, specify the blob to be used for Job deployment in the Windows Azure Storage configuration area in the Spark configuration tab.

    • When using Altus, specify the S3 bucket or the Azure Data Lake Storage for Job deployment in the Spark configuration tab.
    • When using on-premises distributions, use the configuration component corresponding to the file system your cluster is using. Typically, this system is HDFS and so use tHDFSConfiguration.

  • Standalone mode: use the configuration component corresponding to the file system your cluster is using, such as tHDFSConfiguration or tS3Configuration.

    If you are using Databricks without any configuration component present in your Job, your business data is written directly in DBFS (Databricks Filesystem).

This connection is effective on a per-Job basis.

Scenarios

Deduplicating entries

In this five-component Job, we will sort entries from an input name list, identify duplicated names, and display the unique names and the duplicated names on the Run tab.

Setting up the Job

Procedure

  1. Drop a tFileInputDelimited, a tSortRow, a tUniqRow, and two tLogRow components from the Palette to the design workspace.
  2. Optional: To rename the components, double-click them.
  3. Connect the tFileInputDelimited component, the tSortRow component, and the tUniqRow component using Row > Main connections.
  4. Connect the tUniqRow component and the first tLogRow component using a Main > Uniques connection.
  5. Connect the tUniqRow component and the second tLogRow component using a Main > Duplicates connection.

Configuring the components

Procedure

  1. Double-click the tFileInputDelimited component to display its Basic settings view.
  2. Click the [...] button next to the File Name/Stream field to browse to your input file.
  3. Define the header and footer rows.
    In this example, the first row of the input file is the header row.
  4. Click Edit schema to define the schema for this component.
    1. Click the [+] button to add columns. In this example, the input file has five columns: Id, FirstName, LastName, Age, and City.
    2. Click OK.
  5. Double-click the tSortRow component to display its Basic settings view.
  6. To rearrange the entries in the alphabetic order of the names, add two rows in the Criteria table by clicking the [+] button:
    1. Select the FirstName and LastName columns under Schema column.
    2. Select alpha as the sorting type.
    3. Select asc as the sorting order.
  7. Double-click the tUniqRow component to display its Basic settings view.
  8. In the Unique key area, select the columns on which you want deduplication to be carried out.
    In this example, you will sort out duplicated names.
  9. In the Basic settings view of the tLogRow components, select the Table option to view the Job execution result in table mode.

Saving and executing the Job

Procedure

  1. Save the Job.
  2. Run the Job by pressing F6 or clicking the Run button on the Run tab.
    The unique names and duplicated names are displayed in different tables.

Deduplicating entries based on dynamic schema

This scenario applies only to Talend Data Management Platform, Talend Big Data Platform, Talend Real Time Big Data Platform, Talend Data Services Platform, Talend MDM Platform and Talend Data Fabric.

In this example, we will use a Job similar to the one in the scenario described earlier to deduplicate the input entries about several families, so that only one person per family stays on the name list. As all the components in this Job support the dynamic schema feature, we will leverage this feature to save the time of configuring each column of the schemas individually.

Setting up the Job

Procedure

  1. Drop these components from the Palette to the design workspace: tFileInputDelimited, tExtractDynamicFields, tUniqRow, tFileOutputDelimited, and tLogRow, and name the components to better identify their roles in the Job.
  2. Optional: To rename the components, double-click them.
  3. Connect the tFileInputDelimited component, the tExtractDynamicFields component, and the tUniqRow component using Row > Main connections.
  4. Connect the tUniqRow component and the tFileOutputDelimited component using a Main > Uniques connection.
  5. Connect the tUniqRow component and the tLogRow component using a Main > Duplicates connection.

Configuring the components

Procedure

  1. Double-click the tFileInputDelimited component to display its Basic settings view.
    Important: The dynamic schema feature is only supported in Built-In mode and requires the input file to have a header row.
  2. Click the [...] button next to the File Name/Stream field to browse to your input file.
  3. Define the header and footer rows.
    In this example, the first row of the input file is the header row.
  4. Click Edit schema to define the schema for this component.
    In this example, the input file has five columns: FirstName, LastName, HouseNo, Street, and City. However, as we can leverage the dynamic schema feature, we simply define one dynamic column in the schema, Dyna in this example.
    1. Add a new line by clicking the [+] button.
    2. Type Dyna in the Column field.
    3. Select Dynamic from the Type list.
    4. Click OK.
  5. Double-click the tExtractDynamicFields component to display its Basic settings view.
    We will use this component to split the dynamic column of the input schema into two columns, one for the first name and the other for the family-related information. To do so:
    1. Click Edit schema to open the Schema dialog box.
    2. In the output panel, click the [+] button to add two columns for the output schema, and name them FirstName and FamilyInfo respectively.
    3. Select String from the Type list for the FirstName column.
      This will extract this column from the input schema to carry the first name of each person on the name list.
    4. Select Dynamic from the Type list for the FamilyInfo column.
      This column will carry the rest of the information about each person on the name list: the last name, house number, street and city, which together identify a family.
    5. Click OK to propagate the schema and close the Schema dialog box.
  6. Double-click the tUniqRow component to display its Basic settings view.
  7. In the Unique key area, select the Key attribute check box for the FamilyInfo column.
    This will carry out deduplication on the family information.
  8. Double-click the tFileOutputDelimited component to display its Basic settings view.
  9. Define the output file path and select the Include header check box.
  10. Leave the other settings as they are.
  11. In the Basic settings view of the tLogRow component, select the Table option to view the Job execution result in table mode.

Saving and executing the Job

Procedure

  1. Save the Job.
  2. Run the Job by pressing F6 or clicking the Run button on the Run tab.
    The output file is created with the information about the duplicated families. Only one person per family is in the list.

Fuzzy matching

Fuzzy matching enables you to identify records that match partially.

Components

tFuzzyMatch

Compares a column from the main flow with a reference column from the lookup flow and outputs the main flow data displaying the distance.

This component is not shipped with your Talend Studio by default. You need to install it using the Feature Manager. For more information, see Installing features using the Feature Manager.

tFuzzyMatch Standard properties

These properties are used to configure tFuzzyMatch running in the Standard Job framework.

The Standard tFuzzyMatch component belongs to the Data Quality family.

The component in this framework is available in all Talend products.

Basic settings

Schema and Edit schema

A schema is a row description; it defines the number of fields to be processed and passed on to the next component. The schema is either Built-in or stored remotely in the Repository.

Two read-only columns, Value and Match, are added to the output schema automatically.

 

Built-in: The schema will be created and stored locally for this component only. Related topic: see Talend Studio User Guide.

 

Repository: The schema already exists and is stored in the Repository, hence can be reused in various projects and Job designs. Related topic: see Talend Studio User Guide.

Matching type

Select the relevant matching algorithm among:

Levenshtein: Based on the edit distance theory. It calculates the number of insertions, deletions or substitutions required for an entry to match the reference entry.

Metaphone: Based on a phonetic algorithm for indexing entries by their pronunciation. It first loads the phonetics of all entries of the lookup reference and checks all entries of the main flow against the entries of the reference flow. It does not support Chinese characters.

Double Metaphone: a new version of the Metaphone phonetic algorithm that produces more accurate results than the original algorithm. It can return both a primary and a secondary code for a string. This accounts for some ambiguous cases as well as for multiple variants of surnames with common ancestry. It does not support Chinese characters.

Min distance

(Levenshtein only) Set the minimum number of changes allowed to match the reference. If set to 0, only perfect matches are returned.

Max distance

(Levenshtein only) Set the maximum number of changes allowed to match the reference.

Matching column

Select the column of the main flow that needs to be checked against the reference (lookup) key column.

Unique matching

Select this check box if you want to get the best match possible, in case several matches are available.

Matching item separator

If several matches are available, all of them are displayed unless the Unique matching check box is selected. Define the delimiter between matches.
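
Putting these settings together, the component's behavior amounts to comparing each main-flow value with every lookup value and keeping the references whose distance falls within [Min distance, Max distance]. A schematic Java sketch, meant to be compiled together with the Levenshtein class from the earlier sketch (the data and distances are illustrative):

import java.util.ArrayList;
import java.util.List;

public class FuzzyLookupSketch {
    // Schematic equivalent of tFuzzyMatch with the Levenshtein matching type:
    // keep every lookup entry whose edit distance to the main entry lies
    // within [minDistance, maxDistance], joined by the item separator.
    static String matches(String mainValue, List<String> lookup,
                          int minDistance, int maxDistance, String separator) {
        List<String> hits = new ArrayList<>();
        for (String ref : lookup) {
            int d = Levenshtein.distance(mainValue, ref); // see the earlier sketch
            if (d >= minDistance && d <= maxDistance) hits.add(ref);
        }
        return String.join(separator, hits);
    }

    public static void main(String[] args) {
        List<String> reference = List.of("Brad", "Jason", "Courtney", "Jon");
        System.out.println(matches("Kourtney", reference, 1, 2, "|")); // Courtney
        System.out.println(matches("John", reference, 1, 2, "|"));     // Jon
    }
}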

Advanced settings

tStatCatcher Statistics

Select this check box to collect log data at the component level.

Global Variables

NB_LINE: the number of rows read by an input component or transferred to an output component. This is an After variable and it returns an integer.

ERROR_MESSAGE: the error message generated by the component when an error occurs. This is an After variable and it returns a string. This variable functions only if the Die on error check box is cleared, if the component has this check box.

A Flow variable functions during the execution of a component while an After variable functions after the execution of the component.

To fill in a field or expression with a variable, press Ctrl+Space to access the variable list and choose the variable to use from it.

For more information about variables, see Talend Studio User Guide.

Usage

Usage rule

This component is not startable (green background) and it requires two input components and an output component.

tFuzzyUniqRow

Compares columns in the input flow by using a defined matching method and collects the encountered duplicates.

This component is not shipped with your Talend Studio by default. You need to install it using the Feature Manager. For more information, see Installing features using the Feature Manager.

tFuzzyUniqRow Standard properties

These properties are used to configure tFuzzyUniqRow running in the Standard Job framework.

The Standard tFuzzyUniqRow component belongs to the Data Quality family.

This component is available in Talend Data Management Platform, Talend Big Data Platform, Talend Real Time Big Data Platform, Talend Data Services Platform, Talend MDM Platform and Talend Data Fabric.

Basic settings

Schema and Edit schema

A schema is a row description; it defines the number of fields to be processed and passed on to the next component. The schema is either Built-in or stored remotely in the Repository.

 

Built-in: You create and store the schema locally for this component only. Related topic: see Talend Studio User Guide.

 

Repository: You have already created the schema and stored it in the Repository. Thus, you can reuse it in various projects and job designs. Related topic: see Talend Studio User Guide.

Column

List of all columns in the input flow.

Key attribute

Select the check boxes next to the columns you want to check.

Matching type

Select the relevant matching algorithm from the list:

Exact Match: matches each processed entry to all possible reference entries with exactly the same value.

Levenshtein: Based on the edit distance theory. It calculates the number of insertions, deletions or substitutions required for an entry to match the reference entry.

Metaphone: Based on a phonetic algorithm for indexing entries by their pronunciation. It first loads the phonetics of all entries of the lookup reference and checks all entries of the main flow against the entries of the reference flow. It does not support Chinese characters.

Double Metaphone: a new version of the Metaphone phonetic algorithm that produces more accurate results than the original algorithm. It can return both a primary and a secondary code for a string. This accounts for some ambiguous cases as well as for multiple variants of surnames with common ancestry. It does not support Chinese characters.

Min. Distance

Only for Levenshtein. Set the minimum number of changes allowed to match the reference. If set to 0, only perfect matches (Exact Match) are returned.

Max. Distance

Only for Levenshtein. Set the maximum number of changes allowed to match the reference.

Advanced settings

tStatCatcher Statistics

Select this check box to collect log data at the component level.

Global Variables

NB_UNIQUES: the number of unique rows. This is an After variable and it returns an integer.

NB_DUPLICATES: the number of duplicate rows. This is an After variable and it returns an integer.

ERROR_MESSAGE: the error message generated by the component when an error occurs. This is an After variable and it returns a string. This variable functions only if the Die on error check box is cleared, if the component has this check box.

A Flow variable functions during the execution of a component while an After variable functions after the execution of the component.

To fill in a field or expression with a variable, press Ctrl+Space to access the variable list and choose the variable to use from it.

For more information about variables, see Talend Studio User Guide.

Usage

Usage rule

This component is not startable (green background) and it requires an input component and two output components.

tBlockedFuzzyJoin

Helps ensure the data quality of any source data against a reference data source.

tBlockedFuzzyJoin joins two tables by doing a fuzzy match on several columns. It compares columns from the main flow with reference columns from the lookup flow and outputs the match data, the possible match data and the rejected data.

tBlockedFuzzyJoin Standard properties

These properties are used to configure tBlockedFuzzyJoin running in the Standard Job framework.

The Standard tBlockedFuzzyJoin component belongs to the Data Quality family.

This component is available in Talend Data Management Platform, Talend Big Data Platform, Talend Real Time Big Data Platform, Talend Data Services Platform, Talend MDM Platform and Talend Data Fabric.

Basic settings

Schema and Edit Schema

A schema is a row description; it defines the number of fields to be processed and passed on to the next component. The schema is either Built-in or stored remotely in the Repository.

 

Built-in: You create and store the schema locally for this component only. Related topic: see Talend Studio User Guide.

 

Repository: You have already created and stored the schema in the Repository. You can reuse it in other projects and Job designs. Related topic: see Talend Studio User Guide.

Replace output column with lookup column if matches or possible matches

Select this check box to replace the output column with the lookup column in case of match or possible match values.

Input key attribute

Select the column(s) from the main flow that needs to be checked against the reference (lookup) key column.

Lookup key attribute

Select the lookup key columns that you will use as a reference against which to compare the columns from the input flow.

Matching type

Select the relevant matching algorithm from the list:

Exact Match: matches each processed entry to all possible reference entries that have exactly the same value.

Levenshtein: Based on the edit distance theory. It calculates the number of insertions, deletions or substitutions required for an entry to match the reference entry.

Metaphone: Based on a phonetic algorithm for indexing entries by their pronunciation. It first loads the phonetics of all entries of the lookup reference and checks all entries of the main flow against the entries of the reference flow.

Double Metaphone: a new version of the Metaphone phonetic algorithm that produces more accurate results than the original algorithm. It can return both a primary and a secondary code for a string. This accounts for some ambiguous cases as well as for multiple variants of surnames with common ancestry.

Case sensitive

Select this check box to consider the letter case.

Min. distance

Only for Levenshtein. Set the minimum number of changes allowed to match the reference. If set to 0, only perfect matches (Exact Match) are returned.

Note:

You can create and store context variables for the minimum and maximum distances and then have your Job loop on these values, starting from a low maximum distance to match rows and going up to a higher maximum distance to match more possible rows. You can press Ctrl+Space to access the variable list and select the new context variables. For more information about context variables, see Talend Studio User Guide.

Max. distance

Only for Levenshtein. Set the maximum number of changes allowed to match the reference.

Advanced settings

tStatCatcher Statistics

Select this check box to collect log data at the component level.

Global Variables

ERROR_MESSAGE: the error message generated by the component when an error occurs. This is an After variable and it returns a string. This variable functions only if the Die on error check box is cleared, if the component has this check box.

A Flow variable functions during the execution of a component while an After variable functions after the execution of the component.

To fill in a field or expression with a variable, press Ctrl+Space to access the variable list and choose the variable to use from it.

For more information about variables, see Talend Studio User Guide.

Usage

Usage rule

This component is deprecated; use the tRecordMatching component instead.

This component is not startable (green background) and it requires two input components and one or more output components.

tFuzzyJoin

Joins two tables by doing a fuzzy match on several columns, comparing columns from the main flow with reference columns from the lookup flow and outputting the main flow data and/or the rejected data.

tFuzzyJoin Standard properties

These properties are used to configure tFuzzyJoin running in the Standard Job framework.

The Standard tFuzzyJoin component belongs to the Data Quality family.

This component is available in Talend Data Management Platform, Talend Big Data Platform, Talend Real Time Big Data Platform, Talend Data Services Platform, Talend MDM Platform and Talend Data Fabric.

Basic settings

Schema and Edit Schema

A schema is a row description; it defines the number of fields to be processed and passed on to the next component. The schema is either Built-in or stored remotely in the Repository.

 

Built-in: You create and store the schema locally for this component only. Related topic: see Talend Studio User Guide.

 

Repository: You have already created and stored the schema in the Repository. You can reuse it in other projects and Job designs. Related topic: see Talend Studio User Guide.

Include lookup columns in output

Select this check box to include the lookup columns you define in the output flow.

Input key attribute

Select the column(s) from the main flow that needs to be checked against the reference (lookup) key column.

Lookup key attribute

Select the lookup key columns that you will use as a reference against which to compare the columns from the input flow.

Matching type

Select the relevant matching algorithm from the list:

Exact Match: matches each processed entry to all possible reference entries with exactly the same value.

Levenshtein: Based on the edit distance theory. It calculates the number of insertions, deletions or substitutions required for an entry to match the reference entry.

Metaphone: Based on a phonetic algorithm for indexing entries by their pronunciation. It first loads the phonetics of all entries of the lookup reference and checks all entries of the main flow against the entries of the reference flow.

Double Metaphone: a new version of the Metaphone phonetic algorithm that produces more accurate results than the original algorithm. It can return both a primary and a secondary code for a string. This accounts for some ambiguous cases as well as for multiple variants of surnames with common ancestry.

Case sensitive

Select this check box to consider the letter case.

Min. distance

Only for Levenshtein. Set the minimum number of changes allowed to match the reference. If set to 0, only perfect matches (Exact Match) are returned.

Note:

You can create and store context variables for the minimum and maximum distances in order to start from a low maximum distance to match rows and go up to a higher maximum distance to match more possible rows. You can press Ctrl+Space to access the variable list and select the new context variables. For more information about context variables, see Talend Studio User Guide.

Max. distance

Only for Levenshtein. Set the maximum number of changes allowed to match the reference.

Inner join (with reject output)

Select this check box to join the two tables first and gather the rejected data from the main flow.

Advanced settings

tStatCatcher Statistics

Select this check box to collect log data at the component level.

Global Variables

ERROR_MESSAGE: the error message generated by the component when an error occurs. This is an After variable and it returns a string. This variable functions only if the Die on error check box is cleared, if the component has this check box.

A Flow variable functions during the execution of a component while an After variable functions after the execution of the component.

To fill in a field or expression with a variable, press Ctrl+Space to access the variable list and choose the variable to use from it.

For more information about variables, see Talend Studio User Guide.

Usage

Usage rule

This component is deprecated; use the tRecordMatching component instead.

This component is not startable and it requires two input components and one or more output components.

Scenarios

Checking the Levenshtein distance of 0 in first names

This scenario describes a four-component Job that checks the edit distance between the FirstName column of an input file and the data of a reference input file. The output of this Levenshtein check is displayed along with the content of the main flow in a table.

Setting up the Job

Procedure

  1. Drag and drop the following components from the Palette to the design workspace: two tFixedFlowInput components, tFuzzyMatch, tLogRow.
  2. Link the first tFixedFlowInput component to the tFuzzyMatch component using a Row > Main connection.
  3. Link the second tFixedFlowInput component to the tFuzzyMatch using a Row > Main connection (which appears as a Lookup row on the design workspace).
  4. Link the tFuzzyMatch component to the standard output tLogRow using a Row > Main connection.

Configuring the components

Procedure

  1. Define the first tFixedFlowInput in its Basic settings view.
    In this example, you use the following input:
    FirstName;Name
    Brad;Los angeles
    Jason;New York
    Margaret;
    Kourtney;Seattle
    Nicole;Saint-Louis
    John;Denver
  2. Define the schema of the component. In this example, the input schema has two columns: FirstName and City.
  3. Define the second tFixedFlowInput.
    In this example, you use the following input:
    FirstName;City
    Brad;Los Angeles
    Jason;New York
    Margaret;Dallas
    Courtney;Seattle
    Nicole;Saint-Louis
    Jon;Denver
  4. Set the reference column as the key column in the schema of the lookup flow.
  5. Double-click the tFuzzyMatch component to open its Basic settings view, and check its schema.
    The Schema should match the Main input flow schema in order for the main flow to be checked against the reference.
    Note that two columns, Value and Matching, are added to the output schema. They hold standard matching information and are read-only.
  6. Select the method to be used to check the incoming data. In this example, Levenshtein is the Matching type to be used.
  7. Set the distance.
    In this method, the distance is the number of character changes (insertion, deletion or substitution) needed for the entry to fully match the reference.
    In this example, you set both the minimum distance and the maximum distance to 0. This means only the exact matches will be output.
  8. Clear the Case sensitive check box.
  9. Select the matching column and the lookup column, the first name in this example.
  10. Leave the other parameters as default.

Executing the Job

Procedure

Save the Job and press F6 to execute it.
Brad|Los angeles|0|Brad
Jason|New York|0|Jason
Margaret||0|Margaret
Kourtney|Seattle||
Nicole|Saint-Louis|0|Nicole
John|Denver||

Results

As the edit distance has been set to 0 (both min and max), the output shows the result of a regular join between the main flow and the lookup (reference) flow; hence only full matches with a Value of 0 are displayed.

A more obvious example uses a minimum distance of 1 and a maximum distance of 2; see the next scenario.

Checking the Levenshtein distance of 1 or 2 in first names

This scenario is based on the scenario described above. Only the minimum and maximum distance settings in the tFuzzyMatch component are modified, which will change the output displayed.

Procedure

  1. In the Component view of the tFuzzyMatch component, change the minimum distance from 0 to 1. This immediately excludes exact matches (which would show a distance of 0).
  2. Also change the maximum distance to 2. The output will provide all matching entries showing a discrepancy of two characters at most.
    No other changes are required.
  3. Define the Matching item separator field, as several references might match the main flow entry.
  4. Save the new Job and press F6 to run it.
    FirstName|Name||
    Brad|Los angeles||
    Jason|New York|2|Jon
    Margaret|||
    Kourtney|Seattle|1|Courtney
    Nicole|Saint-Louis||
    John|Denver|1|Jon
    As the edit distance has been set to 2, some entries of the main flow match more than one reference entry.

Results

You can also use another method, Metaphone, to assess the distance between the main flow and the reference, as described in the next scenario.

Checking the Metaphonic distance in first name

Procedure

  1. Change the Matching type to Metaphone.
    There is no minimum or maximum distance to set, as the matching method is based on discrepancies with the phonetics of the reference.
  2. Save the Job and press F6. The phonetics value is displayed along with the possible matches.
    Brad|Los angeles|BRT|Brad
    Jason|New York|JSN|Jason
    Margaret||MRKR|Margaret
    Kourtney|Seattle|KRTN|Courtney
    Nicole|Saint-Louis|NKL|Nicole
    John|Denver|JN|Jon
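
Phonetic codes like the ones above come from the Metaphone rule set, which is fairly involved. As a simpler illustration of the same phonetic-encoding idea, here is a compact Java implementation of classic Soundex (Soundex, not Metaphone, and not Talend's code; the H/W subtleties of full Soundex are omitted):

public class SoundexSketch {
    // Classic American Soundex, simplified: first letter plus three digits.
    static String soundex(String s) {
        String codes = "01230120022455012623010202"; // digit for A-Z, 0 = ignored
        StringBuilder out = new StringBuilder();
        char prev = '0';
        for (char c : s.toUpperCase().toCharArray()) {
            if (c < 'A' || c > 'Z') continue;
            char code = codes.charAt(c - 'A');
            if (out.length() == 0) {
                out.append(c);        // keep the first letter itself
            } else if (code != '0' && code != prev) {
                out.append(code);     // skip vowels and repeated codes
            }
            prev = code;
            if (out.length() == 4) break;
        }
        while (out.length() < 4) out.append('0');
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(soundex("Kourtney")); // K635
        System.out.println(soundex("Courtney")); // C635: same digits, so the
                                                 // two names sound alike
    }
}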

Comparing four columns using different matching methods and collecting encountered duplicates

This scenario applies only to Talend Data Management Platform, Talend Big Data Platform, Talend Real Time Big Data Platform, Talend Data Services Platform, Talend MDM Platform and Talend Data Fabric.

This scenario describes a four-component Job that collects, in two separate files, all unique entries and all duplicate entries from a few defined columns, based on the Levenshtein and Double Metaphone matching types.

The input file in this example looks like the following:

ID;Status;FirstName;Email;City;Initial;ZipCode
1;married;Paul;pnewman@comp.com;New York;P.N.;55677
2;single;Raul;rnewman@comp.com;New Ork;R.N.;55677
3;single;Mary;mnewman@comp.com;Chicago;M.N;66898

Setting up the Job

Procedure

  1. Drop tFileInputDelimited, tFuzzyUniqRow, and two tFileOutputExcel components from the Palette onto the design workspace.
  2. Connect tFileInputDelimited to tFuzzyUniqRow using a Main link, and tFuzzyUniqRow to the two tFileOutputExcel components using the Uniques link for one and the Duplicates link for the other.

Configuring the components

Procedure

  1. Double-click tFileInputDelimited to open its Basic settings view and define its properties.
  2. Click the [...] button next to the File Name field to browse to the file holding the input data.
  3. If needed, set Header, Footer, and Limit.
    For this scenario, set Header to 1. Footer and the limit for the number of processed rows are not set.
  4. Click Edit schema to open a dialog box where you can describe the data structure of the source delimited file.
    In this scenario, the source schema is made of the following columns: ID, Status, FirstName, Email, City, Initial, and ZipCode.
  5. Double-click tFuzzyUniqRow to display its Basic settings view and define its properties.
  6. In the Key Attribute column, select the check boxes next to the columns you want to check using the defined matching methods, FirstName, Email, City, and ZipCode in this example.
  7. In the Matching Type column, set the matching methods you want to use on each of the selected columns.
    In this example, Levenshtein is to be used as the matching method for the FirstName, Email, and ZipCode columns, and Double Metaphone is to be used as the matching method for the City column.
    Then set the minimum and maximum distances for the Levenshtein method. In this method, the distance is the number of character changes (insertion, deletion or substitution) needed for the entry to fully match the reference. In this example, we want the min. distance to be 0 and the max. distance to be 2. This outputs all entries in the FirstName, Email, and ZipCode columns that match exactly or that have at most two character changes. There is no minimum or maximum distance to set for Double Metaphone, because this matching method is based on phonetic discrepancies in the input data.
  8. Double-click the first tFileOutputExcel to display its Basic settings view and define its properties.
  9. Set the destination file name as well as the Sheet name and select the Include header check box.
  10. Do the same for the second tFileOutputExcel.

Executing the Job

Procedure

Save your Job and press F6 to execute it.

Results

tFuzzyUniqRow uses the Levenshtein method to compare the data in each of the three defined columns separately, uses the Double Metaphone method to compare the data in the City column, and finally passes the unique and duplicate rows to the defined output files. In this example, the first two rows match, so the second row goes to the duplicates output.

The generated FID column gives the identifier of the original record to which the current record refers.

The third row is unique and goes to the uniques output.

Doing a fuzzy match on two columns and outputting the match, possible match and non match values

This scenario applies only to Talend Data Management Platform, Talend Big Data Platform, Talend Real Time Big Data Platform, Talend Data Services Platform, Talend MDM Platform and Talend Data Fabric.

This scenario describes a six-component Job that aims at:

  • matching each processed group number in the grp column against the entries that have exactly the same values in the reference input file,

  • checking the edit distance between the entries in the firstname column of an input file and those of the reference input file.

The outputs of these two matching types are written to three output files: the first for match values, the second for possible match values and the third for the values that have no matches in the lookup file.

In this scenario, we have already stored the main and reference input schemas in the Repository. For more information about storing schema metadata in the Repository, see Talend Studio User Guide.

The main input file contains four columns: grp, gender, firstname and count. The data in this input file has problems such as duplication, first names spelled differently or incorrectly, and different information for the same customer.

Setting up the Job

Procedure

  1. In the Repository tree view, expand Metadata and the FileExcel node where you have stored the main input schema, and then drop it onto the design workspace.

    A tFileInputExcel component holding your schema appears on the workspace.

  2. Do the same with the reference input schema to display the reference tFileInputExcel on the workspace.
  3. Drop a tBlockedFuzzyJoin component and three tLogRow components from the Palette onto the design workspace.
  4. Connect the main and reference input Excel files to tBlockedFuzzyJoin using Main links. The link between the reference input Excel file and tBlockedFuzzyJoin displays as a Lookup link on the design workspace.
  5. Connect tBlockedFuzzyJoin to the three tLogRow components using the Matches, Possible Matches and Non Matches links.

Configuring the input components

Procedure

Double-click the main and reference input Excel file components to display their Basic settings views.
