Data matching with Talend tools - 7.3

EnrichVersion
Cloud
7.3
EnrichProdName
Talend Big Data Platform
Talend Data Fabric
Talend Data Management Platform
Talend Data Services Platform
Talend MDM Platform
Talend Real-Time Big Data Platform
EnrichPlatform
Talend Studio
task
Data Governance > Third-party systems > Data Quality components > Matching components > Continuous matching components
Data Governance > Third-party systems > Data Quality components > Matching components > Data matching components
Data Governance > Third-party systems > Data Quality components > Matching components > Fuzzy matching components
Data Governance > Third-party systems > Data Quality components > Matching components > Matching with machine learning components
Data Quality and Preparation > Third-party systems > Data Quality components > Matching components > Continuous matching components
Data Quality and Preparation > Third-party systems > Data Quality components > Matching components > Data matching components
Data Quality and Preparation > Third-party systems > Data Quality components > Matching components > Fuzzy matching components
Data Quality and Preparation > Third-party systems > Data Quality components > Matching components > Matching with machine learning components
Design and Development > Third-party systems > Data Quality components > Matching components > Continuous matching components
Design and Development > Third-party systems > Data Quality components > Matching components > Data matching components
Design and Development > Third-party systems > Data Quality components > Matching components > Fuzzy matching components
Design and Development > Third-party systems > Data Quality components > Matching components > Matching with machine learning components

What is data matching?

Data matching is the process that enables you to find records representing the same entity in a dataset.

General definition

Data matching enables you to:
  • Find duplicates, potential duplicates and non-duplicates in a data source
  • Analyze data and return weighted probabilities of matching
  • Merge identical or similar entries into a single entry; and
  • Reduce disparity across different data sources.

Record linkage

Record linkage consists of identifying records that refer to the same entity in a dataset.

Two types of data record linkage exist:
  • Deterministic record linkage, which is based on identifiers that match; and
  • Probabilistic record linkage, which is based on the probability that identifiers match.

What to do before matching?

Profiling data

Data profiling is the process of examining the data available in different data sources and collecting statistics and information about this data.

Data profiling helps assess the quality level of the data according to defined set goals.

Data quality issues can stem from many different sources including, legacy systems, data migrations, database modifications, human communication inconsistencies and countless other potential anomalies. Regardless of the source, data quality issues can impact the ability of business to use its data to make insightful decisions.

If data are of a poor quality, or managed in structures that cannot be integrated to meet the needs of the enterprise, business processes and decision-making suffer.

Compared to manual analysis techniques, data profiling technology improves the enterprise ability to meet the challenge of managing data quality and to address the data quality challenges faced during data migrations and data integrations.

Standardizing data

Standardizing data before trying to perform matching tasks is an essential step to improve matching accuracy.
Talend provides different ways to standardize data:
  • You can standardize data against indices. Synonyms are standardized or converted to the "master" words.

    For more information on available data synonym dictionaries, see the Talend Data Fabric Studio User Guide.

  • You can use address validation components to standardize address data against Experian QAS, Loqate and MelissaData validation tools. The addresses returned by these tools are consistent and variations in address representations are eliminated. As addresses are standardized, matching gets easier.

    For more information on the tQASBatchAddressRow, tLoqateAddressRow and tMelissaDataAddress components, see Address standardization.

  • You can use the tStandardizePhoneNumber component to standardize a phone number, based on the formatting convention of the country of origin.

    For more information on phone number standardization, see Phone number standardization.

  • You can use other more generic components to transform your data and get more standardized records, such as tReplace, tReplaceList, tVerifyEmail, tExtractRegexFields or tMap.

How do you match?

The classical matching approach

The classical approach consists of sorting data into similar sized partitions which have the same attribute, choosing metrics and defining matching rules.

Blocking by partitions

Record linkage is a demanding task because each record must be compared to the other ones from the data set. To improve the efficiency of this task, the blocking technique is a required step most of the time.

Blocking consists of sorting data into similar sized partitions which have the same attribute. The objective is to restrict comparisons to the records grouped within the same partition.

To create efficient partitions, you need to find attributes which are unlikely to change, such as a person's first name or last name. By doing this, you improve the reliability of the blocking step and the computation speed of the task.

It is recommended to use the tGenKey component to generate blocking keys and to view the distribution of the blocks.

For more information on generating blocking keys, see Identification.

Choosing metrics and defining matching rules

After blocking data into similar sized group, you can create match rules and test them before using them in the tMatchGroup component.

For more information about creating a match analysis, see Talend Data Fabric Studio User Guide.

Matching functions in the tMatchGroup component

tMatchGroup helps you create groups of similar data records in any source of data including large volumes of data by using one or several match rules.

Each created group is made up of a master record and records similar to this master record. The matching functions used to compute similarity measures between similar records and the master record include the following ones:
  • Phonetic algorithms, such as Soundex or Metaphone, are used to match names.
  • The Levensthein distance calculates the minimum number of edits required to transform one string to another.
  • The Jaro distance matches processed entries according to spelling deviations.
  • The Jaro-Winkler distance is a variant of Jaro giving more importance to the beginning of the string.

For more information on how to use the tMatchGroup component in standard and Map/Reduce Jobs, , see Classical matching.

The Simple VSR Matcher and the T-Swoosh algorithms

You can choose between two algorithms when using the tMatchGroup component:
  • Simple VSR Matcher
  • T-Swoosh

For more information about match analyses, see "Create a match rule" on Talend Help Center.

When do records match?

Two records match when the following conditions are met:
  • When using the T-Swoosh algorithm, the score returned for each matching function must be higher than the threshold you set.
  • The global score, computed as a weighted score of the different matching functions, must be higher than the match threshold.

Multiple passes

In general, different partitioning schemes are necessary. This requires using sequentially tMatchGroup components to match data against different blocking keys.

For an example of how to match data through multiple passes, see Classical matching.

Working with the tRecordMatching component

tRecordMatching joins compared columns from the main flow with reference columns from the lookup flow. According to the matching strategy you define, tRecordMatching outputs the match data, the possible match data and the rejected data. When arranging your matching strategy, the user-defined matching scores are critical to determine the match level of the data of interest.

For more information about the tRecordMatching component, see Classical matching.

The machine learning approach

The machine learning approach is useful when you want to match very high volume of data.

The data matching process can be automated by making a model learn and predict matches.

The data matching process

The advantages of the machine learning approach over the classical approach are the following:

  • The different blocking mechanism permits faster and more scalable computation. In the machine learning approach, blocking is not partitioning: a record can belong to different blocks and the size of the block is clearly delimited, which may not be the case with the tGenKey component.
  • The rules learnt and stored by the machine learning model can be much more complex and less arbitrary than human-designed matching rules.
  • Configuring components is more simple. The machine learning model learns automatically matching distances and similarity threshold, among other things.
  1. The first step consists of pre-analyzing a data set using the tMatchPairing component. Unique records, exact match records, suspect match pairs and a sample of the suspect match pairs are outputted by the tMatchPairing component.

    For more examples, see Computing suspect pairs and writing a sample in Talend Data Stewardship and Computing suspect pairs and suspect sample from source data.

  2. The second step consists of labeling the suspect match pairs from the sample as "match" or "no-match" manually. You can leverage Talend Data Stewardship to make the labeling task easier.

    You can use more than two classes, for example “match”, “potential match” and “different”.

    For more information on handling grouping tasks to decide on relationship among pairs of records in Talend Data Stewardship, see Talend Data Stewardship Examples.

  3. The third step consists of submitting the suspect match pairs you labeled to the tMatchModel component for learning and outputting a classifier model.

    For examples of how to generate a matching model, see the scenarios.

  4. The fourth step consists of labeling suspect pairs for large data sets automatically using the model computed by tMatchModel with the tMatchPredict component.

    For an example of labeling suspect pairs with assigned labels, see the scenario .

What is a good sample?

The sample should be well-balanced: the number of records in each class - "match" and "no match" - should be approximately the same. An imbalanced data sample yields an unsatisfactory model.

The sample should be diverse: the more diverse the examples in the sample are, the more effective the rules learnt by the model will be.

The sample should be the right size: if you have a large data set with millions of records, then a few hundreds or thousands of examples may be enough. If your data set contains less than 10 000 records, then the sample size should be between 1 and 10% of the full data set.

How does tMatchModel generate a model?

The machine learning algorithm computes different measures, which are called features, to get as much information as possible on the defined columns.

To generate the model, tMatchModel analyzes the data using the Random Forest algorithm. A random forest is a collection of decision trees used to solve a classification problem. In a decision tree, each node corresponds to a question about the features associated to the input data. A random forest grows many decision trees to improve the accuracy of the classification and to generate a model.

For more information on data matching on Apache Spark, see the properties of tMatchModel.

Surviving master records

You can use the tRuleSurvivorship component or Talend Data Stewardship to survive master records.

Merging records using tRuleSurvivorship

Once you estimated duplicates and possible duplicates that are grouped together, you can use the tRuleSurvivorship component to create a single representation for each group of duplicates using the best-of-breed data. This representation is called a survivor.

For an example of how to create a clean data set from the suspect pairs labeled by tMatchPredict and the unique rows computed by tMatchPairing, see Matching with machine learning.

Using Talend Data Stewardship for clerical review and merging records

You can add merging campaigns in Talend Data Stewardship to review and modify survivorship rules, create master records and merge data.

For further information on merging campaigns in Talend Data Stewardship, see Talend Data Stewardship Examples.

In Talend Data Stewardship, data stewards are business users in charge of resolving data stewardship tasks:
  • Classifying data by assigning a label chosen among a predefined list of arbitration choices.
  • Merging several potential duplicate records into one single record.

    Merging tasks allow authorized data stewards to merge several potential duplicate source records into one single record (golden record). The outcome of a merging task is the golden record produced by data stewards.

    For further information on merging tasks in Talend Data Stewardship, see Talend Data Stewardship Examples.

    Source records can come from the same source (database deduplication) or different sources (databases reconciliation).

How do you rematch using machine learning components?

Doing continuous matching

If you want to match new records against a clean data set, you do not need to restart the matching process from scratch.

You can reuse and index the clean set and to do continuous matching.

To be able to perform continuous matching tasks, Elasticsearch version 5.1.2+ must be running.

The continuous matching process is made up of the following steps:

  1. The first step consists of computing suffixes to separate clean and deduplicated records from a data set and indexing them in Elasticsearch using tMatchIndex.

    For an example of how to index a data in Elasticsearch using tMatchIndex, see this scenario.

  2. The second step consists of comparing the indexed records with new records having the same schema and outputting matching and non-matching records using tMatchIndexPredict. This component uses the pairing and matching models generated by tMatchPairing and tMatchModel.

    For an example of how to matching new records against records from a reference dataset, see this scenario.

You can then clean and deduplicate the non-matching records using tRuleSurvivorship and populate the clean data set indexed in Elasticsearch using tMatchIndex.

Exact matching

The exact matching considers two records an exact match when a subset of their attributes is identical.

Component

tUniqRow

Ensures data quality of input or output flow in a Job.

tUniqRow compares entries and sorts out duplicate entries from the input flow.

tUniqRow Standard properties

These properties are used to configure tUniqRow running in the Standard Job framework.

The Standard tUniqRow component belongs to the Data Quality family.

The component in this framework is available in all Talend products.

Basic settings

Schema and Edit schema

A schema is a row description. It defines the number of fields (columns) to be processed and passed on to the next component. When you create a Spark Job, avoid the reserved word line when naming the fields.

Click Edit schema to make changes to the schema. If the current schema is of the Repository type, three options are available:

  • View schema: choose this option to view the schema only.

  • Change to built-in property: choose this option to change the schema to Built-in for local changes.

  • Update repository connection: choose this option to change the schema stored in the repository and decide whether to propagate the changes to all the Jobs upon completion. If you just want to propagate the changes to the current Job, you can select No upon completion and choose this schema metadata again in the Repository Content window.

This component offers the advantage of the dynamic schema feature. This allows you to retrieve unknown columns from source files or to copy batches of columns from a source without mapping each column individually. For further information about dynamic schemas, see Talend Studio User Guide.

This dynamic schema feature is designed for the purpose of retrieving unknown columns of a table and is recommended to be used for this purpose only; it is not recommended for the use of creating tables.

 

Built-In: You create and store the schema locally for this component only.

 

Repository: You have already created the schema and stored it in the Repository. You can reuse it in various projects and Job designs.

Unique key

In this area, select one or more columns to carry out deduplication on the particular column(s)

- Select the Key attribute check box to carry out deduplication on all the columns

- Select the Case sensitive check box to differentiate upper case and lower case

Advanced settings

Only once each duplicated key

Select this check box if you want to have only the first duplicated entry in the column(s) defined as key(s) sent to the output flow for duplicates.

Use of disk (suitable for processing large row set)

Select this check box to enable generating temporary files on the hard disk when processing a large amount of data. This helps to prevent Job execution failure caused by memory overflow. With this check box selected, you need also to define:

- Buffer size in memory: Select the number of rows that can be buffered in the memory before a temporary file is to be generated on the hard disk.

- Directory for temp files: Set the location where the temporary files should be stored.

Warning:

Make sure that you specify an existing directory for temporary files; otherwise your Job execution will fail.

Ignore trailing zeros for BigDecimal

Select this check box to ignore trailing zeros for BigDecimal data.

tStatCatcher Statistics

Select this check box to gather the job processing metadata at a job level as well as at each component level.

Global Variables

Global Variables

NB_UNIQUES: the number of unique rows. This is an After variable and it returns an integer.

NB_DUPLICATES: the number of duplicate rows. This is an After variable and it returns an integer.

ERROR_MESSAGE: the error message generated by the component when an error occurs. This is an After variable and it returns a string. This variable functions only if the Die on error check box is cleared, if the component has this check box.

A Flow variable functions during the execution of a component while an After variable functions after the execution of the component.

To fill up a field or expression with a variable, press Ctrl + Space to access the variable list and choose the variable to use from it.

For further information about variables, see Talend StudioUser Guide.

Usage

Usage rule

This component handles flow of data therefore it requires input and output, hence is defined as an intermediary step.

tUniqRow properties for Apache Spark Batch

These properties are used to configure tUniqRow running in the Spark Batch Job framework.

The Spark Batch tUniqRow component belongs to the Processing family.

The component in this framework is available in all subscription-based Talend products with Big Data and Talend Data Fabric.

Basic settings

Schema and Edit schema

A schema is a row description. It defines the number of fields (columns) to be processed and passed on to the next component. When you create a Spark Job, avoid the reserved word line when naming the fields.

Click Edit schema to make changes to the schema. If the current schema is of the Repository type, three options are available:

  • View schema: choose this option to view the schema only.

  • Change to built-in property: choose this option to change the schema to Built-in for local changes.

  • Update repository connection: choose this option to change the schema stored in the repository and decide whether to propagate the changes to all the Jobs upon completion. If you just want to propagate the changes to the current Job, you can select No upon completion and choose this schema metadata again in the Repository Content window.

 

Built-In: You create and store the schema locally for this component only.

 

Repository: You have already created the schema and stored it in the Repository. You can reuse it in various projects and Job designs.

Unique key

In this area, select one or more columns to carry out deduplication on the particular column(s)

- Select the Key attribute check box to carry out deduplication on all the columns

- Select the Case sensitive check box to differentiate upper case and lower case

Advanced settings

Only once each duplicated key

Select this check box if you want to have only the first duplicated entry in the column(s) defined as key(s) sent to the output flow for duplicates.

Usage

Usage rule

This component is used as an intermediate step.

This component, along with the Spark Batch component Palette it belongs to, appears only when you are creating a Spark Batch Job.

Note that in this documentation, unless otherwise explicitly stated, a scenario presents only Standard Jobs, that is to say traditional Talend data integration Jobs.

Spark Connection

In the Spark Configuration tab in the Run view, define the connection to a given Spark cluster for the whole Job. In addition, since the Job expects its dependent jar files for execution, you must specify the directory in the file system to which these jar files are transferred so that Spark can access these files:
  • Yarn mode (Yarn client or Yarn cluster):
    • When using Google Dataproc, specify a bucket in the Google Storage staging bucket field in the Spark configuration tab.

    • When using HDInsight, specify the blob to be used for Job deployment in the Windows Azure Storage configuration area in the Spark configuration tab.

    • When using Altus, specify the S3 bucket or the Azure Data Lake Storage for Job deployment in the Spark configuration tab.
    • When using Qubole, add a tS3Configuration to your Job to write your actual business data in the S3 system with Qubole. Without tS3Configuration, this business data is written in the Qubole HDFS system and destroyed once you shut down your cluster.
    • When using on-premise distributions, use the configuration component corresponding to the file system your cluster is using. Typically, this system is HDFS and so use tHDFSConfiguration.

  • Standalone mode: use the configuration component corresponding to the file system your cluster is using, such as tHDFSConfiguration or tS3Configuration.

    If you are using Databricks without any configuration component present in your Job, your business data is written directly in DBFS (Databricks Filesystem).

This connection is effective on a per-Job basis.

tUniqRow properties for Apache Spark Streaming

These properties are used to configure tUniqRow running in the Spark Streaming Job framework.

The Spark Streaming tUniqRow component belongs to the Processing family.

This component is available in Talend Real Time Big Data Platform and Talend Data Fabric.

Basic settings

Schema et Edit schema

A schema is a row description. It defines the number of fields (columns) to be processed and passed on to the next component. When you create a Spark Job, avoid the reserved word line when naming the fields.

Click Edit schema to make changes to the schema. If the current schema is of the Repository type, three options are available:

  • View schema: choose this option to view the schema only.

  • Change to built-in property: choose this option to change the schema to Built-in for local changes.

  • Update repository connection: choose this option to change the schema stored in the repository and decide whether to propagate the changes to all the Jobs upon completion. If you just want to propagate the changes to the current Job, you can select No upon completion and choose this schema metadata again in the Repository Content window.

 

Built-In: You create and store the schema locally for this component only.

 

Repository: You have already created the schema and stored it in the Repository. You can reuse it in various projects and Job designs.

Unique key

In this area, select one or more columns to carry out deduplication on the particular column(s)

- Select the Key attribute check box to carry out deduplication on all the columns

- Select the Case sensitive check box to differentiate upper case and lower case

Advanced settings

Only once each duplicated key

Select this check box if you want to have only the first duplicated entry in the column(s) defined as key(s) sent to the output flow for duplicates.

Usage

Usage rule

This component is used as an intermediate step.

This component, along with the Spark Streaming component Palette it belongs to, appears only when you are creating a Spark Streaming Job.

Note that in this documentation, unless otherwise explicitly stated, a scenario presents only Standard Jobs, that is to say traditional Talend data integration Jobs.

Spark Connection

In the Spark Configuration tab in the Run view, define the connection to a given Spark cluster for the whole Job. In addition, since the Job expects its dependent jar files for execution, you must specify the directory in the file system to which these jar files are transferred so that Spark can access these files:
  • Yarn mode (Yarn client or Yarn cluster):
    • When using Google Dataproc, specify a bucket in the Google Storage staging bucket field in the Spark configuration tab.

    • When using HDInsight, specify the blob to be used for Job deployment in the Windows Azure Storage configuration area in the Spark configuration tab.

    • When using Altus, specify the S3 bucket or the Azure Data Lake Storage for Job deployment in the Spark configuration tab.
    • When using Qubole, add a tS3Configuration to your Job to write your actual business data in the S3 system with Qubole. Without tS3Configuration, this business data is written in the Qubole HDFS system and destroyed once you shut down your cluster.
    • When using on-premise distributions, use the configuration component corresponding to the file system your cluster is using. Typically, this system is HDFS and so use tHDFSConfiguration.

  • Standalone mode: use the configuration component corresponding to the file system your cluster is using, such as tHDFSConfiguration or tS3Configuration.

    If you are using Databricks without any configuration component present in your Job, your business data is written directly in DBFS (Databricks Filesystem).

This connection is effective on a per-Job basis.

tUniqRow MapReduce properties (deprecated)

These properties are used to configure tUniqRow running in the MapReduce Job framework.

The MapReduce tUniqRow component belongs to the Data Quality family.

The component in this framework is available in all subscription-based Talend products with Big Data and Talend Data Fabric.

The MapReduce framework is deprecated from Talend 7.3 onwards. Use Talend Jobs for Apache Spark to accomplish your integration tasks.

Basic settings

Schema et Edit schema

A schema is a row description. It defines the number of fields (columns) to be processed and passed on to the next component. When you create a Spark Job, avoid the reserved word line when naming the fields.

Click Edit schema to make changes to the schema. If the current schema is of the Repository type, three options are available:

  • View schema: choose this option to view the schema only.

  • Change to built-in property: choose this option to change the schema to Built-in for local changes.

  • Update repository connection: choose this option to change the schema stored in the repository and decide whether to propagate the changes to all the Jobs upon completion. If you just want to propagate the changes to the current Job, you can select No upon completion and choose this schema metadata again in the Repository Content window.

This component offers the advantage of the dynamic schema feature. This allows you to retrieve unknown columns from source files or to copy batches of columns from a source without mapping each column individually. For further information about dynamic schemas, see Talend Studio User Guide.

This dynamic schema feature is designed for the purpose of retrieving unknown columns of a table and is recommended to be used for this purpose only; it is not recommended for the use of creating tables.

 

Built-In: You create and store the schema locally for this component only.

 

Repository: You have already created the schema and stored it in the Repository. You can reuse it in various projects and Job designs.

Unique key

In this area, select one or more columns to carry out deduplication on the particular column(s)

- Select the Key attribute check box to carry out deduplication on all the columns

- Select the Case sensitive check box to differentiate upper case and lower case

Advanced settings

Only once each duplicated key

Select this check box if you want to have only the first duplicated entry in the column(s) defined as key(s) sent to the output flow for duplicates.

Ignore trailing zeros for BigDecimal

Select this check box to ignore trailing zeros for BigDecimal data.

Global Variables

Global Variables

ERROR_MESSAGE: the error message generated by the component when an error occurs. This is an After variable and it returns a string. This variable functions only if the Die on error check box is cleared, if the component has this check box.

A Flow variable functions during the execution of a component while an After variable functions after the execution of the component.

To fill up a field or expression with a variable, press Ctrl + Space to access the variable list and choose the variable to use from it.

For further information about variables, see Talend StudioUser Guide.

Usage

Usage rule

In a Talend Map/Reduce Job, this component is used as an intermediate step and other components used along with it must be Map/Reduce components, too. They generate native Map/Reduce code that can be executed directly in Hadoop.

For further information about a Talend Map/Reduce Job, see the sections describing how to create, convert and configure a Talend Map/Reduce Job of the Talend Open Studio for Big Data Getting Started Guide .

For a scenario demonstrating a Map/Reduce Job using this component, see Deduplicating entries using Map/Reduce components.

Note that in this documentation, unless otherwise explicitly stated, a scenario presents only Standard Jobs, that is to say traditional Talend data integration Jobs, and non Map/Reduce Jobs.

Scenarios

Deduplicating entries

In this five-component Job, we will sort entries on an input name list, find out duplicated names, and display the unique names and the duplicated names on the Run console.

Setting up the Job

Procedure

  1. Drop a tFileInputDelimited, a tSortRow, a tUniqRow, and two tLogRow components from the Palette to the design workspace, and name the components as shown above.
  2. Connect the tFileInputDelimited component, the tSortRow component, and the tUniqRow component using Row > Main connections.
  3. Connect the tUniqRow component and the first tLogRow component using a Main > Uniques connection.
  4. Connect the tUniqRow component and the second tLogRow component using a Main > Duplicates connection.

Configuring the components

Procedure

  1. Double-click the tFileInputDelimited component to display its Basic settings view.
  2. Click the [...] button next to the File Name field to browse to your input file.
  3. Define the header and footer rows. In this use case, the first row of the input file is the header row.
  4. Click Edit schema to define the schema for this component. In this use case, the input file has five columns: Id, FirstName, LastName, Age, and City. Then click OK to propagate the schema and close the schema editor.
  5. Double-click the tSortRow component to display its Basic settings view.
  6. To rearrange the entries in the alphabetic order of the names, add two rows in the Criteria table by clicking the plus button, select the FirstName and LastName columns under Schema column, select alpha as the sorting type, and select the sorting order.
  7. Double-click the tUniqRow component to display its Basic settings view.
  8. In the Unique key area, select the columns on which you want deduplication to be carried out. In this use case, you will sort out duplicated names.
  9. In the Basic settings view of each of the tLogRow components, select the Table option to view the Job execution result in table mode.

Saving and executing the Job

Procedure

  1. Press Ctrl+S to save your Job.
  2. Run the Job by pressing F6 or clicking the Run button on the Run tab.
    The unique names and duplicated names are displayed in different tables on the Run console.

Deduplicating entries based on dynamic schema

This scenario applies only to Talend Data Management Platform, Talend Big Data Platform, Talend Real Time Big Data Platform, Talend Data Services Platform, Talend MDM Platform and Talend Data Fabric.

In this use case, we will use a Job similar to the one in the scenario described earlier to deduplicate the input entries about several families, so that only one person per family stays on the name list. As all the components in this Job support the dynamic schema feature, we will leverage this feature to save the time of configuring individual columns of the schemas.

Setting up the Job

Procedure

  1. Drop these components from the Palette to the design workspace: tFileInputDelimited, tExtractDynamicFields, tUniqRow, tFileOutputDelimited, and tLogRow, and name the components as shown above to better identify their roles in the Job.
  2. Connect the component labelled People, the component labelled Split_Column, and the component labelled Deduplicate using Row > Main connections.
  3. Connect the component labelled Deduplicate and the component labelled Unique_Families using a Main > Uniques connection.
  4. Connect the component labelled Deduplicate and the component labelled Duplicated_Families using a Main > Duplicates connection.

Configuring the components

Procedure

  1. Double-click the component labelled People to display its Basic settings view.
    Warning:

    The dynamic schema feature is only supported in Built-In mode and requires the input file to have a header row.

  2. Click the [...] button next to the File Name/Stream field to browse to your input file.
  3. Define the header and footer rows. In this use case, the first row of the input file is the header row.
  4. Click Edit schema to define the schema for this component.
    In this use case, the input file has five columns: FirstName, LastName, HouseNo, Street, and City. However, as we can leverage the advantage of the dynamic schema feature, we simply define one dynamic column in the schema, Dyna in this example.
    To do so :
    1. Add a new line by clicking the [+] button.
    2. Type Dyna in the Column field.
    3. Select Dynamic from the Type list.
    4. Then, click OK to propagate the schema and close the Schema dialog box.
  5. Double-click the component labelled Split_Column to display its Basic settings view.
    We will use this component to split the dynamic column of the input schema into two columns, one for the first name and the other for the family related information. To do so:
    1. Click Edit schema to open the Schema dialog box.
    2. In the output panel, click the [+] button to add two columns for the output schema, and name them FirstName and FamilyInfo respectively.
    3. Select String from the Type list for the FirstName column to extract this column from the input schema to carry the first name of each person on the name list.
    4. Select Dynamic from the Type list for the FamilyInfo column so that this column will carry the rest information of each person on the name list: the last name, house number, street and city, which all together will identify a family.
    5. Then, click OK to propagate the schema and close the Schema dialog box.
  6. Double-click the component labelled Deduplicate to display its Basic settings view.
  7. In the Unique key area, select the Key attribute check box for the FamilyInfo column to carry out deduplication on the family information.
  8. In the Basic settings view of the tFileOutputDelimited component, which is labelled Deduplicated_Families, define the output file path, select the Include header check box, and leave the other settings as they are.
  9. In the Basic settings view of the tLogRow component, which is labelled Duplicated_Families, select the Table option to view the Job execution result in table mode.

Saving and executing the Job

Procedure

  1. Press Ctrl+S to save your Job.
  2. Run the Job by pressing F6 or clicking the Run button on the Run tab.
    The information of duplicated families is displayed on the Run console, and only one person per family stays on the name list in the output file.

Deduplicating entries using Map/Reduce components

The MapReduce framework is deprecated from Talend 7.3 onwards. Use Talend Jobs for Apache Spark to accomplish your integration tasks.

This scenario applies only to subscription-based Talend Platform products with Big Data and Talend Data Fabric.

This scenario illustrates how to create a Talend Map/Reduce Job to deduplicate entries, that is to say, to use Map/Reduce components to generate Map/Reduce code and run the Job right in Hadoop.

Note that the Talend Map/Reduce components are available to subscription-based Big Data users only and this scenario can be replicated only with Map/Reduce components.

The sample data to be used in this scenario reads as follows:
1;Harry;Ford;68;Albany
2;Franklin;Wilson;79;Juneau
3;Ulysses;Roosevelt;25;Harrisburg
4;Harry;Ford;48;Olympia
5;Martin;Reagan;75;Columbia
6;Woodrow;Roosevelt;63;Harrisburg
7;Grover;McKinley;98;Atlanta
8;John;Taft;93;Montpelier
9;Herbert;Johnson;85;Lincoln
10;Grover;McKinley;33;Lansing

Since Talend Studio allows you to convert a Job between its Map/Reduce and Standard (Non Map/Reduce) versions, you can convert the scenario explained earlier to create this Map/Reduce Job. This way, many components used can keep their original settings so as to reduce your workload in designing this Job.

Before starting to replicate this scenario, ensure that you have appropriate rights and permissions to access the Hadoop distribution to be used. Then proceed as follows:

Converting the Job to a Big Data Batch Job

Procedure

  1. In the Repository tree view, right-click the Job you have created in the earlier scenario to open its contextual menu and select Edit properties.
    Then the Edit properties dialog box is displayed. Note that the Job must be closed before you are able to make any changes in this dialog box.
    This dialog box looks like the image below:
    Note that you can change the Job name as well as the other descriptive information about the Job from this dialog box.
  2. From the Job Type list, select Big Data Batch. Then a Map/Reduce Job using the same name appears under the Big Data Batch sub-node of the Job Design node.

Rearranging the components

Procedure

  1. Double-click this new Map/Reduce Job to open it in the workspace. The Map/Reduce components' Palette is opened accordingly and in the workspace, the crossed-out components, if any, indicate that those components do not have the Map/Reduce version.
  2. Right-click each of those components in question and select Delete to remove them from the workspace.
  3. Drop a tHDFSInput component, a tHDFSOutput component and a tJDBCOutput component in the workspace. The tHDFSInput component reads data from the Hadoop distribution to be used, the tHDFSOutput component writes data in that distribution and the tJDBCOutput component writes data in a given database, for example, a MySQL database in this scenario. The two output components replace the two tLogRow components to output data.
    If from scratch, you have to drop a tSortRow component and a tUniqRow component, too.
  4. Connect tHDFSInput to tSortRow using the Row > Main link and accept to get the schema of tSortRow.
  5. Connect tUniqRow to tHDFSOutput using Row > Uniques and to tJDBCOutput using Row > Duplicates.

Setting up Hadoop connection

Procedure

  1. Click Run to open its view and then click the Hadoop Configuration tab to display its view for configuring the Hadoop connection for this Job.
  2. From the Property type list, select Built-in. If you have created the connection to be used in Repository, then select Repository and thus the Studio will reuse that set of connection information for this Job.
  3. In the Version area, select the Hadoop distribution to be used and its version.
    • If you use Google Cloud Dataproc, see Google Cloud Dataproc.

    • If you cannot find the Cloudera or Hortonworks version to be used from the Version drop-down list, you can add your distribution via some dynamic distribution settings in the Studio.
      • On the version list of the distributions, some versions are labelled Builtin. These versions were added by Talend via the Dynamic distribution mechanism and delivered with the Studio when the Studio was released. They are certified by Talend, thus officially supported and ready to use.
    • If you cannot find from the list the distribution corresponding to yours, select Custom so as to connect to a Hadoop distribution not officially supported in the Studio. For a step-by-step example about how to use this Custom option, see Connecting to a custom Hadoop distribution.

  4. In the Name node field, enter the location of the master node, the NameNode, of the distribution to be used. For example, hdfs://tal-qa113.talend.lan:8020.
    • If you are using a MapR distribution, you can simply leave maprfs:/// as it is in this field; then the MapR client will take care of the rest on the fly for creating the connection. The MapR client must be properly installed. For further information about how to set up a MapR client, see the following link in MapR's documentation: http://doc.mapr.com/display/MapR/Setting+Up+the+Client

    • If you are using WebHDFS, the location should be webhdfs://masternode:portnumber; WebHDFS with SSL is not supported yet.

  5. In the Resource Manager field, enter the location of the ResourceManager of your distribution. For example, tal-qa114.talend.lan:8050.
    • Then you can continue to set the following parameters depending on the configuration of the Hadoop cluster to be used (if you leave the check box of a parameter clear, then at runtime, the configuration about this parameter in the Hadoop cluster to be used will be ignored):

      • Select the Set resourcemanager scheduler address check box and enter the Scheduler address in the field that appears.

      • Select the Set jobhistory address check box and enter the location of the JobHistory server of the Hadoop cluster to be used. This allows the metrics information of the current Job to be stored in that JobHistory server.

      • Select the Set staging directory check box and enter this directory defined in your Hadoop cluster for temporary files created by running programs. Typically, this directory can be found under the yarn.app.mapreduce.am.staging-dir property in the configuration files such as yarn-site.xml or mapred-site.xml of your distribution.

      • Select the Use datanode hostname check box to allow the Job to access datanodes via their hostnames. This actually sets the dfs.client.use.datanode.hostname property to true. When connecting to a S3N filesystem, you must select this check box.

  6. If you are accessing the Hadoop cluster running with Kerberos security, select this check box, then, enter the Kerberos principal name for the NameNode in the field displayed. This enables you to use your user name to authenticate against the credentials stored in Kerberos.
    • If this cluster is a MapR cluster of the version 5.0.0 or later, you can set the MapR ticket authentication configuration in addition or as an alternative by following the explanation in Connecting to a security-enabled MapR.

      Keep in mind that this configuration generates a new MapR security ticket for the username defined in the Job in each execution. If you need to reuse an existing ticket issued for the same username, leave both the Force MapR ticket authentication check box and the Use Kerberos authentication check box clear, and then MapR should be able to automatically find that ticket on the fly.

    In addition, since this component performs Map/Reduce computations, you also need to authenticate the related services such as the Job history server and the Resource manager or Jobtracker depending on your distribution in the corresponding field. These principals can be found in the configuration files of your distribution. For example, in a CDH4 distribution, the Resource manager principal is set in the yarn-site.xml file and the Job history principal in the mapred-site.xml file.

    If you need to use a Kerberos keytab file to log in, select Use a keytab to authenticate. A keytab file contains pairs of Kerberos principals and encrypted keys. You need to enter the principal to be used in the Principal field and the access path to the keytab file itself in the Keytab field. This keytab file must be stored in the machine in which your Job actually runs, for example, on a Talend Jobserver.

    Note that the user that executes a keytab-enabled Job is not necessarily the one a principal designates but must have the right to read the keytab file being used. For example, the user name you are using to execute a Job is user1 and the principal to be used is guest; in this situation, ensure that user1 has the right to read the keytab file to be used.

  7. In the User name field, enter the login user name for your distribution. If you leave it empty, the user name of the machine hosting the Studio will be used.
  8. In the Temp folder field, enter the path in HDFS to the folder where you store the temporary files generated during Map/Reduce computations.
  9. Leave the default value of the Path separator in server as it is, unless you have changed the separator used by your Hadoop distribution's host machine for its PATH variable or in other words, that separator is not a colon (:). In that situation, you must change this value to the one you are using in that host.
  10. Leave the Clear temporary folder check box selected, unless you want to keep those temporary files.
  11. Leave the Compress intermediate map output to reduce network traffic check box selected, so as to spend shorter time to transfer the mapper task partitions to the multiple reducers.
    However, if the data transfer in the Job is negligible, it is recommended to clear this check box to deactivate the compression step, because this compression consumes extra CPU resources.
  12. If you need to use custom Hadoop properties, complete the Hadoop properties table with the property or properties to be customized. Then at runtime, these changes will override the corresponding default properties used by the Studio for its Hadoop engine.
    For further information about the properties required by Hadoop, see Apache's Hadoop documentation on http://hadoop.apache.org, or the documentation of the Hadoop distribution you need to use.
  13. If the HDFS transparent encryption has been enabled in your cluster, select the Setup HDFS encryption configurations check box and in the HDFS encryption key provider field that is displayed, enter the location of the KMS proxy.

    For further information about the HDFS transparent encryption and its KMS proxy, see Transparent Encryption in HDFS.

  14. You can tune the map and reduce computations by selecting the Set memory check box to set proper memory allocations for the computations to be performed by the Hadoop system.

    The memory parameters to be set are Map (in Mb), Reduce (in Mb) and ApplicationMaster (in Mb). These fields allow you to dynamically allocate memory to the map and the reduce computations and the ApplicationMaster of YARN.

    For further information about the Resource Manager, its scheduler and the ApplicationMaster, see YARN's documentation such as http://hortonworks.com/blog/apache-hadoop-yarn-concepts-and-applications/.

    For further information about how to determine YARN and MapReduce memory configuration settings, see the documentation of the distribution you are using, such as the following link provided by Hortonworks: http://docs.hortonworks.com/HDPDocuments/HDP2/HDP-2.0.6.0/bk_installing_manually_book/content/rpm-chap1-11.html.

  15. If you are using Cloudera V5.5+, you can select the Use Cloudera Navigator check box to enable the Cloudera Navigator of your distribution to trace your Job lineage to the component level, including the schema changes between components.

    With this option activated, you need to set the following parameters:

    • Username and Password: this is the credentials you use to connect to your Cloudera Navigator.

    • Cloudera Navigator URL : enter the location of the Cloudera Navigator to be connected to.

    • Cloudera Navigator Metadata URL: enter the location of the Navigator Metadata.

    • Activate the autocommit option: select this check box to make Cloudera Navigator generate the lineage of the current Job at the end of the execution of this Job.

      Since this option actually forces Cloudera Navigator to generate lineages of all its available entities such as HDFS files and directories, Hive queries or Pig scripts, it is not recommended for the production environment because it will slow the Job.

    • Kill the job if Cloudera Navigator fails: select this check box to stop the execution of the Job when the connection to your Cloudera Navigator fails.

      Otherwise, leave it clear to allow your Job to continue to run.

    • Disable SSL validation: select this check box to make your Job to connect to Cloudera Navigator without the SSL validation process.

      This feature is meant to facilitate the test of your Job but is not recommended to be used in a production cluster.

  16. If you are using Hortonworks Data Platform V2.4.0 onwards and you have installed Atlas in your cluster, you can select the Use Atlas check box to enable Job lineage to the component level, including the schema changes between components.

    With this option activated, you need to set the following parameters:

    • Atlas URL: enter the location of the Atlas to be connected to. It is often http://name_of_your_atlas_node:port

    • Die on error: select this check box to stop the Job execution when Atlas-related issues occur, such as connection issues to Atlas.

      Otherwise, leave it clear to allow your Job to continue to run.

    In the Username and Password fields, enter the authentication information for access to Atlas.

Configuring input and output components

Configuring tHDFSInput

Procedure

  1. Double-click tHDFSInput to open its Component view.
  2. Click the button next to Edit schema to verify that the schema received in the earlier steps is properly defined.
    Note that if you are creating this Job from scratch, you need to click the button to manually add these schema columns; otherwise, if the schema has been defined in Repository, you can select the Repository option from the Schema list in the Basic settings view to reuse it. For further information about how to define a schema in Repository, see the chapter describing metadata management in the Talend Studio User Guide or the chapter describing the Hadoop cluster node in Repository of the Getting Started Guide.
  3. If you make changes in the schema, click OK to validate these changes and accept the propagation prompted by the pop-up dialog box.
  4. In the Folder/File field, enter the path, or browse to the source file you need the Job to read.
    If this file is not in the HDFS system to be used, you have to place it in that HDFS, for example, using tFileInputDelimited and tHDFSOutput in a Standard Job.

Reviewing the transformation components

Procedure

  1. Double-click tSortRow to open its Component view.
    This component keeps its configuration used by the original Job. It sorts the incoming entries into alphabetical order depending on the FirstName and the LastName columns.
  2. Double-click tUniqRow to open its Component view.
    The component keeps as well its configuration from the original Job. It separates the incoming entries into a Uniques flow and a Duplicates flow, then sends the unique entries to tHDFSOutput and the duplicate entries to tJDBCOutput.

Configuring tHDFSOutput

Procedure

  1. Double-click tHDFSOutput to open its Component view.
  2. As explained earlier for verifying the schema of tHDFSInput, do the same to verify the schema of tHDFSOutput. If it is not consistent with that of its preceding component, tUniqRow, click Sync column to retrieve the schema of tUniqRow.
  3. In the Folder field, enter the path, or browse to the folder you want to write the unique entries in.
  4. From the Action list, select the operation you need to perform on the folder in question. If the folder already exists, select Overwrite; otherwise, select Create.

Configuring tJDBCOutput

Procedure

  1. Double-click tJDBCOutput to open its Component view.