tGenKey - 6.1

Talend Components Reference Guide

EnrichVersion
6.1
EnrichProdName
Talend Big Data
Talend Big Data Platform
Talend Data Fabric
Talend Data Integration
Talend Data Management Platform
Talend Data Services Platform
Talend ESB
Talend MDM Platform
Talend Open Studio for Big Data
Talend Open Studio for Data Integration
Talend Open Studio for Data Quality
Talend Open Studio for ESB
Talend Open Studio for MDM
Talend Real-Time Big Data Platform
task
Data Governance
Data Quality and Preparation
Design and Development
EnrichPlatform
Talend Studio

Warning

This component will be available in the Palette of Talend Studio on the condition that you have subscribed to one of the Talend Platform products.

Function

tGenKey enables you to apply several kinds of algorithms on each input column and use the computed results to generate a functional key. These algorithms can be key or optional algorithms.

The values returned by the key algorithms will be concatenated, according to the column order in the Key composition table.

Purpose

tGenKey generates a functional key from the input columns, by applying different types of algorithms on each column and grouping the computed results in one key. It outputs this key together with the input columns.

This component helps, using the generated functional key, to narrow down your data filter/matching results for example.

If you have subscribed to one of the Talend solutions with Big Data, this component is available in the following types of Jobs:

tGenKey properties

Component family

Data Quality

 

Basic settings

Schema and Edit schema

A schema is a row description, it defines the number of fields to be processed and passed on to the next component. The schema is either Built-in or stored remotely in the Repository.

Click Sync columns to retrieve the schema from the previous component in the Job.

 

 

Built-in: You create and store the schema locally for this component only. Related topic: see Talend Studio User Guide.

 

 

Repository: You have already created and stored the schema in the Repository. You can reuse it in other projects and job designs. Related topic: see Talend Studio User Guide.

 

Click the import icon to import blocking keys from the match rules that are defined and saved in the Studio repository.

When you click the import icon, a [Match Rule Selector] wizard is opened to help you import blocking keys from the match rules listed in the Studio repository and use them in your Job.

You can import blocking keys only from match rules that are defined with the VSR algorithm and saved in the Studio repository. For further information, see Importing match rules from the studio repository

Algorithm

Column

Select the column(s) from the main flow on which you want to define certain algorithms to set the functional key.

Note

When you select a date column on which to apply an algorithm or a matching algorithm, you can decide what to compare in the date format.

For example, if you want to only compare the year in the date, in the component schema set the type of the date column to Date and then enter "yyyy" in the Date Pattern field. The component then converts the date format to a string according to the pattern defined in the schema before starting a string comparison.

 

Pre-Algorithm

If required, select the relevant matching algorithm from the list:

remove diacritical marks:removes any diacritical mark.

remove diacritical marks and lower case: removes any diacritical mark and converts to lower case before generating the code of the column.

remove diacritical marks and upper case: removes any diacritical mark and converts to upper case before generating the code of the column.

lower case: converts the field to lower case before applying the key algorithm.

upper case: converts the field to upper case before applying the key algorithm.

add left position character: enables you to add a character to the left of the column.

add right position character: enables you to add a character to the right of the column.

 

Value

Set the algorithm value, where applicable.

 

Algorithm

Select the relevant algorithm from the list:

first character of each word: includes in the functional key the first character of each word in the column.

N first characters of each word: includes in the functional key N first characters of each word in the column.

first N characters of the string: includes in the functional key N first characters of the string.

last N characters of the string: includes in the functional key N last characters of the string.

first N consonants of the string: includes in the functional key N first consonants of the string.

first N vowels of the string: includes in the functional key N first vowels of the string.

pick characters: includes in the functional key the characters located at a fixed position (corresponding to the set digital/range).

exact: includes in the functional key the full string.

substring(a,b): includes in the functional key character according to the set index.

soundex code: generates a code according to a standard English phonetic algorithm. This code represents the character string that will be included in the functional key.

metaphone code: generates a code according to the character pronunciation. This code represents the character string that will be included in the functional key.

double-metaphone code: generates a code according to the character pronunciation using a new version of the Metaphone phonetic algorithm, that produces more accurate results than the original algorithm. This code represents the character string that will be included in the functional key.

fingerPrintkey: generates the functional key from a string value through the following sequential process:

  1. remove leading and trailing whitespace,

  2. change all characters to their lowercase representation,

  3. remove all punctuation and control characters,

  4. split the string into whitespace-separated tokens,

  5. sort the tokens and remove duplicates,

  6. join the tokens back together,

    Because the string parts are sorted, the given order of tokens does not matter. So, Cruise, Tom and Tom Cruise both end up with a fingerprint cruise tom and therefore end up in the same cluster.

  7. normalize extended western characters to their ASCII representation, for example gödel to godel.

    This reproduce data entry mistakes performed when entering extended characters with an ASCII-only keyboard. However, this procedure can also lead to false positives, for example gödel and godél would both end up with godel as their fingerprint but they are likely to be different names. So this might work less effectively for datasets where extended characters play substantial differentiation role.

nGramkey: this algorithm is similar to the fingerPrintkey method described above. But instead of using whitespace separated tokens, it uses n-grams, where the n can be specified by the user. This method generates the functional key from a string value through the following sequential process:

  1. change all characters to their lowercase representation,

  2. remove all punctuation and control characters,

  3. obtain all the string n-grams,

  4. sort the n-grams and remove duplicates,

  5. join the sorted n-grams back together,

  6. normalize extended western characters to their ASCII representation, for example gödel to godel.

    For example, the 2-gram fingerprint of Paris is arispari and the 1-gram fingerprint is aiprs.

    The delivered implementation of this algorithm is 2-grams.

Note

If the column on which you want to use the nGramkey algorithm can have data, with only 0 or 1 characters, you must filter this data before generating the functional key. This way you avoid potentially comparing records to each other that should not be match candidates.

colognPhonetic: a soundex phonetic algorithm optimized for the German language. It encodes a string into a Cologne phonetic value. This code represents the character string that will be included in the functional key.

 

Value

Set the algorithm value, where applicable.

 

Post-Algorithm

If required, select the relevant matching algorithm from the list:

use default value (string): enables you to choose a string to replace null or empty data.

add left position character: enables you to add a character to the left of the column.

add right position character: enables you to add a character to the right of the column.

 

Value

Set the option value, where applicable.

 

Show help

Select this check box to display instructions on how to set algorithms/options parameters.

Advanced settings

tStat Catcher Statistics

Select this check box to collect log data at the component level. Note that this check box is not available in the Map/Reduce version of the component.

Global Variables

ERROR_MESSAGE: the error message generated by the component when an error occurs. This is an After variable and it returns a string. This variable functions only if the Die on error check box is cleared, if the component has this check box.

A Flow variable functions during the execution of a component while an After variable functions after the execution of the component.

To fill up a field or expression with a variable, press Ctrl + Space to access the variable list and choose the variable to use from it.

For further information about variables, see Talend Studio User Guide.

Usage

This component is an intermediary step. It requires an input flow as well as an output. This component can be used with other components, such as tMatchGroup, in order to create a blocking key.

Limitation/prerequisite

n/a

Scenario 1: Generating functional keys in the output flow

This three-component scenario describes a basic Job that generates a functional key for each of the data records using one algorithm on one of the input columns, PostalCode.

This functional key can be used in different ways to narrow down the results of data filtering or data matching, for example. So the tGenKey component can be used with so many other data quality and data integration components to form different useful use cases. For an example of one use case of tGenKey, see Scenario 2: Comparing columns and grouping in the output flow duplicate records that have the same functional key.

In this scenario, the input data flow has four columns: Firstname, Lastname, DOB (date of birth), and PostalCode. This data has problems such as duplication, first or last names spelled differently or wrongly, different information for the same customer, etc. This scenario generates a functional key for each data record using an algorithm that concatenates the first two characters of the postal code.

Setting up the Job

  1. Drop the following components from the Palette onto the design workspace: tFixedFlowInput, tGenKey and tLogRow.

  2. Connect all the components together using the Main link.

Configuring the data input

  1. Double-click tFixedFlowInput to display the Basic settings view and define the component properties.

  2. Click the [...] button next to Edit Schema to open the [Schema] dialog box.

  3. Click the plus button to add as many lines as needed for the input schema you want to create from internal variables.

    In this example, the schema is made of four columns: Firstnam, Lastname, DOB and PostalCode.

    Then, click OK to close the dialog box.

  4. In the Mode area, select the Use Inline Table option.

    The Value table displays as Inline Table.

  5. Click the plus button to add as many line as needed and then click in each of the lines to define the input data for the four columns.

Configuring key generation

  1. Double-click tGenKey to display the Basic settings view and define the component properties.

    You can click and import blocking keys from the match rules created with the VSR algorithm and tested in the Profiling perspective of Talend Studio and use them in your Job. Otherwise, define the blocking key parameters as described in the below steps.

  2. Under the Algorithm table, click the plus button to add a row in this table.

  3. On the column column, click the newly added row and select from the list the column you want to process using an algorithm. In this example, select PostalCode.

    Note

    When you select a date column on which to apply an algorithm or a matching algorithm, you can decide what to compare in the date format.

    For example, if you want to only compare the year in the date, in the component schema set the type of the date column to Date and then enter "yyyy" in the Date Pattern field. The component then converts the date format to a string according to the pattern defined in the schema before starting a string comparison.

  4. On the algorithm column, click the newly added row and select from the list the algorithm you want to apply to the corresponding column. In this example, select first N characters of the string.

  5. Click in the Value column and enter the value for the selected algorithm, when needed. In this scenario, type in 2.

    In this example, we want to generate a functional key that holds the first two characters of the postal code for each of the data rows and we do not want to define any extra options on these columns.

    Note

    You can select the Show help check box to display instructions on how to set algorithms/options parameters.

    Once you have defined the tGenKey properties, you can display a statistical view of these parameters. To do so:

  6. Right-click the tGenKey component and select View Key Profile in the contextual menu.

    The View Key Profile editor displays, allowing you to visualize the statistics regarding the number of rows per block and to adapt them according to the results you want to get.

    Note

    When you are processing a large amount of data and when this component is used to partition data in order to use them in a matching component (such as tRecordMatching or tMatchGroup), it is preferable to have a limited number of rows in one block. An amount of about 50 rows per block is considered optimal, but it depends on the number of fields to compare, the total number of rows and the time considered acceptable for data processing.

Configuring the console output

  1. Double-click the tLogRow component to display the Basic settings view.

  2. In the Mode area, select Table to display the Job execution result in table cells.

Executing the Job

  • Save your Job and press F6 to execute it.

    All the output columns including the T_GEN_KEY column are listed in the Run console. The functional key for each data record is concatenated from the first two characters of the corresponding value in the PostalCode column.

Scenario 2: Comparing columns and grouping in the output flow duplicate records that have the same functional key

This second scenario describes a Job that aims at:

  • generating a functional key using one algorithm on one of the input columns, DoB as described in scenario 1.

  • matching two input columns using the Jaro-Winkler algorithm.

  • grouping the output columns by the generated functional key to optimize the matching operation and compare only the records that have the same blocking value, functional key in this scenario. For more information on grouping output columns and using blocking values, see tMatchGroup.

Setting up the Job

  1. Drop the following components from the Palette onto the design workspace: tRowGenerator, tGenKey and tMatchGroup.

  2. Connect all the components together using the Main link.

Configuring the data input

  1. Double-click tRowGenerator to define its schema as follows:

    The tRowGenerator component will generate an input data flow that has three columns: Firstname, Lastname, and DoB (date of birth). This data may have problems such as duplication, first or last names spelled differently or wrongly, different information for the same customer, etc.

  2. Click OK to validate the settings and close the dialog box, and accept to propagate the changes when prompted.

Configuring key generation

  1. Double-click tGenKey to display the Basic settings view and define the component properties.

    You can click and import blocking keys from the match rules created with the VSR algorithm and tested in the Profiling perspective of Talend Studio and use them in your Job. Otherwise, define the blocking key parameters as described in the below steps.

  2. Under the Algorithm table, click the [+] button to add a row in this table.

  3. On the column column, click the newly added row and select from the list the column you want to process using an algorithm. In this example, select DoB.

  4. On the algorithm column, click the newly added row and select from the list the algorithm you want to apply to the corresponding column. In this example, select substring(a,b).

  5. Click in the value column and enter the value for the selected algorithm, when needed. In this scenario, type in 6;10.

    Note

    The substring(a,b) algorithm allows you to extract the characters from a string, between two specified indices, and to return the new substring. First character is at index 0. In this scenario, for a given DoB "21-01-1995", 6;10 will return only the year of birth, that is to say "1995" which is the substring from the 7th to the 10th character.

    In this example, we want to generate a functional key that holds the last four characters of the date of birth, which correspond to the year of birth, for each of the data rows and we do not want to define any extra options on these columns.

    Note

    You can select the Show help check box to display instructions on how to set algorithms/options parameters.

    Once you have defined the tGenKey properties, you can display a statistical view of these parameters. To do so:

  6. Right-click on the tGenKey component and select View Key Profile in the contextual menu.

    The View Key Profile editor displays, allowing you to visualize statistics regarding the number of blocks and to adapt the parameters according to the results you want to get.

    Note

    When you are processing a large amount of data and when this component is used to partition data in order to use them in a matching component (such as tRecordMatching or tMatchGroup), it is preferable to have a limited number of rows in one block. An amount of about 50 rows per block is considered optimal, but it depends on the number of fields to compare, the total number of rows and the time considered acceptable for data processing.

    From the key editor, you can:

    • edit the Limit of rows used to calculate the statistics.

    • click and import blocking keys from the Studio repository and use them in your Job.

    • edit the input column you want to process using an algorithm.

    • edit the parameters of the algorithm you want to apply to input columns.

    Note

    Every time you make a modification, you can see its implications by clicking the Refresh button which is located at the top right part of the editor.

  7. Click OK to close the View Key Profile editor.

Configuring the grouping of the output data

  1. Click the tMatchGroup component, and then in its basic settings click the Edit schema button to view the input and output columns and do any modifications in the output schema, if needed.

    In the output schema of this component, there are output standard columns that are read-only. For more information, see tMatchGroup properties.

  2. Click OK to close the dialog box.

  3. Double-click the tMatchGroup component to display its Configuration Wizard and define the component properties.

    Note

    If you want to add a fixed output column, MATCHING_DISTANCES, which gives the details of the distance between each column, click the Advanced settings tab and select the Output distance details check box. For more information, see tMatchGroup properties.

  4. In the Key definition table, click the plus button to add to the list the columns on which you want to do the matching operation, FirstName and LastName in this scenario.

  5. Click in the first and second cells of the Matching Function column and select from the list the algorithm(s) to be used for the matching operation, Jaro-Winkler in this example.

  6. Click in the first and second cells of the Weight column and set the numerical weights for each of the columns used as key attributes.

  7. In the Match threshold field, enter the match probability threshold. Two data records match when the probability threshold is above this value.

  8. Click the plus button below the Blocking Selection table to add a line in the table, then click in the line and select from the list the column you want to use as a blocking value, T_GEN_KEY in this example.

    Using a blocking value reduces the number of pairs of records that needs to be examined. The input data is partitioned into exhaustive blocks based on the functional key. This will decrease the number of pairs to compare, as comparison is restricted to record pairs within each block.

  9. Click the Chart button in the top right corner of the wizard to execute the Job in the defined configuration and have the matching results directly in the wizard.

    The matching chart gives a global picture about the duplicates in the analyzed data. The matching table indicates the details of items in each group and colors the groups in accordance with their color in the matching chart.

Configuring the console output

  1. Double-click the tLogRow component to display the Basic settings view.

  2. In the Mode area, select Table to display the Job execution result in table cells.

Executing the Job

  • Save your Job and press F6 to execute it.

    The output columns include the T_GEN_KEY column that holds the functional key generated by the tGenKey component.

    You can see that all records that have the same functional key are grouped together in different blocks "groups". The identifier for each group is listed in the GID column next to the corresponding record. The number of records in each of the output blocks is listed in the GRP_SIZE column and computed only on the master record. The MASTER column indicates with true/false if the corresponding record is a master record or not a master record. The SCORE column lists the calculated distance between the input record and the master record according to the Jaro-Winkler matching algorithm.

    For an example of creating data partitions based on different blocking keys and using them with multiple tMatchGroup components, see tMatchGroup.

tGenKey in Talend Map/Reduce Jobs

Warning

The information in this section is only for users that have subscribed to one of the Talend solutions with Big Data and is not applicable to Talend Open Studio for Big Data users.

In a Talend Map/Reduce Job, tGenKey, as well as the other Map/Reduce components preceding it, generates native Map/Reduce code. This section presents the specific properties of tGenKey when it is used in that situation. For further information about a Talend Map/Reduce Job, see Talend Big Data Getting Started Guide.

Component family

Data Quality

 

Basic settings

Schema and Edit schema

A schema is a row description, it defines the number of fields to be processed and passed on to the next component. The schema is either Built-in or stored remotely in the Repository.

Click Sync columns to retrieve the schema from the previous component in the Job.

 

 

Built-in: You create and store the schema locally for this component only. Related topic: see Talend Studio User Guide.

 

 

Repository: You have already created and stored the schema in the Repository. You can reuse it in other projects and job designs. Related topic: see Talend Studio User Guide.

 

Click the import icon to import blocking keys from the match rules that are defined and saved in the Studio repository.

When you click the import icon, a [Match Rule Selector] wizard is opened to help you import blocking keys from the match rules listed in the Studio repository and use them in your Job.

You can import blocking keys only from match rules that are defined with the VSR algorithm and saved in the Studio repository. For further information, see Importing match rules from the studio repository

Algorithm

Column

Select the column(s) from the main flow on which you want to define certain algorithms to set the functional key.

Note

When you select a date column on which to apply an algorithm or a matching algorithm, you can decide what to compare in the date format.

For example, if you want to only compare the year in the date, in the component schema set the type of the date column to Date and then enter "yyyy" in the Date Pattern field. The component then converts the date format to a string according to the pattern defined in the schema before starting a string comparison.

 

Pre-Algorithm

If required, select the relevant matching algorithm from the list:

remove diacritical marks:removes any diacritical mark.

remove diacritical marks and lower case: removes any diacritical mark and converts to lower case before generating the code of the column.

remove diacritical marks and upper case: removes any diacritical mark and converts to upper case before generating the code of the column.

lower case: converts the field to lower case before applying the key algorithm.

upper case: converts the field to upper case before applying the key algorithm.

add left position character: enables you to add a character to the left of the column.

add right position character: enables you to add a character to the right of the column.

 

Value

Set the algorithm value, where applicable.

 

Algorithm

Select the relevant algorithm from the list:

first character of each word: includes in the functional key the first character of each word in the column.

N first characters of each word: includes in the functional key N first characters of each word in the column.

first N characters of the string: includes in the functional key N first characters of the string.

last N characters of the string: includes in the functional key N last characters of the string.

first N consonants of the string: includes in the functional key N first consonants of the string.

first N vowels of the string: includes in the functional key N first vowels of the string.

pick characters: includes in the functional key the characters located at a fixed position (corresponding to the set digital/range).

exact: includes in the functional key the full string.

substring(a,b): includes in the functional key character according to the set index.

soundex code: generates a code according to a standard English phonetic algorithm. This code represents the character string that will be included in the functional key.

metaphone code: generates a code according to the character pronunciation. This code represents the character string that will be included in the functional key.

double-metaphone code: generates a code according to the character pronunciation using a new version of the Metaphone phonetic algorithm, that produces more accurate results than the original algorithm. This code represents the character string that will be included in the functional key.

fingerPrintkey: generates the functional key from a string value through the following sequential process:

  1. remove leading and trailing whitespace,

  2. change all characters to their lowercase representation,

  3. remove all punctuation and control characters,

  4. split the string into whitespace-separated tokens,

  5. sort the tokens and remove duplicates,

  6. join the tokens back together,

    Because the string parts are sorted, the given order of tokens does not matter. So, Cruise, Tom and Tom Cruise both end up with a fingerprint cruise tom and therefore end up in the same cluster.

  7. normalize extended western characters to their ASCII representation, for example gödel to godel.

    This reproduce data entry mistakes performed when entering extended characters with an ASCII-only keyboard. However, this procedure can also lead to false positives, for example gödel and godél would both end up with godel as their fingerprint but they are likely to be different names. So this might work less effectively for datasets where extended characters play substantial differentiation role.

nGramkey: this algorithm is similar to the fingerPrintkey method described above. But instead of using whitespace separated tokens, it uses n-grams, where the n can be specified by the user. This method generates the functional key from a string value through the following sequential process:

  1. change all characters to their lowercase representation,

  2. remove all punctuation and control characters,

  3. obtain all the string n-grams,

  4. sort the n-grams and remove duplicates,

  5. join the sorted n-grams back together,

  6. normalize extended western characters to their ASCII representation, for example gödel to godel.

    For example, the 2-gram fingerprint of Paris is arispari and the 1-gram fingerprint is aiprs.

    The delivered implementation of this algorithm is 2-grams.

Note

If the column on which you want to use the nGramkey algorithm can have data, with only 0 or 1 characters, you must filter this data before generating the functional key. This way you avoid potentially comparing records to each other that should not be match candidates.

colognPhonetic: a soundex phonetic algorithm optimized for the German language. It encodes a string into a Cologne phonetic value. This code represents the character string that will be included in the functional key.

 

Value

Set the algorithm value, where applicable.

 

Post-Algorithm

If required, select the relevant matching algorithm from the list:

use default value (string): enables you to choose a string to replace null or empty data.

add left position character: enables you to add a character to the left of the column.

add right position character: enables you to add a character to the right of the column.

 

Value

Set the option value, where applicable.

 

Show help

Select this check box to display instructions on how to set algorithms/options parameters.

Advanced settings

tStat Catcher Statistics

Select this check box to collect log data at the component level. Note that this check box is not available in the Map/Reduce version of the component.

Global Variables

ERROR_MESSAGE: the error message generated by the component when an error occurs. This is an After variable and it returns a string. This variable functions only if the Die on error check box is cleared, if the component has this check box.

A Flow variable functions during the execution of a component while an After variable functions after the execution of the component.

To fill up a field or expression with a variable, press Ctrl + Space to access the variable list and choose the variable to use from it.

For further information about variables, see Talend Studio User Guide.

Usage in Map/Reduce Jobs

In a Talend Map/Reduce Job, this component is used as an intermediate step and other components used along with it must be Map/Reduce components, too. They generate native Map/Reduce code that can be executed directly in Hadoop.

For further information about a Talend Map/Reduce Job, see the sections describing how to create, convert and configure a Talend Map/Reduce Job of the Talend Big Data Getting Started Guide.

Note that in this documentation, unless otherwise explicitly stated, a scenario presents only Standard Jobs, that is to say traditional Talend data integration Jobs, and non Map/Reduce Jobs.

Limitation/prerequisite

n/a

Related scenarios

No scenario is available for the Map/Reduce version of this component yet.