Scenario 1: Generating functional keys in the output flow - 6.1

Talend Components Reference Guide

This three-component scenario describes a basic Job that generates a functional key for each of the data records using one algorithm on one of the input columns, PostalCode.

This functional key can be used in different ways, for example to narrow down the results of data filtering or data matching. The tGenKey component can therefore be combined with many other data quality and data integration components to build a range of use cases. For an example use case of tGenKey, see Scenario 2: Comparing columns and grouping in the output flow duplicate records that have the same functional key.

In this scenario, the input data flow has four columns: Firstname, Lastname, DOB (date of birth), and PostalCode. The data has problems such as duplicate records, first or last names that are misspelled or spelled inconsistently, and conflicting information for the same customer. This scenario generates a functional key for each data record using an algorithm that takes the first two characters of the postal code.
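The key generation described above can be sketched in plain Java. This is only an illustration of the "first N characters" idea, not the code that tGenKey generates: the class name, method names, sample rows and the handling of null or short values are all assumptions made for this sketch.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper, not part of Talend: illustrates a "first N characters" key.
final class FunctionalKeySketch {

    // Returns the first n characters of value; shorter values are returned whole.
    // Treating null as an empty key is an assumption, not documented tGenKey behavior.
    static String firstN(String value, int n) {
        if (value == null) {
            return "";
        }
        return value.length() <= n ? value : value.substring(0, n);
    }

    // Appends the key as an extra trailing field, mirroring the T_GEN_KEY output column.
    static List<String[]> withKey(List<String[]> rows, int keyColumn, int n) {
        List<String[]> out = new ArrayList<>();
        for (String[] row : rows) {
            String[] extended = Arrays.copyOf(row, row.length + 1);
            extended[row.length] = firstN(row[keyColumn], n);
            out.add(extended);
        }
        return out;
    }

    public static void main(String[] args) {
        // Hypothetical sample rows: Firstname, Lastname, DOB, PostalCode
        List<String[]> rows = new ArrayList<>();
        rows.add(new String[] {"Ann", "Smith", "1980-01-15", "75001"});
        rows.add(new String[] {"Anne", "Smyth", "1980-01-15", "75002"});
        for (String[] row : withKey(rows, 3, 2)) {
            System.out.println(String.join("|", row));
        }
    }
}
```

Both sample rows receive the same key, 75, so a later matching step would only need to compare them with each other rather than with every record in the data set.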

Setting up the Job

  1. Drop the following components from the Palette onto the design workspace: tFixedFlowInput, tGenKey and tLogRow.

  2. Connect all the components together using the Main link.

Configuring the data input

  1. Double-click tFixedFlowInput to display the Basic settings view and define the component properties.

  2. Click the [...] button next to Edit Schema to open the [Schema] dialog box.

  3. Click the plus button to add as many lines as needed for the input schema you want to create from internal variables.

    In this example, the schema is made of four columns: Firstname, Lastname, DOB and PostalCode.

    Then, click OK to close the dialog box.

  4. In the Mode area, select the Use Inline Table option.

    The Value table displays as Inline Table.

  5. Click the plus button to add as many lines as needed, and then click in each line to define the input data for the four columns.

Configuring key generation

  1. Double-click tGenKey to display the Basic settings view and define the component properties.

    You can import blocking keys from match rules that were created with the VSR algorithm and tested in the Profiling perspective of Talend Studio, and use them in your Job. Otherwise, define the blocking key parameters as described in the steps below.

  2. Under the Algorithm table, click the plus button to add a row in this table.

  3. In the Column column, click the newly added row and select from the list the column you want to process using an algorithm. In this example, select PostalCode.

    Note

    When you select a date column on which to apply an algorithm or a matching algorithm, you can decide what to compare in the date format.

    For example, if you want to only compare the year in the date, in the component schema set the type of the date column to Date and then enter "yyyy" in the Date Pattern field. The component then converts the date format to a string according to the pattern defined in the schema before starting a string comparison.

  4. In the Algorithm column, click the newly added row and select from the list the algorithm you want to apply to the corresponding column. In this example, select first N characters of the string.

  5. Click in the Value column and enter the value for the selected algorithm, when needed. In this scenario, type in 2.

    In this example, we want to generate a functional key that holds the first two characters of the postal code for each of the data rows and we do not want to define any extra options on these columns.

    Note

    You can select the Show help check box to display instructions on how to set algorithms/options parameters.

    Once you have defined the tGenKey properties, you can display a statistical view of these parameters. To do so:

  6. Right-click the tGenKey component and select View Key Profile in the contextual menu.

    The View Key Profile editor displays, allowing you to visualize the statistics regarding the number of rows per block and to adapt them according to the results you want to get.

    Note

    When you are processing a large amount of data and this component is used to partition the data for a matching component (such as tRecordMatching or tMatchGroup), it is preferable to keep the number of rows in each block limited. About 50 rows per block is considered optimal, but this depends on the number of fields to compare, the total number of rows and the processing time you consider acceptable.
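Two of the notes in the steps above can be illustrated with a small Java sketch: converting a date to a string with a pattern such as "yyyy" before comparison, and counting the rows that fall into each block, which is the kind of statistic the key profile shows. The class name, method names and sample keys are hypothetical, and the sketch is not the component's actual implementation.

```java
import java.text.SimpleDateFormat;
import java.util.Arrays;
import java.util.Calendar;
import java.util.Date;
import java.util.GregorianCalendar;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper, not part of Talend.
final class KeyProfileSketch {

    // Formats a date with the given pattern so that only the chosen part
    // (for example the year) takes part in a later string comparison.
    static String dateAsString(Date date, String pattern) {
        return new SimpleDateFormat(pattern).format(date);
    }

    // Counts the rows per functional key: one block per distinct key,
    // kept in order of first appearance.
    static Map<String, Integer> rowsPerBlock(List<String> keys) {
        Map<String, Integer> counts = new LinkedHashMap<>();
        for (String key : keys) {
            counts.merge(key, 1, Integer::sum);
        }
        return counts;
    }

    public static void main(String[] args) {
        Date dob = new GregorianCalendar(1980, Calendar.JANUARY, 15).getTime();
        System.out.println(dateAsString(dob, "yyyy"));

        // Hypothetical functional keys, as produced from the first two postal code characters.
        List<String> keys = Arrays.asList("75", "75", "69", "13", "75");
        System.out.println(rowsPerBlock(keys));
    }
}
```

If one block turns out much larger than the others, a longer key (a higher N, or an additional column) is one way to split it into smaller blocks.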

Configuring the console output

  1. Double-click the tLogRow component to display the Basic settings view.

  2. In the Mode area, select Table to display the Job execution result in table cells.

Executing the Job

  • Save your Job and press F6 to execute it.

    All the output columns including the T_GEN_KEY column are listed in the Run console. The functional key for each data record is concatenated from the first two characters of the corresponding value in the PostalCode column.