This three-component scenario describes a basic Job that generates a functional key for each of the data records using one algorithm on one of the input columns, PostalCode.
This functional key can be used in different ways, for example to narrow down the results of data filtering or data matching. The tGenKey component can therefore be combined with many other data quality and data integration components to build a variety of use cases. For an example of one use case of tGenKey, see Scenario 2: Comparing columns and grouping in the output flow duplicate records that have the same functional key.
In this scenario, the input data flow has four columns: Firstname, Lastname, DOB (date of birth), and PostalCode. This data has problems such as duplicate records, first or last names that are misspelled or spelled differently, and different information for the same customer. This scenario generates a functional key for each data record using an algorithm that takes the first two characters of the postal code.
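Outside Talend Studio, the key generation described above can be sketched in a few lines of Python. This is only an illustration of the idea, not the component's generated code: the `first_n_chars` and `add_functional_key` helpers, the `functional_key` field name and the sample records are all assumptions made for this sketch.

```python
def first_n_chars(value, n):
    """Return the first n characters of value (empty string for None)."""
    return (value or "")[:n]


def add_functional_key(records, column="PostalCode", n=2):
    """Return copies of the records with a hypothetical 'functional_key' field."""
    return [dict(r, functional_key=first_n_chars(r.get(column), n)) for r in records]


# Invented sample rows with the four columns used in this scenario.
records = [
    {"Firstname": "Ann", "Lastname": "Smith", "DOB": "1980-01-15", "PostalCode": "75001"},
    {"Firstname": "Anne", "Lastname": "Smith", "DOB": "1980-01-15", "PostalCode": "75011"},
    {"Firstname": "Bob", "Lastname": "Jones", "DOB": "1975-06-30", "PostalCode": "10115"},
]

keyed = add_functional_key(records)
for row in keyed:
    print(row["Lastname"], row["PostalCode"], row["functional_key"])
```

In this sketch, the two Smith rows receive the same key even though their first names are spelled differently, which is what lets a later matching step compare them within one block.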
Drop the following components from the Palette onto the design workspace: tFixedFlowInput, tGenKey and tLogRow.
Connect all the components together using the Main link.
Double-click tFixedFlowInput to display the Basic settings view and define the component properties.
Click the [...] button next to Edit Schema to open the [Schema] dialog box.
Click the plus button to add as many lines as needed for the input schema you want to create from internal variables.
In this example, the schema is made of four columns: Firstname, Lastname, DOB and PostalCode.
Then, click OK to close the dialog box.
In the Mode area, select the Use Inline Table option.
The Value table displays as Inline Table.
Click the plus button to add as many lines as needed, and then click in each line to define the input data for the four columns.
Double-click tGenKey to display the Basic settings view and define the component properties.
You can import blocking keys from match rules that were created with the VSR algorithm and tested in the Profiling perspective of Talend Studio, and use them in your Job. Otherwise, define the blocking key parameters as described in the steps below.
Under the Algorithm table, click the plus button to add a row in this table.
In the column column, click the newly added row and select from the list the column you want to process using an algorithm. In this example, select PostalCode.
When you select a date column on which to apply an algorithm or a matching algorithm, you can decide what to compare in the date format.
For example, to compare only the year in the date, set the type of the date column to Date in the component schema and then enter "yyyy" in the Date Pattern field. The component converts the date to a string according to the pattern defined in the schema before it starts the string comparison.
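The effect of such a date pattern can be sketched in Python as follows. This is an illustration only: Python's `%Y` directive stands in for the "yyyy" pattern, and the `date_to_key_string` helper is a hypothetical name, not part of the component.

```python
from datetime import date


def date_to_key_string(value, pattern="%Y"):
    """Format a date as a string so that only the chosen parts are compared."""
    return value.strftime(pattern)


dob_a = date(1980, 1, 15)
dob_b = date(1980, 11, 2)

# The full dates differ, but the year-only strings are equal.
same_year = date_to_key_string(dob_a) == date_to_key_string(dob_b)
print(date_to_key_string(dob_a), date_to_key_string(dob_b), same_year)
```

Because the comparison runs on the formatted strings, two dates that differ only in month or day are treated as equal once the pattern keeps nothing but the year.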
Select the Show help check box to display instructions on how to set algorithms/options parameters.
In the algorithm column, click the newly added row and select from the list the algorithm you want to apply to the corresponding column. In this example, select first N characters of the string.
Click in the Value column and enter the value for the selected algorithm, when needed. In this scenario, type in 2.
In this example, we want to generate a functional key that holds the first two characters of the postal code for each of the data rows and we do not want to define any extra options on these columns.
Make sure to set a value for any algorithm that needs one; otherwise you may get a compilation error when you run the Job.
Once you have defined the tGenKey properties, you can display a statistical view of these parameters. To do so:
Right-click the tGenKey component and select View Key Profile in the contextual menu.
The View Key Profile editor displays, allowing you to visualize the statistics regarding the number of rows per block and to adapt them according to the results you want to get.
When you process a large amount of data and use this component to partition it for a matching component (such as tRecordMatching or tMatchGroup), it is preferable to limit the number of rows in each block. About 50 rows per block is considered optimal, but this depends on the number of fields to compare, the total number of rows and the processing time considered acceptable.
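The kind of statistic shown in the key profile, rows per block, can be approximated with a short Python sketch. The `rows_per_block` and `oversized_blocks` helpers and the sample keys are assumptions made for illustration; the default threshold of 50 comes from the guideline above and should be read as a rough target, not a hard limit.

```python
from collections import Counter


def rows_per_block(keys):
    """Count how many rows share each functional key (that is, each block)."""
    return Counter(keys)


def oversized_blocks(keys, max_rows=50):
    """Return the blocks whose row count exceeds max_rows."""
    return {k: c for k, c in rows_per_block(keys).items() if c > max_rows}


# Invented keys: 120 rows in block "75", 30 in block "10", 5 in block "69".
keys = ["75"] * 120 + ["10"] * 30 + ["69"] * 5

print(rows_per_block(keys))
print(oversized_blocks(keys))
```

A block that appears in the second result suggests that the key is too coarse for matching. Taking more characters of the postal code, or adding a second column to the key, would split it into smaller blocks.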
For an example of how to use the View Key Profile option, see Scenario 2: Comparing columns and grouping in the output flow duplicate records that have the same functional key.
Double-click the tLogRow component to display the Basic settings view.
In the Mode area, select Table to display the Job execution result in table cells.