This scenario describes a five-component Job that generates data records in the form of tasks and loads them into the stewardship console database.
These tasks later require an authorized data steward to intervene and merge, compare and resolve the data records they hold. For further information, see Talend Data Stewardship Console User Guide.
In this scenario:
A tFixedFlowInput component generates an input data flow that has five columns: Source, Firstname, Lastname, DOB (date of birth), and PostalCode. This data has problems such as duplication, first or last names spelled differently or wrongly, different information for the same customer, etc.
A tMatchGroup data quality component carries out matching operations on data across the heterogeneous sources defined in the input Source column. This component groups the output columns by a blocking value to optimize the matching operation and compare only the records that have the same blocking value, the Source column in this scenario. For more information on grouping output columns and using blocking values, see tMatchGroup.
A tMap component filters the input flow into unique data records and data records that have matching distances.
The unique data records are displayed on the Run console via the tLogRow component. All other data records that have a matching distance are sent to the Talend Data Stewardship Console database through the tStewardshipTaskOutput component and are displayed in the stewardship console. An authorized data steward can then intervene to merge the data records with matching distances.
For detailed information about related scenarios, see Scenario 1: Generating functional keys in the output flow and Scenario 2: Comparing columns and grouping in the output flow duplicate records that have the same functional key.
Drop the following components from the Palette onto the design workspace: tFixedFlowInput, tMatchGroup, tMap, tStewardshipTaskOutput and tLogRow.
Connect the first three components together using the Main link.
Double-click tFixedFlowInput to display the Basic settings view and define the component properties as described in Scenario 1: Generating functional keys in the output flow.
The tFixedFlowInput component generates an input data flow that has five columns: Source, Firstname, Lastname, DOB (date of birth), and PostalCode. This data has problems such as duplication, first or last names spelled differently or wrongly, different information for the same customer, etc.
Double-click the tMatchGroup component to display the Basic settings view and define the component properties.
Click Sync columns to retrieve the schema from the preceding component.
If required, click the Edit schema button to view the input and output schemas and make any changes to the output schema.
The output schema of this component has four standard output columns that are read-only. For more information, see tMatchGroup properties.
In the Key definition table, click the [+] button to add the columns on which you want to perform the matching operation, Firstname and Lastname in this scenario.
Click in the first and second cells of the Matching type column and select from the list the method(s) to be used for the matching operation, Jaro-Winkler in this example.
Click in the first and second cells of the Confidence Weight column and set the numerical weights for each of the columns used as key attributes.
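The effect of the Matching type and Confidence Weight settings can be illustrated with a minimal Java sketch. It implements the standard Jaro-Winkler similarity and then combines the per-column similarities as a normalized weighted average. The class name, the method names, the sample names and the weighted-average formula are assumptions made for illustration; they are not taken from the tMatchGroup implementation, which may compute its score differently.

```java
// Illustrative sketch only: not the tMatchGroup source code.
final class WeightedMatchSketch {

    /** Standard Jaro similarity, in the range [0, 1]. */
    static double jaro(String s, String t) {
        if (s.equals(t)) return 1.0;
        int sl = s.length(), tl = t.length();
        if (sl == 0 || tl == 0) return 0.0;
        // Characters match only if they are at most 'window' positions apart.
        int window = Math.max(0, Math.max(sl, tl) / 2 - 1);
        boolean[] sMatched = new boolean[sl];
        boolean[] tMatched = new boolean[tl];
        int matches = 0;
        for (int i = 0; i < sl; i++) {
            int lo = Math.max(0, i - window);
            int hi = Math.min(tl - 1, i + window);
            for (int j = lo; j <= hi; j++) {
                if (!tMatched[j] && s.charAt(i) == t.charAt(j)) {
                    sMatched[i] = true;
                    tMatched[j] = true;
                    matches++;
                    break;
                }
            }
        }
        if (matches == 0) return 0.0;
        // Count matched characters that appear in a different order.
        int k = 0, outOfOrder = 0;
        for (int i = 0; i < sl; i++) {
            if (!sMatched[i]) continue;
            while (!tMatched[k]) k++;
            if (s.charAt(i) != t.charAt(k)) outOfOrder++;
            k++;
        }
        double m = matches;
        return (m / sl + m / tl + (m - outOfOrder / 2.0) / m) / 3.0;
    }

    /** Jaro-Winkler: boosts the Jaro score for a common prefix of up to 4 characters. */
    static double jaroWinkler(String s, String t) {
        double j = jaro(s, t);
        int max = Math.min(4, Math.min(s.length(), t.length()));
        int prefix = 0;
        while (prefix < max && s.charAt(prefix) == t.charAt(prefix)) prefix++;
        return j + prefix * 0.1 * (1.0 - j);
    }

    /** Assumed combination rule: weighted average of the per-column similarities. */
    static double weightedScore(double[] similarities, double[] weights) {
        if (similarities.length != weights.length) {
            throw new IllegalArgumentException("one weight per similarity is required");
        }
        double sum = 0.0, total = 0.0;
        for (int i = 0; i < weights.length; i++) {
            sum += similarities[i] * weights[i];
            total += weights[i];
        }
        return total == 0.0 ? 0.0 : sum / total;
    }

    public static void main(String[] args) {
        // Hypothetical name pair and weights, chosen only for the example.
        double first = jaroWinkler("Jon", "John");
        double last = jaroWinkler("Smith", "Smyth");
        System.out.println(weightedScore(new double[] {first, last}, new double[] {1.0, 2.0}));
    }
}
```

Under this assumed rule, a column with a higher weight pulls the overall score more strongly toward its own similarity, which is the practical meaning of giving one key attribute more confidence than another.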
Click the [+] button below the Blocking Definition table to add a line in the table then click in the line and select from the list the column you want to use as a blocking value, Source in this example.
Using a blocking value reduces the number of record pairs that need to be examined. The input data is partitioned into exhaustive blocks based on the data source, and comparison is restricted to the record pairs within each block.
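The saving from blocking can be made concrete with a short Java sketch that counts candidate pairs with and without a blocking value. The class name, the method names and the sample Source values are hypothetical and serve only to illustrate the arithmetic.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Illustrative sketch only: counts comparisons, it does not perform any matching.
final class BlockingPairsSketch {

    /** Number of distinct unordered pairs among n records: n * (n - 1) / 2. */
    static long pairs(long n) {
        return n * (n - 1) / 2;
    }

    /** Pairs to compare when records are only compared within the same block. */
    static long pairsWithBlocking(List<String> blockingValues) {
        Map<String, Long> blockSizes = new HashMap<>();
        for (String value : blockingValues) {
            blockSizes.merge(value, 1L, Long::sum);
        }
        long total = 0;
        for (long size : blockSizes.values()) {
            total += pairs(size);
        }
        return total;
    }

    public static void main(String[] args) {
        // Hypothetical Source values for six records: two blocks of three.
        List<String> sources = Arrays.asList("A", "A", "A", "B", "B", "B");
        System.out.println(pairs(sources.size()));      // prints 15
        System.out.println(pairsWithBlocking(sources)); // prints 6
    }
}
```

With six records split evenly across two sources, blocking cuts the work from 15 comparisons to 6. The gap widens quickly as the number of records grows, because the unblocked count grows quadratically.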
Double-click the tMap component to open the Map Editor.
The input area to the left is already filled with the input schema coming from the previous component in the Job design.
Click the [+] button in the upper right corner of the output area to add as many output tables as needed, two in this example: uniques and groups. The first table will hold the unique data records and the second will hold all the records that have matching distances to the master record in each group.
Drop the input columns to fill in the first output schema. For further information regarding data mapping, see Talend Studio User Guide.
All the columns are automatically filled in the Schema Editor in the lower half of the Map Editor.
Click in the upper right corner of the first output table to add a condition to filter the data in the first output table: row2.GRP_SIZE == 1.
Drop the input columns to fill in the second output schema and add the following filter: row2.GRP_SIZE > 1 || !row2.MASTER.
In the Schema Editor of the second output table, click the [+] button to add two extra columns: weight and istarget. The first column measures the matching distance and the second determines whether the record is a target record or a source record.
Click OK to close the Map Editor.
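The two filter conditions set in the Map Editor are ordinary Java boolean expressions, so their routing behavior can be checked with a small sketch. The Row class below is a hypothetical stand-in for the row2 flow and carries only the GRP_SIZE and MASTER columns. The sketch also assumes that GRP_SIZE is populated only on the master record of each group and is 0 on the other records; verify this against the tMatchGroup properties for your version.

```java
// Illustrative sketch only: mimics the two tMap output filters from this scenario.
final class TMapFilterSketch {

    /** Hypothetical stand-in for the row2 input flow. */
    static final class Row {
        final int GRP_SIZE;
        final boolean MASTER;

        Row(int grpSize, boolean master) {
            this.GRP_SIZE = grpSize;
            this.MASTER = master;
        }
    }

    /** Filter of the uniques output table. */
    static boolean toUniques(Row row2) {
        return row2.GRP_SIZE == 1;
    }

    /** Filter of the groups output table. */
    static boolean toGroups(Row row2) {
        return row2.GRP_SIZE > 1 || !row2.MASTER;
    }

    public static void main(String[] args) {
        Row unique = new Row(1, true);       // a record with no match
        Row master = new Row(3, true);       // master of a group of three
        Row duplicate = new Row(0, false);   // assumed shape of a non-master record
        for (Row r : new Row[] {unique, master, duplicate}) {
            System.out.println("uniques=" + toUniques(r) + " groups=" + toGroups(r));
        }
    }
}
```

Under the stated assumption, every record satisfies exactly one of the two filters: single-record groups go to uniques, while group masters and their duplicates go to groups.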
In the design workspace, right-click tMap, select the uniques link and drop it on the tLogRow component. Do the same with the groups link to connect tMap to tStewardshipTaskOutput.
Double-click the tStewardshipTaskOutput component to display its Basic settings view and define the component properties.