Scenario 1: Selecting the best-of-breed data from a group of duplicates to create a survivor - 6.1

Talend Components Reference Guide

Version: 6.1
Applies to: Talend Big Data, Talend Big Data Platform, Talend Data Fabric, Talend Data Integration, Talend Data Management Platform, Talend Data Services Platform, Talend ESB, Talend MDM Platform, Talend Open Studio for Big Data, Talend Open Studio for Data Integration, Talend Open Studio for Data Quality, Talend Open Studio for ESB, Talend Open Studio for MDM, Talend Real-Time Big Data Platform
Task: Data Governance, Data Quality and Preparation, Design and Development
Platform: Talend Studio

The Job in this scenario uses five components to group the duplicate data and create a single representation of these duplicates. This representation is the "survivor" at the end of the selection process; you can use it, for example, to create a master copy of the data for MDM.

The components used in this Job are:

  • tFixedFlowInput: provides the input data to be processed by this Job. In a real-world use case, you can replace tFixedFlowInput with any other input component that provides the required data.

  • tMatchGroup: groups the duplicates in the input data and attaches to each group its group ID and group size. The technical names of these columns are GID and GRP_SIZE respectively, and both are required by tRuleSurvivorship.

  • tRuleSurvivorship: creates the user-defined survivor validation flow that selects the best-of-breed data composing the single representation of each duplicates group.

  • tFilterColumns: removes the technical columns and outputs only the columns that carry the actual information of interest.

  • tLogRow: presents the result of the Job execution.

Dropping and linking the components

  1. Drop tFixedFlowInput, tMatchGroup, tRuleSurvivorship, tFilterColumns and tLogRow from the Palette onto the design workspace.

  2. Right-click tFixedFlowInput to open its contextual menu and select the Row > Main link from this menu to connect this component to tMatchGroup.

  3. Do the same to create the Main link from tMatchGroup to tRuleSurvivorship, then to tFilterColumns and to tLogRow.

Configuring the process of grouping the input data

Setting up the input records

  1. Double-click tFixedFlowInput to open its Component view.

  2. Click the three-dot button next to Edit schema to open the schema editor.

  3. Click the plus button nine times to add nine rows and rename them. In this example, the rows are: acctName, addr, city, state, zip, country, phone, data, credibility. These are the nine columns of the input data schema.

  4. In the Type column, select the data types for the rows of interest. In this example, select Date for the data column and Double for the credibility column.

    Note

    Make sure you set the proper data types so that you can easily define the validation rules later.

  5. In the Date Pattern column, type in the date pattern that reflects the date format of interest. In this scenario, this format is yyyyMMdd.

  6. Click OK to validate these changes and accept the propagation prompted by the pop-up dialog box.

  7. In the Mode area of the Basic settings view, select Use Inline Content (delimited file) to enter the input data of interest.

  8. In the Content field, enter the input data to be processed. This data must correspond to the schema you have defined; in this example, the content is:

    GRIZZARD CO.;110 N MARYLAND AVE;GLENDALE;CA;912066;FR;8185431314;20110101;5
    GRIZZARD;110 NORTH MARYLAND AVENUE;GLENDALE;CA;912066;US;9003254892;20110118;4
    GRIZZARD INC;110 N. MARYLAND AVENUE;GLENDALE;CA;91206;US;(818) 543-1315;20110103;2
    GRIZZARD CO;1480 S COLORADO BOULEVARD;LOS ANGELES;CA;91206;US;(800) 325-4892;20110115;1
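
Conceptually, each line of this inline content is split on the semicolon delimiter and mapped to the nine schema columns: the data column is parsed against the yyyyMMdd pattern and credibility is read as a double. The following standalone Java sketch illustrates that mapping; it is not code generated by the Studio, and the Record holder class and its main method are illustrative only.

    import java.text.ParseException;
    import java.text.SimpleDateFormat;
    import java.util.Date;

    public class ParseInlineContent {

        // Illustrative holder mirroring the nine-column schema defined above.
        static class Record {
            String acctName, addr, city, state, zip, country, phone;
            Date data;           // the data column, Type Date, pattern yyyyMMdd
            double credibility;  // the credibility column, Type Double
        }

        static Record parse(String line) throws ParseException {
            String[] f = line.split(";");  // the inline content is semicolon-delimited
            Record r = new Record();
            r.acctName = f[0];
            r.addr = f[1];
            r.city = f[2];
            r.state = f[3];
            r.zip = f[4];
            r.country = f[5];
            r.phone = f[6];
            r.data = new SimpleDateFormat("yyyyMMdd").parse(f[7]);
            r.credibility = Double.parseDouble(f[8]);
            return r;
        }

        public static void main(String[] args) throws ParseException {
            Record r = parse("GRIZZARD CO.;110 N MARYLAND AVE;GLENDALE;CA;912066;FR;8185431314;20110101;5");
            System.out.println(r.acctName + " | " + r.data + " | " + r.credibility);
        }
    }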

Grouping the duplicate records

  1. Right-click tMatchGroup to open its contextual menu and select Configuration Wizard.

    From the wizard, you can see what your groups look like and adjust the component settings in order to correctly capture the similar matches.

  2. Click the plus button under the Key Definition table to add one row.

  3. In the Input Key Attribute column of this row, select acctName. This way, this column becomes the reference used to match the duplicates of the input data.

  4. In the Matching Function column, select the Jaro-Winkler matching algorithm.

  5. In the Match threshold field, enter the value above which two record fields are considered to match each other. In this example, type in 0.6. (A simplified sketch of this matching logic is shown after this procedure.)

  6. Click Chart to execute this matching rule and show the result in this wizard.

    If the input records are not put into a single group, replace 0.6 with a smaller value and click Chart again to check the result, until all four records are in the same group.

    The Job in this scenario puts the four similar records into a single duplicates group so that tRuleSurvivorship can create one survivor from them. This simple sample gives you a clear picture of how tRuleSurvivorship works along with other components to create the best data. In a real-world case, however, you may need to process much more data with more complex duplicate situations and thus produce many more groups.

  7. Click OK to close the Configuration Wizard. The Basic settings view of the tMatchGroup component is automatically filled with the parameters you have set.

    For further information about the Configuration Wizard, see Configuration wizard.
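
What tMatchGroup does in this scenario can be approximated by comparing the acctName values pairwise with the Jaro-Winkler algorithm and grouping records whose similarity reaches the 0.6 threshold. The sketch below is a naive, hand-written approximation (a single pivot record per group), not the component's actual implementation, and it assumes the Apache Commons Text library is on the classpath for its JaroWinklerSimilarity class.

    import java.util.ArrayList;
    import java.util.List;
    import org.apache.commons.text.similarity.JaroWinklerSimilarity;

    public class GroupByAcctName {

        public static void main(String[] args) {
            // The acctName values of the four input records.
            String[] acctNames = {"GRIZZARD CO.", "GRIZZARD", "GRIZZARD INC", "GRIZZARD CO"};
            double matchThreshold = 0.6;  // the value entered in the Match threshold field
            // Assumes Apache Commons Text (org.apache.commons:commons-text) is on the classpath.
            JaroWinklerSimilarity jw = new JaroWinklerSimilarity();

            // Naive grouping: a record joins the first group whose pivot it matches,
            // otherwise it starts a new group of its own.
            List<List<String>> groups = new ArrayList<>();
            for (String name : acctNames) {
                List<String> target = null;
                for (List<String> group : groups) {
                    if (jw.apply(group.get(0), name) >= matchThreshold) {
                        target = group;
                        break;
                    }
                }
                if (target == null) {
                    target = new ArrayList<>();
                    groups.add(target);
                }
                target.add(name);
            }
            // With a 0.6 threshold, all four names end up in one group.
            System.out.println(groups);
        }
    }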

Defining the survivor validation flow

Having configured and grouped the input data, you need to create the survivor validation flow using tRuleSurvivorship. To do this, proceed as follows:

  1. Double-click tRuleSurvivorship to open its Component view.

  2. Select GID for the Group identifier field and GRP_SIZE for the Group size field.

  3. In the Rule package name field, enter the name of the rule package you need to create to define the survivor validation flow of interest. In this example, this name is org.talend.survivorship.sample.

  4. In the Rule table, click the plus button to add as many rows as required and complete them using the corresponding rule definitions. In this example, add ten rows and complete them using the contents as follows:

    Order            Rule name             Reference column  Function      Value            Target column
    Sequential       "1_LengthAcct"        acctName          Expression    ".length >11"    acctName
    Sequential       "2_LongestAddr"       addr              Longest       n/a              addr
    Sequential       "3_HighCredibility"   credibility       Expression    "> 3"            credibility
    Sequential       "4_MostCommonCity"    city              Most common   n/a              city
    Sequential       "5_MostCommonZip"     zip               Most common   n/a              zip
    Multi-condition  n/a                   zip               Match regex   "\\d{5}"         n/a
    Multi-target     n/a                   n/a               n/a           n/a              state
    Multi-target     n/a                   n/a               n/a           n/a              country
    Sequential       "6_LatestPhone"       date              Most recent   n/a              phone
    Multi-target     n/a                   n/a               n/a           n/a              date

    These rules are executed in top-down order. The Multi-condition rule adds a condition to the 5_MostCommonZip rule, so a rule-compliant zip code must be the most common zip code and, at the same time, have five digits. The zip column is the target column of the 5_MostCommonZip rule, and the two Multi-target rules below it add two more target columns, state and country, so the zip, state and country columns will be the sources of the best-of-breed data. Thus, once a zip code is validated, the corresponding record field values from these three columns are selected.

    The same is true for the Sequential rule 6_LatestPhone: once a date value is validated, the corresponding record field values are selected from the phone and the date columns. (A simplified sketch of this selection logic is shown after this procedure.)

    Note

    In this table, the fields reading n/a are not available for the Order type or Function type you have selected. In the Rule table of the Basic settings view of tRuleSurvivorship, these unavailable fields are greyed out. For further information about this Rule table, see the properties table at the beginning of this tRuleSurvivorship section.

  5. Click the icon next to Generate rules and survivorship flow to generate the rule package with the contents you have defined.

    Once done, you can find the generated rule package in the Metadata > Rules Management > Survivorship Rules directory of your Studio Repository. From there, you are able to open the newly created survivor validation flow of this example and read its diagram. For further information, see Talend Studio User Guide.
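
The generated rule package is executed by a rule engine inside tRuleSurvivorship, so the code below is not what the Studio generates. It is only a hand-written approximation of how two of the rules above behave: 2_LongestAddr keeps the longest addr value, while 5_MostCommonZip combined with its Multi-condition row keeps the most common five-digit zip, and the Multi-target rows take state and country from a record carrying that winning zip. The Row record and the sample values are illustrative; Java 16 or later is assumed for the record syntax.

    import java.util.Comparator;
    import java.util.List;
    import java.util.Map;
    import java.util.stream.Collectors;

    public class SurvivorshipSketch {

        // Illustrative record holding only the columns used by the rules shown here.
        record Row(String addr, String zip, String state, String country) {}

        public static void main(String[] args) {
            List<Row> group = List.of(
                new Row("110 N MARYLAND AVE",        "912066", "CA", "FR"),
                new Row("110 NORTH MARYLAND AVENUE", "912066", "CA", "US"),
                new Row("110 N. MARYLAND AVENUE",    "91206",  "CA", "US"),
                new Row("1480 S COLORADO BOULEVARD", "91206",  "CA", "US"));

            // Rule "2_LongestAddr": function Longest on addr -> survivor addr.
            String survivorAddr = group.stream()
                .map(Row::addr)
                .max(Comparator.comparingInt(String::length))
                .orElse(null);

            // Rule "5_MostCommonZip" plus the Multi-condition row "Match regex \\d{5}":
            // only five-digit zips are candidates, then the most common one wins.
            String survivorZip = group.stream()
                .map(Row::zip)
                .filter(z -> z.matches("\\d{5}"))
                .collect(Collectors.groupingBy(z -> z, Collectors.counting()))
                .entrySet().stream()
                .max(Map.Entry.comparingByValue())
                .map(Map.Entry::getKey)
                .orElse(null);

            // Multi-target rows: state and country come from a record carrying the winning zip.
            Row zipWinner = group.stream()
                .filter(r -> r.zip().equals(survivorZip))
                .findFirst()
                .orElse(null);

            System.out.println("addr    -> " + survivorAddr);
            System.out.println("zip     -> " + survivorZip);
            System.out.println("state   -> " + (zipWinner == null ? null : zipWinner.state()));
            System.out.println("country -> " + (zipWinner == null ? null : zipWinner.country()));
        }
    }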

Selecting the columns of interest

The schema of tRuleSurvivorship includes several technical columns, such as GID and GRP_SIZE, which are not of interest in this example, so you need to use tFilterColumns to remove these technical columns and keep only the columns that carry the actual data to be output. To do this, proceed as follows:

  1. Double-click tFilterColumns to open its Component view.

  2. Click Sync columns to retrieve the schema from its preceding component. If a dialog box pops up to prompt the propagation, click Yes to accept it.

  3. Click the three-dot button next to Edit schema to open the schema editor.

  4. On the tFilterColumns side of this editor, select the GID, GRP_SIZE, MASTER and SCORE columns and click the red cross icon below to remove them.

  5. Click OK to validate these changes and accept the propagation prompted by the pop-up dialog box.
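
What this step achieves amounts to dropping the four technical columns from every outgoing record while keeping the business columns. A minimal Java sketch of that idea, using a map keyed by column name (the values shown are made up for the example):

    import java.util.LinkedHashMap;
    import java.util.List;
    import java.util.Map;

    public class DropTechnicalColumns {

        public static void main(String[] args) {
            // Illustrative row keyed by column name, as it might leave tRuleSurvivorship
            // (values are invented for the example).
            Map<String, Object> row = new LinkedHashMap<>();
            row.put("GID", "grp-1");
            row.put("GRP_SIZE", 4);
            row.put("MASTER", false);
            row.put("SCORE", 0.97);
            row.put("acctName", "GRIZZARD CO.");
            row.put("addr", "110 NORTH MARYLAND AVENUE");

            // Same effect as removing GID, GRP_SIZE, MASTER and SCORE in the schema editor.
            List.of("GID", "GRP_SIZE", "MASTER", "SCORE").forEach(row::remove);

            System.out.println(row);  // only the business columns remain
        }
    }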

Executing the Job

The tLogRow component is used to present the execution result of the Job. You can configure the presentation mode on its Component view.

To do this, double-click tLogRow to open the Component view and, in the Mode area, select the Table (print values in cells of a table) option.

To execute this Job, press F6.

Once done, the Run view is opened automatically, where you can check the execution result.

You can read that the last row is the survivor record because its SURVIVOR column indicates true. This record is composed of the best-of-breed data of each column from the four other rows which are the duplicates of the same group.

The CONFLICT column lists the columns carrying more than one record field value compliant with the given validation rules. Take the credibility column as an example: apart from the survivor record whose credibility is 5.0, the CONFLICT column indicates that the credibility of the second record, GRIZZARD, is 4.0, which is also greater than 3, the threshold set in the rules you have defined. However, because the credibility 5.0 appears in the first record, GRIZZARD CO., tRuleSurvivorship selects it as the best-of-breed data.