Defining the survivor validation flow

Deduplication

author
Talend Documentation Team
EnrichVersion
6.5
EnrichProdName
Talend Data Services Platform
Talend ESB
Talend Open Studio for Big Data
Talend Big Data
Talend Open Studio for ESB
Talend Big Data Platform
Talend Real-Time Big Data Platform
Talend Open Studio for Data Integration
Talend Open Studio for MDM
Talend Data Management Platform
Talend Data Integration
Talend MDM Platform
Talend Data Fabric
task
EnrichPlatform
Talend Studio

About this task

Having configured and grouped the input data, you need to create the survivor validation flow using tRuleSurvivorship. To do this, proceed as follows:

Procedure

  1. Double-click tRuleSurvivorship to open its Component view.
  2. Select GID for the Group identifier field and GRP_SIZE for the Group size field.
  3. In the Rule package name field, enter the name of the rule package to create; this package defines the survivor validation flow. In this example, the name is org.talend.survivorship.sample.
  4. In the Rule table, click the plus button to add as many rows as required and complete them using the corresponding rule definitions. In this example, add ten rows and complete them using the table below:

    Order            Rule name            Reference column  Function     Value          Target column
    ---------------  -------------------  ----------------  -----------  -------------  -------------
    Sequential       "1_LengthAcct"       acctName          Expression   ".length >11"  acctName
    Sequential       "2_LongestAddr"      addr              Longest      n/a            addr
    Sequential       "3_HighCredibility"  credibility       Expression   "> 3"          credibility
    Sequential       "4_MostCommonCity"   city              Most common  n/a            city
    Sequential       "5_MostCommonZip"    zip               Most common  n/a            zip
    Multi-condition  n/a                  zip               Match regex  "\\d{5}"      n/a
    Multi-target     n/a                  n/a               n/a          n/a            state
    Multi-target     n/a                  n/a               n/a          n/a            country
    Sequential       "6_LatestPhone"      date              Most recent  n/a            phone
    Multi-target     n/a                  n/a               n/a          n/a            date

    Do not use special characters in rule names; otherwise, the Job may not run correctly.
    These rules are executed in top-down order. The Multi-condition rule adds a condition to the 5_MostCommonZip rule, so a rule-compliant zip code must be the most common zip code and must also have five digits. The zip column is the target column of the 5_MostCommonZip rule, and the two Multi-target rules below it add two more target columns, state and country. The zip, state and country columns are therefore the source of the best-of-breed data: once a zip code is validated, the field values of the corresponding record are selected from these three columns.
    The same is true of the Sequential rule 6_LatestPhone. Once a date value is validated, the field values of the corresponding record are selected from the phone and date columns.
    Note:

    In this table, a field marked n/a is not available for the Order type or Function type selected in that row. In the Rule table of the Basic settings view of tRuleSurvivorship, these unavailable fields are greyed out. For further information about this rule table, see the properties table at the beginning of this tRuleSurvivorship section.

  5. Next to Generate rules and survivorship flow, click the icon to generate the rule package with the contents you have defined.
    Once done, you can find the generated rule package in the Metadata > Rules Management > Survivorship Rules directory of your Studio Repository. From there, you can open the newly created survivor validation flow of this example and read its diagram. For further information, see the Talend Studio User Guide.
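
The way the 5_MostCommonZip and 6_LatestPhone rules combine with their Multi-condition and Multi-target rows can be illustrated with a small Python sketch. This is a simplified model intended only as a reading aid for the rule table, not the code that tRuleSurvivorship generates. The sample records, the function names most_common_zip and pick_survivor, the tie-breaking behavior, and the choice of the first matching record are all assumptions made for illustration.

```python
import re
from collections import Counter
from datetime import date

# Hypothetical records of one duplicate group (same GID); not Talend sample data.
group = [
    {"zip": "94103", "state": "CA", "country": "US",
     "phone": "555-0100", "date": date(2011, 3, 1)},
    {"zip": "94103", "state": "CA", "country": "USA",
     "phone": "555-0101", "date": date(2012, 6, 15)},
    {"zip": "9410", "state": "California", "country": "US",
     "phone": "555-0102", "date": date(2010, 1, 9)},
]


def most_common_zip(records):
    """Model of 5_MostCommonZip plus its Multi-condition row.

    Returns a record whose zip is the most common one in the group AND
    consists of exactly five digits (the table writes the pattern as
    "\\d{5}"), or None if the most common zip fails that condition.
    """
    counts = Counter(r["zip"] for r in records)
    # Assumption: on a tie, the value seen first wins.
    top_zip, _ = counts.most_common(1)[0]
    if re.fullmatch(r"\d{5}", top_zip) is None:
        return None
    # Assumption: the first record carrying the validated zip is used.
    return next(r for r in records if r["zip"] == top_zip)


def pick_survivor(records):
    """Build the surviving values for the zip and phone related columns."""
    survivor = {}

    # Rule 5 targets zip; the two Multi-target rows add state and country.
    zip_record = most_common_zip(records)
    if zip_record is not None:
        for column in ("zip", "state", "country"):
            survivor[column] = zip_record[column]

    # Rule 6 validates the most recent date; it targets phone, and the
    # Multi-target row below it adds date.
    latest = max(records, key=lambda r: r["date"])
    for column in ("phone", "date"):
        survivor[column] = latest[column]

    return survivor


survivor = pick_survivor(group)
```

With these hypothetical records, 94103 is both the most common zip code and a five-digit value, so zip, state and country are taken from the first record that carries it, while phone and date are taken from the record with the most recent date.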