tRuleSurvivorship Properties - 6.1

Talend Components Reference Guide

Component family

Data Quality

 

Function

tRuleSurvivorship receives records where duplicates, or possible duplicates, are already estimated and grouped together. Based on user-defined business rules, it creates one single representation for each duplicates group using the best-of-breed data. This representation is called a "survivor".

Purpose

tRuleSurvivorship creates the single representation of an entity according to business rules. It helps to create a master copy of data for MDM.

Basic settings

Schema and Edit schema

A schema is a row description. It defines the number of fields to be processed and passed on to the next component. The schema is either Built-in or stored remotely in the Repository.

Since version 5.6, both the Built-In mode and the Repository mode are available in any of the Talend solutions.

This component provides two read-only columns:

  • SURVIVOR: this column is of type Boolean. It indicates whether a record is the survivor (true) or not (false). There is only one survivor for each group.

  • CONFLICT: when more than one record meets a given business rule, this column presents them.

 

 

Built-in: The schema will be created and stored locally for this component only. Related topic: see Talend Studio User Guide.

 

 

Repository: The schema already exists and is stored in the Repository, hence can be reused in various projects and job designs. Related topic: see Talend Studio User Guide.

 

Group identifier

From the input schema, select the column that contains the group identifiers.

 

Group size

From the input schema, select the column that contains the group size.

 

Rule package name

Type in the name of the rule package you want to create with this component.

Generate rules and survivorship flow

Once you have defined all of the rules of a rule package or modified some of them with this component, click the icon to generate this rule package into the Survivorship Rules node of Rules Management under Metadata in the Repository of the Integration perspective of your Studio.

Note

This step is necessary to validate these changes and take them into account at runtime. If a rule package of the same name already exists in the Repository, the generated package overwrites it once validated. If you do not generate the package, the version in the Repository takes priority during execution.

 

Rule table

Complete this table to create a survivor validation flow. Each rule is defined as an execution step, so the rules, read from top to bottom, form a sequence that makes up the flow. The columns of this table are:

Order: From the list, select the execution order of the rules you are creating so as to define a survivor validation flow. The types of order may be:

  • Sequential: a Sequential rule is an execution step of the survivor validation flow. For example, the first rule on the top of this Rule table will be the first step and from this rule down, the second Sequential rule will be the second step.

    The first rule on the top must be a Sequential rule.

  • Multi-condition: a Multi-condition rule is an additional rule for a given execution step. It is always added to the last Sequential rule above it in this table, and at that step both rules must be satisfied. For example, having defined the first Sequential rule, you define a Multi-condition rule below it; both of them then become the rules of the first step.

  • Multi-target: each step, once executed, validates a record field value from a given Reference column and selects the corresponding value as the best from a given Target column. A Multi-target rule allows you to add one more Target column to the same step.

    You need to define each Reference column and Target column manually in this table.
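The way the three order types combine into steps can be sketched as follows. This is only an illustration of the reading order described above, not the component's actual implementation: the dictionary keys (order, rule_name, reference, function, target), the Step class and the sample rule names are all hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class Step:
    """One execution step of a survivor validation flow (hypothetical model)."""
    name: str
    conditions: list = field(default_factory=list)  # (reference column, function) pairs
    targets: list = field(default_factory=list)     # target columns filled by this step


def build_flow(rule_table):
    """Fold rule-table rows, read top-down, into ordered execution steps."""
    steps = []
    for row in rule_table:
        order = row["order"]
        if order == "Sequential":
            # A Sequential rule opens a new step.
            step = Step(name=row["rule_name"])
            step.conditions.append((row["reference"], row["function"]))
            step.targets.append(row["target"])
            steps.append(step)
        elif not steps:
            raise ValueError("The first rule in the table must be Sequential")
        elif order == "Multi-condition":
            # An extra condition attached to the last Sequential rule above.
            steps[-1].conditions.append((row["reference"], row["function"]))
        elif order == "Multi-target":
            # One more target column for the same step; no reference column.
            steps[-1].targets.append(row["target"])
        else:
            raise ValueError(f"Unknown order type: {order}")
    return steps


# Hypothetical rule table with two steps.
rule_table = [
    {"order": "Sequential", "rule_name": "1_LatestRecord",
     "reference": "updated_at", "function": "Most recent", "target": "email"},
    {"order": "Multi-condition", "reference": "email", "function": "Longest"},
    {"order": "Multi-target", "target": "phone"},
    {"order": "Sequential", "rule_name": "2_CommonName",
     "reference": "name", "function": "Most common", "target": "name"},
]
flow = build_flow(rule_table)
```

Here the first step ends up with two conditions and two target columns, and the second Sequential rule starts the second step.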

Rule Name: Type in the name of each rule you are creating. This column is only available to the Sequential rules as they define the steps of the survivor validation flow.

Reference column: Select the column you need to apply a given rule on. The available columns are those you have defined in the schema of this component. This column is not available to the Multi-target rules as they define only the Target column.

Function: Select the type of validation operation to be performed on a given Reference column. The available types include:

  • None: no validation operation is performed.

  • Most common: it validates the most frequent field value in each duplicates group.

  • Most recent or Most ancient: the former validates the latest date value and the latter the earliest date value in each duplicates group. The relevant reference column must be of the Date type.

  • Longest or Shortest: the former validates the longest field value and the latter the shortest in each duplicates group.

  • Largest or Smallest: the former validates the largest numerical value and the latter the smallest numerical value in a duplicates group.

  • Match regex: it validates the field when this field complies with the regular expression given in the Value column.

  • Expression: it validates the field when it complies with the expression that you enter in the Value column. The expression must be written in the Drools language.

  • Most Complete: it validates the field when the record it belongs to has the fewest empty fields.
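As a rough illustration of what these functions select within one duplicates group, here is a minimal sketch. It does not reproduce the rules the component generates; the helper names are hypothetical, and two behaviours are assumptions: the regular expression is required to match the whole value, and a field counts as empty when it is None or an empty string. The Expression function is left out because it depends on Drools syntax.

```python
import re
from collections import Counter
from datetime import date


def most_common(values):
    """Most frequent value in the group."""
    return Counter(values).most_common(1)[0][0]


def most_recent(dates):
    """Latest date in the group."""
    return max(dates)


def most_ancient(dates):
    """Earliest date in the group."""
    return min(dates)


def longest(values):
    return max(values, key=len)


def shortest(values):
    return min(values, key=len)


def largest(numbers):
    return max(numbers)


def smallest(numbers):
    return min(numbers)


def match_regex(values, pattern):
    """Values that comply with the pattern (assumption: whole-value match)."""
    return [v for v in values if re.fullmatch(pattern, v)]


def most_complete(records):
    """Record with the fewest empty fields (assumption: None or '' is empty)."""
    def empty_count(record):
        return sum(1 for v in record.values() if v is None or v == "")
    return min(records, key=empty_count)


# Sample values from one hypothetical duplicates group.
names = ["Jon Smith", "John Smith", "John Smith"]
dates = [date(2014, 3, 1), date(2015, 7, 9)]
records = [{"name": "John Smith", "city": ""},
           {"name": "John Smith", "city": "Paris"}]
```

For example, most_common(names) picks "John Smith" and most_complete(records) picks the record whose city is filled in.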

Value: enter the expression of interest corresponding to the Match regex or the Expression function you have selected in the Function column.

Target column: when a step is executed, it validates a record field value from a given Reference column and selects the corresponding value as the best from a given Target column. Select this Target column from the schema columns of this component.
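The relationship between the Reference column and the Target column can be sketched as below: the rule is evaluated on the reference values, and the surviving value is read from the target column of the record that wins. This is a simplified illustration only. The apply_step helper, the sample columns and the returned tie flag are hypothetical; the flag merely mirrors the situation in which more than one record meets the rule.

```python
from datetime import date


def apply_step(group, reference, target, pick):
    """Evaluate `pick` on the reference column, then read the target column
    of the winning record. Also report whether several records tie."""
    best = pick(record[reference] for record in group)
    winners = [record for record in group if record[reference] == best]
    return winners[0][target], len(winners) > 1


# Hypothetical duplicates group; "Most recent" is modelled with max().
group = [
    {"name": "J. Smith", "email": "js@old.example", "updated_at": date(2013, 5, 2)},
    {"name": "John Smith", "email": "john@new.example", "updated_at": date(2015, 1, 20)},
]
email, tie = apply_step(group, "updated_at", "email", max)
```

In this example the reference column is updated_at and the target column is email, so the e-mail address of the most recently updated record is kept.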

Ignore blanks: Select the check boxes which correspond to the names of the columns for which you want the blank value to be ignored.
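The effect of ignoring blanks can be illustrated with a short sketch. It assumes that a blank is None, an empty string or a whitespace-only string, which may differ from what the component treats as blank; the candidates helper and the sample data are hypothetical.

```python
def candidates(group, column, ignore_blanks):
    """Values of `column` that a rule may choose from."""
    values = [record[column] for record in group]
    if ignore_blanks:
        # Assumption: None, "" and whitespace-only strings count as blank.
        values = [v for v in values if v is not None and str(v).strip() != ""]
    return values


group = [{"city": ""}, {"city": "Paris"}, {"city": "Paris Cedex"}]

# With blanks kept, a "Shortest" rule would select the empty value.
shortest_with_blanks = min(candidates(group, "city", False), key=len)
# With blanks ignored, it selects a real value.
shortest_ignoring_blanks = min(candidates(group, "city", True), key=len)
```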

Advanced settings

tStatCatcher Statistics

Select this check box to collect log data at the Job and the component levels.

Global Variables

ERROR_MESSAGE: the error message generated by the component when an error occurs. This is an After variable and it returns a string. If the component has a Die on error check box, this variable functions only when that check box is cleared.

A Flow variable functions during the execution of a component while an After variable functions after the execution of the component.

To fill up a field or expression with a variable, press Ctrl + Space to access the variable list and choose the variable to use from it.

For further information about variables, see Talend Studio User Guide.

Usage

This component requires an input component and an output component.

As it needs grouped data to process, and requires a group identifier column and a group size column, this component works directly with components such as tMatchGroup.

It also requires that the input data are sorted by the group identifier and that the first row of a group contains the group size.
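The input layout described above can be sketched as a reader that takes the group size from the first row of each group and then consumes that many rows. This is an assumption-laden illustration, not the component's code: the iter_groups helper is hypothetical, and the column names GID and GRP_SIZE are assumed here to follow the output of tMatchGroup.

```python
def iter_groups(rows, gid_col="GID", size_col="GRP_SIZE"):
    """Yield one list of rows per group, assuming rows are sorted by group
    identifier and the first row of each group carries the group size."""
    rows = iter(rows)
    for first in rows:
        size = int(first[size_col])
        if size < 1:
            raise ValueError(f"First row of group {first[gid_col]!r} has no group size")
        group = [first]
        for _ in range(size - 1):
            row = next(rows, None)
            if row is None or row[gid_col] != first[gid_col]:
                raise ValueError(f"Group {first[gid_col]!r} is incomplete or not sorted")
            group.append(row)
        yield group


# Hypothetical input: two groups, sorted by GID, size on the first row only.
rows = [
    {"GID": "a", "GRP_SIZE": 2, "name": "John Smith"},
    {"GID": "a", "GRP_SIZE": 0, "name": "Jon Smith"},
    {"GID": "b", "GRP_SIZE": 1, "name": "Ann Lee"},
]
groups = list(iter_groups(rows))
```

If the rows were not sorted by group identifier, the reader would find a row from another group before the announced size is reached and raise an error, which shows why the sort order matters.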

When you export a Job using tRuleSurvivorship, you need to select the Export dependencies check box in order to export the generated survivor validation rules together with the Job. For further information about how to export a Job, see Talend Studio User Guide.