Handling merging tasks to deduplicate records - 6.4

Talend Data Stewardship Examples

author
Talend Documentation Team
EnrichVersion
6.4
EnrichProdName
Talend Big Data
Talend Big Data Platform
Talend Data Fabric
Talend Data Integration
Talend Data Management Platform
Talend Data Services Platform
Talend MDM Platform
Talend Real-Time Big Data Platform
EnrichPlatform
Talend Data Stewardship

Merging tasks combine several potential duplicate records into a single record, called the master record. Potential duplicate records can come from the same source (data deduplication) or from different sources (data reconciliation).

In a Merging campaign, you can only modify values in the master fields. Values in the source fields cannot be modified.

Merging data values and validating your modifications transitions the task to the second state defined in the workflow. The workflow defined when the campaign is created determines which states are available to which data stewards. However, a task cannot be validated, or even marked as ready, as long as it contains at least one invalid value. This guarantees that data that does not match the data model cannot leave Talend Data Stewardship.
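
The gating rule above can be pictured with a short Python sketch. It is an illustration only and does not reflect how Talend Data Stewardship is implemented: the `FIELD_RULES` model, the field names, and the `can_mark_ready` helper are all hypothetical.

```python
import re

# Hypothetical data model: field name -> predicate a master value must satisfy.
# Real campaigns define their own data model in Talend Data Stewardship.
FIELD_RULES = {
    "First_Name": lambda v: isinstance(v, str) and v.strip() != "",
    "Email": lambda v: isinstance(v, str)
    and re.fullmatch(r"[^@\s]+@[^@\s]+\.[^@\s]+", v) is not None,
}


def invalid_fields(master_record):
    """Return the names of master fields whose values violate the model."""
    return [
        name
        for name, is_valid in FIELD_RULES.items()
        if not is_valid(master_record.get(name))
    ]


def can_mark_ready(master_record):
    """A task may be marked as ready only when no master value is invalid."""
    return not invalid_fields(master_record)
```

In this sketch, a master record with a malformed `Email` value is reported by `invalid_fields` and blocks `can_mark_ready`, which mirrors the rule that a single invalid value prevents validation.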

Before you begin

Procedure

  1. On the TASKS page, click the campaign name, CRM Data Deduplication in this example, to open a list of the tasks assigned to you.
    Here, the duplicate customer records come from the same source (the enterprise CRM). Talend Data Stewardship initially determines which attributes of the matched records to use for the master record, according to the survivorship rules defined when the campaign was created. However, you may need to change the survivorship rule for individual attributes, or enter completely new values, to obtain the most accurate and reliable master records.
  2. Use the quality bar on top of each of the columns to filter the data on which you want to work in the CHARTS or PATTERN views in the right-hand panel.
  3. Click the down arrow in the top-left corner to expand all tasks in the list, or click the down arrow of a specific task to expand only that task.
  4. Set survivorship rules to select attributes from customer records and use them to build the master records. Several approaches are possible.
    • Set a survivorship rule manually for one or several attributes of a record: point to an attribute in the master record of a task and, from the icons that are displayed, select the survivorship rule you want to apply.

      • Icon for the first valid value: selects the first valid attribute value among the duplicates. "First" is defined by the order of the records when the task is created.

      • Icon for the most common value: selects the most common attribute value among the duplicates.

      • Icon for the most recent value: selects the most recent attribute value among the duplicates.

      • Icon for the most trusted value: selects the most trusted attribute value among the duplicates coming from different sources.

        Icons are grayed out when a rule is not applicable to the selected attribute. In this example, the icon for the most trusted value is unavailable because the customer data comes from a single source: CRM.

    • Set a survivorship rule manually for one attribute of multiple records.

      1. Click a column heading, First_Name for example, and in the right-hand panel browse to the Survivorship section.
      2. Click the button and, from the Survivorship rule list, select Most common as the rule to apply to the First_Name attribute in all the customer records.
      3. Click Submit to select the most common name values and add them to the master records of the tasks.
    • Select the value of a given source attribute to be the value for the master record: point to a source attribute and click the up arrow to set the selected value in the master record.
  5. Repeat the previous step to merge records and create master records for all the tasks assigned to you.
    If some values in a given column need to be fixed, you can transform them in bulk by using the functions listed in the right-hand panel.

    For further information, see Transforming data in a column.

  6. Click the lock icon next to the data record you modified to mark the task as ready to be validated.
    When the lock icon has a red background, you must first correct the invalid value in the task before you can mark it as ready to be validated.

    The record is marked with a green background and the lock icon automatically moves to the next record. You can still modify records that are ready to be validated, but doing so puts the task back in its initial state, with a dark-grey background. You then need to click the lock icon again to mark the task as ready for validation.

  7. Click VALIDATE CHOICES in the top-right corner of the page to validate the modifications you have made to the records.
    Master records are created, and the validated records are removed from the list and transitioned to the TO VALIDATE step in the workflow, where they need to be approved by another data steward. In this example, they are moved to the list of the data steward who is granted the ACCOUNT MANAGER role.
  8. The data stewards with the ACCOUNT MANAGER role access the tasks to be validated and decide whether to accept or reject the choices made on the tasks.
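
The four survivorship rules described in step 4 can be sketched in Python to show how each one picks a master value from the duplicates. This is a simplified model, not the product's implementation: the sample records, the `modified` and `source` metadata, the `trust_order` parameter, and the assumption that a "valid" value simply means a non-empty one are all hypothetical.

```python
from collections import Counter
from datetime import date

# Hypothetical source records for one task, in the order they were loaded.
# "modified" and "source" are assumed metadata, not fields from this guide.
duplicates = [
    {"First_Name": "Jon", "modified": date(2016, 1, 5), "source": "CRM"},
    {"First_Name": "John", "modified": date(2017, 3, 9), "source": "CRM"},
    {"First_Name": "John", "modified": date(2015, 7, 1), "source": "CRM"},
]


def _valid(records, field):
    # Simplification: a value counts as valid when it is present and non-empty.
    return [r for r in records if r.get(field)]


def first_valid(records, field):
    """First valid value, following the order of the records in the task."""
    valid = _valid(records, field)
    return valid[0][field] if valid else None


def most_common(records, field):
    """Value that occurs most often among the duplicates."""
    values = [r[field] for r in _valid(records, field)]
    return Counter(values).most_common(1)[0][0] if values else None


def most_recent(records, field):
    """Value taken from the most recently modified record."""
    valid = _valid(records, field)
    return max(valid, key=lambda r: r["modified"])[field] if valid else None


def most_trusted(records, field, trust_order):
    """Value from the source that appears earliest in trust_order."""
    valid = [r for r in _valid(records, field) if r["source"] in trust_order]
    if not valid:
        return None
    return min(valid, key=lambda r: trust_order.index(r["source"]))[field]
```

With these sample records, `first_valid` keeps "Jon" while `most_common` and `most_recent` both keep "John". Because every record comes from the same source, `most_trusted` cannot discriminate between them, which is consistent with that rule being unavailable in the single-source CRM example.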

Results

Approved tasks are transitioned to the Resolved state in the workflow. Rejected tasks are transitioned back to the initial step in the workflow and marked as new.
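
The workflow in this example can be summarized as a small state machine. The sketch below is only a reading aid: the state names are modeled on the labels used in this procedure ("new", TO VALIDATE, Resolved), the action names are hypothetical, and real campaigns may define different workflows.

```python
# Hypothetical state and action names modeled on the example workflow.
TRANSITIONS = {
    ("New", "validate"): "To validate",    # steward validates merge choices
    ("To validate", "accept"): "Resolved",  # approver accepts the choices
    ("To validate", "reject"): "New",       # approver rejects the choices
}


def next_state(state, action):
    """Return the state a task moves to, or raise if the move is not allowed."""
    try:
        return TRANSITIONS[(state, action)]
    except KeyError:
        raise ValueError(
            f"action {action!r} is not allowed in state {state!r}"
        ) from None
```

For instance, `next_state("To validate", "reject")` returns `"New"`, matching the rule that rejected tasks go back to the initial step.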