Handling merging tasks to deduplicate records - Cloud

Talend Cloud Data Stewardship Examples

Version: Cloud
Language: English
Product: Talend Cloud
Module: Talend Data Stewardship
Content:
Data Governance > Assigning tasks
Data Governance > Managing campaigns
Data Governance > Managing data models
Data Quality and Preparation > Handling tasks
Last publication date: 2024-02-09

Merging tasks aim to merge several potential duplicates into a single record, the master record. Potential duplicates can come from the same source (data deduplication) or from different sources (data reconciliation).

In a Merging campaign, you can modify values only in the master fields; values in the source fields cannot be modified.

Merging data values and validating your modifications transitions the task to the second state defined in the workflow. The workflow defined when the campaign was created determines which states are available to which data stewards. However, a task cannot be validated, or even marked as ready, as long as it contains at least one invalid value.

About this task

In this example, the duplicate customer records come from the same source (enterprise CRM). Talend Cloud Data Stewardship initially determines which attributes of the matched records to use to create the master record, according to the survivorship rules defined when the campaign was created. However, you may need to manually change the survivorship rule for individual record attributes, or enter completely new values, to obtain the most accurate and reliable master records.

Procedure

  1. On the Tasks page, click the campaign name, CRM Data Deduplication in this example, to open a list of the tasks assigned to you.
    Overview of the CRM data deduplication campaign.
  2. Use the quality bar at the top of each column to filter the data you want to work on in the Chart or Pattern views in the right-hand panel.
  3. Click the down arrow on the top-left corner to expand all tasks in the list, or click the down arrow of a specific task to expand it.
  4. Set survivorship rules to select attributes from customer records and use them to build the master records. Several approaches are possible.
    • Set a survivorship rule manually for one attribute of multiple records.

      1. Click a column heading, First_Name for example, and in the right-hand panel browse to the Survivorship section.
      2. Click Apply survivorship rule and from the Rule list, select Most common as the survivorship rule you want to apply to the name attribute in all the customer records.

        If you defined the sources of the duplicate data in the Merging campaign, the source names are included in the list and can be selected as the survivorship rule to apply to the column values.

      3. To apply the rule to all name values, including null ones, clear the Avoid null values check box; otherwise, leave it selected.
      4. Click Submit to select the most common name values and add them to the master records of the tasks.
    • Set a survivorship rule manually for all attributes of one or multiple golden records.

      1. Select the tasks for which to set the rule, and under Task in the right-hand panel click Apply survivorship rule.
      2. From the Selection list, click Selected tasks.

        You can apply the rule to all tasks or only to the filtered tasks if you have defined a filter on the list.

      3. From the Rule list, select the rule to apply to the group of selected tasks, Most trusted for example.
      4. To apply the rule to all values, including null ones, clear the Avoid null values check box; otherwise, leave it selected.
      5. Click Submit to add the name values with the highest score to the selected golden records.
    • Set a survivorship rule manually for one or several attributes of a record: expand the task, hover over an attribute in the master record, and select the survivorship rule you want to apply from the icons that are displayed.
      Location of the icons to set a survivorship rule manually for one or several attributes.
      • Use first valid attribute icon: selects the first valid attribute value among the duplicates. "First" is defined by the order of the records when the task is created.

      • Use most common icon: selects the most common attribute value among the duplicates.

      • Use most recent icon: selects the most recent attribute value among the duplicates.

      • Use most trusted icon: selects the most trusted attribute value among the duplicates coming from different sources.

        Icons are grayed out when the rules are not applicable to the selected attribute. In this example, the Use most trusted icon is not available because the customer data comes from a single source: CRM.

    • Select the value of a given source attribute to be the value for the master record: point to a source attribute and click the up arrow to set the selected value in the master record.
  5. Optionally, click the email link in the Email column to open a new window and send an email to the customer about any necessary validation of the information in the customer data record.
    Note: Email addresses will display as hyperlinks only if you set the semantic type for the Email column to MailTo URL while defining the data model for the campaign.
  6. Repeat the above steps to merge records and create master records for all the tasks assigned to you.
    If some values in a column need to be fixed, you can transform them in bulk by using the functions listed in the right-hand panel.
  7. Click the Mark the task as ready for validation icon next to the data record you modified to mark the task as ready to be validated.
    When the lock icon has a red background, you must first correct the invalid value in the task before you can mark it as ready to be validated.

    The record is marked with a green background and the lock icon automatically moves to the next record. You can modify records that are ready to be validated, but this puts the task back in its initial state, with a dark gray background. You then need to click the lock icon again to mark the task as ready for validation.

  8. Click Validate in the top-right corner of the page to validate the modifications you have made to the records.
    Master records are created, and the validated records are removed from the list and transitioned to the next step in the workflow, where they need to be approved by another data steward. In this example, they are moved to the list of the data steward who has been granted the Account manager role.
  9. The data stewards with the Account manager role access the tasks to be validated and decide whether to accept or reject the choices made on the tasks.
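
The survivorship rules used in step 4 can be illustrated with a minimal Python sketch. This is not how Talend Cloud Data Stewardship implements the rules; it only mirrors the behavior described above. The record layout (source, modified, First_Name), the function names, and the trust_order list are hypothetical, and "valid" is simplified here to "not null".

```python
from collections import Counter
from datetime import date

# Hypothetical duplicates for one task, in the order they were loaded.
duplicates = [
    {"source": "CRM", "modified": date(2023, 1, 5), "First_Name": "Jon"},
    {"source": "CRM", "modified": date(2023, 6, 1), "First_Name": "John"},
    {"source": "CRM", "modified": date(2022, 3, 9), "First_Name": "John"},
    {"source": "CRM", "modified": date(2023, 9, 2), "First_Name": None},
]

def _candidates(records, attr, avoid_nulls=True):
    """Return all records, or only those with a non-null value for attr."""
    return [r for r in records if not (avoid_nulls and r[attr] is None)]

def first_valid(records, attr):
    # "First" follows the record order; validity is assumed to mean non-null.
    return next((r[attr] for r in records if r[attr] is not None), None)

def most_common(records, attr, avoid_nulls=True):
    values = [r[attr] for r in _candidates(records, attr, avoid_nulls)]
    return Counter(values).most_common(1)[0][0] if values else None

def most_recent(records, attr, avoid_nulls=True):
    pool = _candidates(records, attr, avoid_nulls)
    return max(pool, key=lambda r: r["modified"])[attr] if pool else None

def most_trusted(records, attr, trust_order, avoid_nulls=True):
    # trust_order is an assumed ranking of source names, most trusted first.
    pool = _candidates(records, attr, avoid_nulls)
    if not pool:
        return None
    return min(pool, key=lambda r: trust_order.index(r["source"]))[attr]

# Build the master value for one attribute with the "Most common" rule.
master = {"First_Name": most_common(duplicates, "First_Name")}
print(master)  # {'First_Name': 'John'}
```

In this sketch, passing avoid_nulls=False plays the role of clearing the Avoid null values check box: most_recent then returns the null value of the latest record instead of "John". With a single source, most_trusted has nothing to rank, which matches the grayed-out icon in the example.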

Results

Approved tasks are transitioned to the Resolved state in the workflow. Rejected tasks are transitioned back to the initial step in the workflow and marked as new.
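
The task lifecycle described in this procedure can be summarized as a small state machine. The following Python sketch is illustrative only: the Task class, its method names, and the state names other than new and Resolved are assumptions, and the actual states depend on the workflow defined for the campaign.

```python
from dataclasses import dataclass, field
from enum import Enum

class State(Enum):
    NEW = "new"
    READY = "ready for validation"
    PENDING_APPROVAL = "pending approval"  # hypothetical name for the second state
    RESOLVED = "resolved"

@dataclass
class Task:
    master: dict
    invalid_fields: set = field(default_factory=set)
    state: State = State.NEW

    def edit(self, attr, value, valid=True):
        """Change a master value; editing a ready task resets it to new."""
        if self.state not in (State.NEW, State.READY):
            raise RuntimeError("task is no longer editable by this steward")
        self.master[attr] = value
        if valid:
            self.invalid_fields.discard(attr)
        else:
            self.invalid_fields.add(attr)
        self.state = State.NEW

    def mark_ready(self):
        """Refuse to mark the task as ready while any value is invalid."""
        if self.invalid_fields:
            raise ValueError(f"invalid values: {sorted(self.invalid_fields)}")
        self.state = State.READY

    def validate(self):
        if self.state is not State.READY:
            raise RuntimeError("only ready tasks can be validated")
        self.state = State.PENDING_APPROVAL

    def review(self, approved):
        """Approval resolves the task; rejection sends it back as new."""
        if self.state is not State.PENDING_APPROVAL:
            raise RuntimeError("task is not awaiting approval")
        self.state = State.RESOLVED if approved else State.NEW

task = Task(master={"First_Name": "John"}, invalid_fields={"Email"})
task.edit("Email", "john@example.com")  # fix the invalid value first
task.mark_ready()
task.validate()
task.review(approved=True)
print(task.state.value)  # resolved
```

Modeling invalid values as a set that blocks mark_ready keeps the rule from the introduction explicit: a task cannot be marked as ready, and therefore cannot be validated, until every invalid value has been corrected.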