Adding a Grouping campaign to identify duplicate pairs - 6.5

Talend Data Stewardship Examples

Talend Documentation Team
Talend Big Data
Talend Big Data Platform
Talend Data Fabric
Talend Data Integration
Talend Data Management Platform
Talend Data Services Platform
Talend MDM Platform
Talend Real-Time Big Data Platform

A Grouping campaign defines a list of possible arbitration choices for pairs or groups of records. The outcome of a grouping task is the choice that a data steward makes for that pair or group.

A typical use case for this campaign is to label suspect duplicate pairs when matching very high volumes of data with machine learning on Spark. Another use case is to identify groups of potential duplicates before sending them to a Merging campaign, where data stewards can merge the duplicates into master records.

The Grouping campaign in this example is part of a machine learning process on Spark. It identifies duplicates in sample data extracted from a long list of early childhood education centers in Chicago, compiled from ten different sources. This step in data matching comes after the tMatchPairing component has computed the suspect duplicates in the list of agencies.

Once a campaign owner has created the campaign, data stewards review the sample data and decide whether each pair of records is a duplicate.
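The arbitration described above can be pictured with a small sketch. This is not the Talend Data Stewardship API; it is a minimal Python model, assuming a hypothetical `GroupingTask` class and a made-up list of choices (`duplicate`, `not_duplicate`), meant only to show that a grouping task holds a pair of records and that its outcome is the single choice a steward picks from the list the campaign defines.

```python
from dataclasses import dataclass
from typing import Optional

# Hypothetical arbitration choices; a real campaign defines its own list.
CHOICES = ("duplicate", "not_duplicate")


@dataclass
class GroupingTask:
    """A pair of suspect records awaiting a steward's decision (illustrative only)."""
    records: tuple                # the two records being compared
    choice: Optional[str] = None  # stays None until a steward arbitrates

    def arbitrate(self, choice: str) -> None:
        # Only choices defined by the campaign are accepted.
        if choice not in CHOICES:
            raise ValueError(f"unknown choice: {choice!r}")
        self.choice = choice


def labeled(tasks):
    """Return only the tasks that a steward has already resolved."""
    return [t for t in tasks if t.choice is not None]


# Usage with made-up placeholder records
task = GroupingTask(records=({"name": "Center A"}, {"name": "Center A."}))
task.arbitrate("duplicate")
```

In this sketch, the resolved tasks returned by `labeled` stand in for the labeled pairs that a later matching step would consume.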

Before you begin, make sure that:

  • An administrator has created Talend Data Stewardship users and assigned them roles in Talend Administration Center. For further information, see Creating Data Stewardship users.

  • You have been assigned a campaign owner role in Talend Administration Center.

  • You have defined a data model for the campaign in Talend Data Stewardship.

  • You have accessed Talend Data Stewardship as a campaign owner.