Removing duplicate values

Talend Data Fabric Getting Started Guide

Product: Talend Data Fabric
Version: 6.1
Platform: Talend Studio
Tasks: Data Governance; Design and Development; Data Quality and Preparation

After you analyze the email and postal columns using simple statistics indicators, the analysis results show the number of duplicate records in each column. From these results, you can generate a ready-to-use Job that removes duplicate values from the selected column.
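To make the duplicate count concrete, here is a minimal Python sketch of how such indicators could be computed for one column. It is not the Studio's implementation: the function name `simple_statistics`, the indicator keys, the sample addresses, and the exact definitions (a value counts as a duplicate when it appears more than once) are assumptions made for illustration.

```python
from collections import Counter


def simple_statistics(values):
    """Compute illustrative column indicators (assumed definitions)."""
    counts = Counter(values)  # occurrences of each distinct value
    return {
        "row_count": len(values),
        "distinct_count": len(counts),
        # values that appear exactly once
        "unique_count": sum(1 for n in counts.values() if n == 1),
        # values that appear more than once
        "duplicate_count": sum(1 for n in counts.values() if n > 1),
    }


# Hypothetical sample data standing in for the email column.
emails = [
    "a@example.com",
    "b@example.com",
    "a@example.com",
    "c@example.com",
    "b@example.com",
    "a@example.com",
]
stats = simple_statistics(emails)
```

Under these assumed definitions, the unique count and the duplicate count add up to the distinct count.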

To remove duplicate values from the email column:

  1. In the Profiling perspective, click Analysis Results at the bottom of the editor.

  2. In the Simple Statistics results of the email column, right-click the duplicate count bar in the chart and select Remove duplicates.

    The Integration perspective opens in the studio and shows the generated Job with the corresponding components. For more information on these components, see the Talend Components Reference Guide.

    The database input component and the tUniqueRow component are already configured according to your connection and the columns you are analyzing.

  3. Save the Job and press F6 to execute it.

    Duplicate values are written to the specified output database and file.

    You can follow the same procedure to remove duplicates from the postal column.

    For further information on using the Profiling perspective to identify and remove corrupt, incomplete, or inaccurate data, see the chapter about data cleansing in the Talend Studio User Guide.
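The generated Job is built in the studio, not written by hand, but the kind of deduplication it performs on a key column can be sketched in a few lines of Python. This is an illustration only, not the code that Talend generates: the function name `split_unique_rows`, the row layout, and the rule that the first occurrence of a key is kept while later occurrences are set aside are all assumptions.

```python
def split_unique_rows(rows, key_column):
    """Split rows into first occurrences and repeats of key_column.

    Assumed behavior for illustration: the first row seen for each key
    is kept as unique; any later row with the same key is a duplicate.
    """
    seen = set()
    uniques, duplicates = [], []
    for row in rows:
        key = row[key_column]
        if key in seen:
            duplicates.append(row)  # key already seen: set the row aside
        else:
            seen.add(key)
            uniques.append(row)  # first occurrence: keep the row
    return uniques, duplicates


# Hypothetical rows standing in for the analyzed table.
rows = [
    {"id": 1, "email": "a@example.com"},
    {"id": 2, "email": "b@example.com"},
    {"id": 3, "email": "a@example.com"},
]
uniques, duplicates = split_unique_rows(rows, "email")
```

In this sketch, rows 1 and 2 end up in `uniques` and row 3 ends up in `duplicates`. Passing `"postal"` as the key column would apply the same split to the postal column, provided the rows carry that field.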