Generating a Job to Identify duplicate values in an analyzed column - Cloud - 8.0

Talend Studio User Guide

Version
Cloud
8.0
Language
English
Product
Talend Big Data
Talend Big Data Platform
Talend Cloud
Talend Data Fabric
Talend Data Integration
Talend Data Management Platform
Talend Data Services Platform
Talend ESB
Talend MDM Platform
Talend Real-Time Big Data Platform
Module
Talend Studio
Content
Design and Development
Last publication date
2024-02-29
Available in...

Big Data Platform

Cloud API Services Platform

Cloud Big Data Platform

Cloud Data Fabric

Cloud Data Management Platform

Data Fabric

Data Management Platform

Data Services Platform

MDM Platform

Real-Time Big Data Platform

When you use the Profiling perspective to analyze a column in a database table and provide simple statistics on the number of distinct, unique, and duplicate values, you can later generate a ready-to-use Job that removes duplicate values from the specified column.

Before you begin

If you want to connect to an Oracle database, install an Oracle database client.

Procedure

  1. From the Profiling perspective:
    1. Create a column analysis. For detailed steps on how to create a column analysis, see Defining the columns to be analyzed and setting indicators.
    2. Set the simple statistics indicators on the specified column.
  2. If the analysis is connected to a MySQL 8 database, in the command line of the MySQL database server, define SET GLOBAL local_infile = 'ON';.
  3. Execute the analysis to open the results in the Analysis Results view.
  4. Click the name of the analyzed column in which you want to separate unique and duplicate values, and then click Simple Statistics to expand the simple statistics section.
    Contextual menu of a label from the Simple Statistics section.
  5. In the Label list, right-click Distinct Count, Unique Count or Duplicate Count and then select Identify duplicates from the contextual menu.

    The Integration perspective opens in Talend Studio showing the generated Job with the corresponding components.

    Job using the tMysqlInput, tUniqRow, tMysqlOutputBulkExec, and tFileOutputDelimited components.

    The database input component and the tUniqRow components are already configured according to your connection and the column you are analyzing.

    The two output components are file components in this ready-to-use Job, but you can replace them with database output components to write the duplicate and distinct values directly in the desired database.

  6. Configure the two output components:
    1. Double-click the first output component with the Uniques connection and in the Local Filename field, set the path to the output file that will hold the distinct values.
    2. Double-click the second output component with the Duplicates connection and in the File Name field, set the path to the output file that will hold the duplicate values.
    3. Save the Job and press F6 to execute it.
      Distinct and duplicate values are written to the specified output files.
    4. If required, right-click tFileOutputdelimited in the generated Job and select Data Viewer.
      A preview of the standardized data is open in Talend Studio.
      Overview of a preview of analyzed data.