Available in...Big Data Platform
Cloud API Services Platform
Cloud Big Data Platform
Cloud Data Fabric
Cloud Data Management Platform
Data Fabric
Data Management Platform
Data Services Platform
MDM Platform
Real-Time Big Data Platform
When you use the
Profiling
perspective to analyze a column in a database table and provide simple
statistics on the number of distinct, unique, and duplicate values, you can later generate
a ready-to-use Job that removes duplicate values from the specified column.
Before you begin
If you want to connect to an Oracle database, install an Oracle database
client.
Procedure
-
From the
Profiling
perspective:
-
Create a column analysis. For detailed steps on how to create a column
analysis, see Defining the columns to be analyzed and setting indicators.
-
Set the simple statistics indicators on the specified column.
-
If the analysis is connected to a MySQL 8 database, in the command line of the
MySQL database server, define SET GLOBAL local_infile =
'ON';.
-
Execute the analysis to open the results in the Analysis
Results view.
-
Click the name of the analyzed column in which you want to separate unique and
duplicate values, and then click Simple Statistics to
expand the simple statistics section.
-
In the Label list, right-click Distinct
Count, Unique Count or
Duplicate Count and then select Identify
duplicates from the contextual menu.
The
Integration
perspective opens in Talend Studio showing
the generated Job with the corresponding components.
The database input component and the tUniqRow components
are already configured according to your connection and the column you are
analyzing.
The two output components are file components in this ready-to-use Job, but
you can replace them with database output components to write the duplicate
and distinct values directly in the desired database.
-
Configure the two output components:
-
Double-click the first output component with the
Uniques connection and in the
Local Filename field, set the path to the
output file that will hold the distinct values.
-
Double-click the second output component with the
Duplicates connection and in the
File Name field, set the path to the output
file that will hold the duplicate values.
-
Save the Job and press F6 to execute it.
Distinct and duplicate values are written to the specified output
files.
-
If required, right-click tFileOutputdelimited in
the generated Job and select Data Viewer.
A preview of the standardized data is open in Talend Studio.