When you use the Profiling perspective of Talend Studio to analyze a column in a database table and provide simple statistics on the number of distinct, unique and duplicate values, you can later generate a ready-to-use Job that removes duplicate values from the specified column.
For more information about different types of indicators in the simple statistics group, see Simple statistics.
- From the Profiling perspective of Talend Studio, create a column analysis and set the simple statistics indicators on the specified column. For detailed steps on how to create a column analysis, see Defining the columns to be analyzed and setting indicators.
- Execute the analysis to open the results in the Analysis Results view.
Click the name of the analyzed column in which you want to separate unique and
duplicate values, and then click Simple Statistics to
expand the simple statistics section.
In the Label list, right-click Distinct
Count, Unique Count or
Duplicate Count and then select Identify
duplicates from the contextual menu.
The Integration perspective opens in the Studio showing the generated Job with the corresponding components. For more information on such components, see Talend Components Reference Guide.
The database input component and the tUniqueRow components are already configured according to your connection and the column you are analyzing.
The two output components are file components in this ready-to-use Job, but you can replace them with database output components to write the duplicate and distinct values directly in the desired database.
Configure the two output components:
- If you use a SQLite database, an error occurs during the execution. As SQLite locks the database while writing, replace the tDBOutput component by a tFileOutputDelimited component.
- Double-click the first output component with the Uniques connection and in the Local Filename field, set the path to the output file that will hold the distinct values.
- Double-click the second output component with the Duplicates connection and in the File Name field, set the path to the output file that will hold the duplicate values.
Save the Job and press F6 to execute it.
Distinct and duplicate values are written to the specified output files.
If required, right-click tFileOutputdelimited in
the generated Job and select Data Viewer.
A preview of the standardized data is open in Talend Studio.