The profiling results of the column analysis show that there are some duplicate records in the Email and Phone columns. For details, see Showing analysis results.
From the analysis results, you can generate out-of-the-box Jobs that separate unique records from duplicate records in the selected columns. By default, such Jobs output all the duplicates to a reject delimited file and write the unique values to the database used in the analysis.
You can follow the same procedure to remove duplicates from the Email or Phone columns.
Before you begin
You have opened the Profiling perspective in Talend Studio.
You have created and executed the column analysis. For further information, see Identifying anomalies in data.
- Open the column analysis in the Profiling perspective and click Analysis Results at the bottom of the editor.
- In the Simple Statistics results of the Email or Phone column, right-click Duplicate Count and select Identify
This example uses the results of the simple statistics computed on the Email column.
The Integration perspective opens showing the generated Job, and the Job is listed in the Repository tree view.
The tMysqlInput, tUniqueRow and tMysqlOutputBulkExec components are automatically configured according to your connection and the columns you are analyzing. tMysqlOutputBulkExec writes unique records to a new table in MySQL, and tFileOutputDelimited writes duplicate records to a delimited output file.
- Press F6 to execute the Job.
Duplicate values are written to the output file and unique records are written to a new table in the gettingstarted database in MySQL.
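To make the split performed by the generated Job easier to picture, here is a minimal Java sketch of the idea behind it: the first row seen for a given key goes to the unique flow, and any later row with the same key goes to the duplicate flow. This is not the code Talend Studio generates. The class name UniqueRowSketch, the splitByKey method, the sample rows and the exact-match (case-sensitive) comparison are all assumptions made for illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical illustration only; not the code generated by Talend Studio.
public class UniqueRowSketch {

    /** Holds the two output flows: first occurrences and repeated rows. */
    public static final class Split {
        public final List<String[]> uniques = new ArrayList<>();
        public final List<String[]> duplicates = new ArrayList<>();
    }

    /**
     * Splits rows on the column at keyIndex.
     * Assumption: keys are compared as exact, case-sensitive strings.
     */
    public static Split splitByKey(List<String[]> rows, int keyIndex) {
        Split result = new Split();
        Set<String> seen = new HashSet<>();
        for (String[] row : rows) {
            // Set.add returns false when the key is already in the set.
            if (seen.add(row[keyIndex])) {
                result.uniques.add(row);      // would go to the database table
            } else {
                result.duplicates.add(row);   // would go to the reject file
            }
        }
        return result;
    }

    public static void main(String[] args) {
        // Made-up sample rows: id, email.
        List<String[]> rows = Arrays.asList(
            new String[] {"1", "ann@example.com"},
            new String[] {"2", "bob@example.com"},
            new String[] {"3", "ann@example.com"}
        );
        Split split = splitByKey(rows, 1);
        System.out.println("unique rows: " + split.uniques.size());
        System.out.println("duplicate rows: " + split.duplicates.size());
    }
}
```

In this sketch the row with id 3 lands in the duplicate list because its email was already seen on row 1, which mirrors how the Job routes repeated Email values to the output file while keeping one record per value for the new table.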