Cleansing customer contact information - 6.2

Talend MDM Platform Getting Started Guide

EnrichVersion
6.2
EnrichProdName
Talend MDM Platform
task
Administration and Monitoring
Data Governance
Data Quality and Preparation
Installation and Upgrade
EnrichPlatform
Talend Administration Center
Talend CommandLine
Talend DQ Portal
Talend ESB
Talend Installer
Talend MDM Web UI
Talend Runtime
Talend Studio

After profiling customer data and identifying its problems, some actions should be taken on data to cleanse it. You may start by generating out-of-the-box Talend Jobs. These Jobs remove duplicates from the columns you analyze. The Jobs also remove the values that do not match the patterns used in the analyses.

This helps you identify the issues you have in the address data and what you need to resolve.

Removing duplicate values

The profiling results of the column analysis show that there are some duplicate records in the email and phone columns. Check Showing analysis results for detail.

From the analysis results, you can generate out-of-box Jobs that separate unique from duplicate records in the selected columns. Such Jobs output all the duplicates in a reject delimited file by default, and writes the unique values in the database used in the analysis.

Prerequisites:

  • You have opened the Profiling perspective in the Studio.

  • You have created and executed the column analysis. For further information, see Identifying anomalies in data.

You can follow the same procedure to remove duplicates from the email or phone columns.

  1. Open the column analysis in the Profiling perspective and click Analysis Results at the bottom of the editor.

  2. In the Simple Statistics results of the email or phone column, right-click Duplicate Count and select Identify duplicates.

    This example uses the outcome of the simple statistics used on the email column.

    The Integration perspective opens showing the generated Job, and the Job is listed in the Repository tree view.

    The tMysqlInput, tUniqueRow and tMysqlOutputBulkExec components are automatically configured according to your connection and the columns you are analyzing. tMysqlOutputBulkExec writes unique records to a new table in MySQL and tFileOutputDelimited writes duplicate records in an output delimited file.

    For more information on these components, see Talend Components Reference Guide.

  3. Press F6 to execute the Job.

    Duplicate values are written to the output file and unique records are written to a new table in the gettingstarted database in MySQL.

Removing non-matching values

The results of the patterns used on the email and phone columns show that some records do not respect the standard email and phone formats. Check Showing analysis results for detail.

From the analysis results, you can generate out-of-box Jobs to recuperate the non-matching rows from the columns.

Prerequisites:

  • You have opened the Profiling perspective in the Studio.

  • You have created and executed the column analysis. For further information, see Identifying anomalies in data.

You can follow the same procedure to remove non-matching values from the email or phone columns.

  1. Open the column analysis in the Profiling perspective and click Analysis Results at the bottom of the editor.

  2. In the Pattern Matching tables of the email or phone column, right-click the results and select Generate Job.

    This example uses the results of the US Phone numbers pattern used on the phone column.

  3. In the wizard that opens, click Finish to confirm the creation of the Job.

    The Integration perspective opens showing the generated Job, and the Job is listed in the Repository tree view.

    This Job uses the Extract Transform Load process to write in two separate output files the phone rows that match and do not match the pattern.

    The tMysqlInput is automatically configured according to your connection and tPatternCheck is automatically configured according the column you analyze.

    For more information on these components, see Talend Components Reference Guide.

  4. Double-click each of the output component and change the default name or path of the output files, if needed.

  5. Press F6 to execute the Job.

    Matching and non-matching phone numbers are written to two separate output files.

  6. Right-click each of the tFileOutputDelimited components and select Data Viewer to open a view on the data which matches and that which does not match the phone pattern.

    You can then design a Job, for example, to standardize the phone numbers which match the pattern and give them the correct international format by using the tStandardizePhoneNumber component.