Detecting incorrect phone numbers using patterns - Cloud

Talend Cloud Data Preparation User Guide

author
Talend Documentation Team
EnrichVersion
Cloud
EnrichProdName
Talend Cloud
task
Data Quality and Preparation > Cleansing data
EnrichPlatform
Talend Data Preparation

The Pattern tab of the profiling area helps you detect data quality issues by making low-frequency patterns in the data easy to spot.

This example illustrates a use case where pattern analysis helps you fix data. The following dataset contains phone numbers from customers all around the world, in many different formats. As shown by the quality bar, some of those phone numbers are considered invalid. Using pattern analysis, you will find the nature of the error within the column.

Procedure

  1. Click the header of the phone column to select its content.
  2. In the profiling area of the dataset, select the Pattern tab.
    The different patterns used in this column are displayed in the form of a chart. By default, the chart shows the distribution of the data using word-based patterns. When there are more than 15 different values or patterns to display in the data profiling area, you can browse through all of them with the pagination system.

    You will notice that among all the numbers, which should contain only [number] patterns, an anomaly stands out: one bar at the bottom of the chart shows that a record contains a [word].

  3. Click the bar corresponding to the lowest-frequency pattern in this dataset.
    This applies a filter that isolates the row containing the error. The preparation now displays only the row with the Jeffords(323) 254-9541 value, which corresponds to the [word]([number]) [number]-[number] format.

    You can see that part of the full name from the previous column has been mixed into the phone number, probably through human error such as a bad copy and paste.

  4. Double-click the cell to edit it and fix the value.
  5. In the filter bar, click the cross in the filter or click the garbage bin icon to clear the filter and display the whole dataset again.

Results

You have identified and isolated a data quality problem by looking at the pattern distribution of your phone numbers.
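
To make the idea behind the procedure more concrete, the following Python sketch approximates word-based pattern analysis outside of the product. It is an illustration under stated assumptions, not the Talend implementation: the function names and the sample list are hypothetical, only the Jeffords(323) 254-9541 value comes from the example above, and the pattern rule is simplified to two cases, where each run of letters becomes [word] and each run of digits becomes [number].

```python
import re
from collections import Counter

# Hypothetical sample column. Only the last value comes from the example
# above; the other numbers are made up for illustration.
PHONES = [
    "(323) 254-9541",
    "(415) 555-0132",
    "(212) 555-0187",
    "Jeffords(323) 254-9541",
]

# A token is a run of ASCII letters or a run of ASCII digits.
_TOKEN = re.compile(r"[A-Za-z]+|[0-9]+")


def word_pattern(value: str) -> str:
    """Simplified word-based pattern: letter runs become [word],
    digit runs become [number], every other character is kept as is."""
    return _TOKEN.sub(
        lambda m: "[number]" if m.group()[0].isdigit() else "[word]",
        value,
    )


def pattern_counts(values):
    """Count how many values share each pattern (the data behind the chart)."""
    return Counter(word_pattern(v) for v in values)


def rarest_pattern(values):
    """Return the lowest-frequency pattern, like the bar clicked in step 3."""
    counts = pattern_counts(values)
    return min(counts, key=counts.get)


def rows_matching(values, pattern):
    """Keep only the values with the given pattern, like the filter in step 3."""
    return [v for v in values if word_pattern(v) == pattern]


if __name__ == "__main__":
    for pattern, count in pattern_counts(PHONES).most_common():
        print(count, pattern)
    suspect = rarest_pattern(PHONES)
    print("Rows to review:", rows_matching(PHONES, suspect))
```

In this sketch, the value with the rarest pattern is the one where a name was pasted into the phone number, which mirrors what the chart reveals in the procedure. On real data, a rare pattern is only a hint that a value deserves a manual check, since an unusual but valid format can also have a low frequency.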