Identifying data anomalies

Talend Data Fabric Getting Started Guide

EnrichVersion
6.1
EnrichProdName
Talend Data Fabric
task
Design and Development
Data Quality and Preparation
Data Governance
EnrichPlatform
Talend Studio

The first step in this example is to profile the customer contact information in a MySQL database. The profiling results provides you with statistics about the values within each column.

How to profile address columns

You will use the Profiling perspective of the studio to analyze few customer columns including email and postal. Using out-of-box indicators and patterns on these columns, you can show in the analysis results the matching and non-matching address data, the number of most frequent records for each distinct pattern and the row, duplicate and blank counts in each column.

Defining the column analysis

  1. In the DQ Repository tree view, right-click the Analysis folder and select New Analysis.

    The [Create New Analysis] wizard opens.

  2. Start typing Basic column analysis in the search field, select Basic Column Analysis from the list and click Next.

  3. In the Name field, enter a name for the current column analysis.

    Note

    Avoid using special characters in the item names including:

    "~", "!", "`", "#", "^", "&", "*", "\\", "/", "?", ":", ";", "\"", ".", "(", ")", "'", "¥", "'", """, "«", "»", "<", ">".

    These characters are all replaced with "_" in the file system and you may end up creating duplicate items.

  4. Set column analysis metadata (purpose, description and author name) in the corresponding fields and click Next.

Selecting the address columns and setting sample data

  1. Expand DB connections and browse to the address columns you want to analyze.

  2. Select the columns and click Finish to close the wizard.

    A file for the newly created column analysis is listed under the Analysis node in the DQ Repository tree view, and the analysis editor opens with the analysis metadata.

  3. In the Data preview view, click Refresh Data.

    The data in the selected columns is displayed in the table.

    You can change your data source and your selected columns by using the New Connection and Select Data buttons respectively.

  4. In the Limit field, set to 50 the number for the data records you want to display in the table and use as sample data.

  5. Select n random rows to list 50 random records from the selected columns.

    For further information on column analysis, see the chapter about column analysis in Talend Studio User Guide.

Setting system indicators

  1. From the Data preview view in the analysis editor, click Select indicators to open the [Indicator Selection] dialog box.

  2. Click in the cells next to indicators names to set indicator parameters for the analyzed columns and click OK.

    You want to see the row, blank and duplicate counts in all columns to see how consistent the data is. Also you want to use the Pattern Frequency Table indicator on the email and postal columns in order to compute the number of most frequent records for each distinct pattern or value.

    Indicators are added accordingly to the columns in the Analyzed Columns view.

  3. Click the option icon next to the Blank Count indicator and set 0 in the Upper threshold field.

    Defining thresholds on indicators is very helpful as it will write in red the count of the null values in the analysis results.

    For further information on indicator types and their usage when analyzing data, see Talend Studio User Guide.

Setting patterns

You would want now to match the content of the email column against a standard email format and the postal column against a standard US zip code format. This will define the content, structure and quality of emails and zip codes and give a percentage of the data that match the standard formats and the data that does not match.

  1. In the Analyzed Columns view, click the icon next to email.

  2. In the [Pattern Selector] dialog box, expand Regex and browse to Email Address in the internet folder, and then click OK.

  3. Click the option icon next to the Email Address indicator and set 98.0 in the Lower threshold (%) field.

    If the number of the records that match the pattern is fewer than 98%, it will be written in red in the analysis results.

  4. Do the same to add to the postal column the US Zipcode Validation pattern from the address folder.

    For further information on pattern types and their usage when analyzing data, see Talend Studio User Guide.

Executing the analysis and displaying the profiling results

  1. Save the column analysis in the analysis editor and then press F6 to execute it.

    A group of graphics is displayed in the Graphics panel to the right of the analysis editor showing the results of the column analysis including those for pattern matching.

  2. Click the Analysis Results tab at the bottom of the analysis editor to access a more detail result view.

    These results show the generated graphics for the analyzed columns accompanied with tables that detail the statistic and pattern matching results.

    The results for the email column look as the following:

    The pattern matching results show that about 10% of the email records do not match the standard email pattern. The simple statistic results show that about 8% of the email records are blank and that about 5% are duplicates. And the pattern frequency results give the number of most frequent records for each distinct pattern. This shows that the data is not consistent and you need to correct and cleans the email data before starting your campaign.

    The results for the postal column look as the following:

    The result sets for the postal column give the count of the records that match and those that do not match a standard US zip code format. The results sets also give the blank and duplicate counts and the number of most frequent records for each distinct pattern. These results show that the data is not very consistent.

    Then some percentage of the customers can not be contacted by either email or US mail service. These results show clearly that your data is not very consistent and that it needs to be corrected.

How to view analyzed data

After running the column analysis using the SQL engine and from the Analysis Results view of the analysis editor, you can right-click any of the rows/bars in the result tables/charts and access a view of the actual analyzed data. This could be very helpful to see invalid rows for example and start analyzing what needs to be done to clean such data.

To view and export the analyzed data, do the following:

  1. At the bottom of the analysis editor, click the Analysis Results tab to open a detailed view of the analysis results.

  2. Right-click the data row in the statistic results of the Email column and select View rows for example.

The Data Explorer perspective opens listing the invalid rows in the email column.