Identifying anomalies in data - 6.2

Talend MDM Platform Getting Started Guide

EnrichVersion
6.2
EnrichProdName
Talend MDM Platform
task
Administration and Monitoring
Data Governance
Data Quality and Preparation
Installation and Upgrade
EnrichPlatform
Talend Administration Center
Talend CommandLine
Talend DQ Portal
Talend ESB
Talend Installer
Talend MDM Web UI
Talend Runtime
Talend Studio

The use case explains how to use the Profiling perspective of the studio to analyze customer email addresses and phone numbers. It uses out-of-box indicators and patterns on the columns and shows the matching and non-matching address data. Talend Jobs are then generated on the analysis results to clean customer data and monitor its evolution.

You can then use the Data Explorer perspective to browse the non-matching data.

The sequence of profiling and cleansing customer data involves the following steps:

  1. Create a column analysis on customer email addresses and phone numbers. For further information, see Defining a column analysis.

  2. Connect to the database which holds the customer data from the analysis editor. For further information, see Creating the database connection.

  3. Add indicators to provide simple statistics on data such as row , blank and duplicate counts. For further information, see Setting system indicators.

  4. Add standard patterns against which to match email addresses and phone numbers. For further information, see Setting patterns.

  5. Execute the analysis to show results in tables and charts. For further information, see Showing analysis results.

  6. Access a view of the analyzed data to see invalid records. For further information, see Browsing non-match data.

  7. Generate out-of-box Jobs from analysis results to remove duplicate values from the email and phone columns. For further information, see Removing duplicate values.

  8. Generate out-of-box Jobs from analysis results to remove values which do not respect the standard email format or phone number format from the email and phone columns respectively. For further information, see Removing non-matching values.

Defining a column analysis

You want to create a column analysis from the Profiling perspective of the Studio to examine the Email and Phone columns in a MySQL databases and collect statistics on them. The analysis runs on several columns but each column is analyzed separately and independently.

Prerequisites:

  • You have opened the Profiling perspective in the Studio.

  1. In the DQ Repository tree view, right-click Analyses and select New Analysis.

    The [Create New Analysis] wizard opens.

  2. Start typing Basic column analysis in the search field, select Basic Column Analysis from the list and click Next.

  3. In the Name field, enter a name for the analysis.

    The Name field is mandatory. Do not use spaces or special characters in the analysis name.

  4. Set a purpose and a description for the analysis, and click Finish to open the analysis editor.

    The Purpose and Description fields are not mandatory, but you are advised to fill in this information which is displayed in Detail View when you select the analysis.

    The new analysis is listed under the Analysis folder in the DQ Repository tree view.

Creating the database connection

Before you proceed to analyze customer data, stored in the MySQL database in this example, you must first set up the connection to the database.

Prerequisites:

  • You have opened the Profiling perspective in the Studio.

  • You have created a column analysis and opened the analysis editor. For further information, see Defining a column analysis.

  • You have imported the gettingstarted .sql file which holds the customer data into a MySQL database to which you have access. For further information, see Setting up input data.

  1. In the analysis editor, click the New Connection tab to open the [Create New Connection] wizard.

  2. From the Connection Type list, select DB connections and click Next.

  3. Enter a name for the database connection, gettingstarted in this example.

    The Name field is mandatory. Do not use spaces or special characters in the connection name.

  4. Set a purpose and a description for the connection and click Next.

    The Purpose and Description fields are not mandatory, but you are advised to fill in this information which is displayed in Detail View when you select the connection.

  5. From the DB Type and DB Version lists, select MySQL for this example and the database version respectively.

  6. Enter your login information and database detail in the corresponding fields and click the Check button to verify if your connection is successful.

    If you have not already installed the database driver (.jar file) necessary to use the database, you will have a wizard prompting you to install the relative third-party module, click Download and Install and then close the wizard.

  7. Click Finish to create the database connection, list it under the Metadata node and open a new step in the wizard.

  8. Expand the database connection, click on the table name and select the check boxes of the columns on which you want to create the analysis.

  9. Click OK to close the wizard and list the columns in the analysis editor.

    You can click Refresh Data to display the actual data in the analysis editor.

Setting system indicators

This column analysis uses out-of-box indicators to provide simple statistics such as row, blank and duplicate counts on the Email and Phone columns.

Prerequisites:

  1. In the Data Preview section in the analysis editor, click Select indicators to open the [Indicator Selection] dialog box.

  2. Expand Simple Statistics and select the row, duplicate and blank counts. Click OK to close the wizard.

    You want to see the row, blank and duplicate counts in the Email and Phone columns to see how consistent the data is.

    Indicators are added accordingly to the columns in the Analyzed Columns section.

  3. Click the icon next to the Duplicate Count and Blank Count indicator and set 0 in the Upper threshold field.

    Defining thresholds on the Email and Phone columns is very helpful as it will write in red the count of the duplicate and blank values in the analysis results.

    For further information on indicator types and their usage when analyzing data, see Talend Studio User Guide.

Setting patterns

This column analysis uses predefined patterns to match the content of the Email and Phone columns against standard email and US phone patterns respectively. This defines the content, structure and quality of emails and phone numbers and give a percentage of the data that match the standard formats and the data that does not match.

Prerequisites:

  1. In the Data Preview section in the analysis editor, click the icon next to the Email column to open the [Pattern Selector] dialog box.

    Expand Regex > internet, select the Email Address check box and click OK to close the dialog box.

    The pattern is added to the column in the Analyzed Columns section.

  2. Click the icon next to the Phone column to open the [Pattern Selector] dialog box.

    Expand Regex > phone, select the US phone numbers check box and click OK to close the dialog box.

    The pattern is added to the column in the Analyzed Columns section.

  3. Click the icon next to the Email Address and US phone numbers patterns and set 98.0 in the Lower threshold (%) fields.

    If the number of the records that match the patterns is fewer than 98%, it will be written in red in the analysis results.

    For further information on pattern types and their usage when analyzing data, see Talend Studio User Guide.

Showing analysis results

Once you finalize creating the column analysis and setting the indicators and patterns, you can execute it and display analysis results in tables and charts.

Prerequisites:

  • You have opened the Profiling perspective in the Studio.

  • You have created a column analysis. For further information, see Identifying anomalies in data.

  1. In the Analysis Parameters, select java from the Execution engine list to run the analysis with the Java engine.

    For further information on the analysis execution engines, see Talend Studio User Guide.

  2. In the analysis editor, press F6 to execute the analysis or click the Run button.

    The editor switches to the Analysis Results view. The analysis results show the generated charts for the analyzed columns accompanied with tables that detail the statistic and pattern matching results.

    The results for the Email column look as the following:

    The results for the Phone column look as the following:

    The result sets for the Email and Phone columns give the count of the records that match and those that do not match the standard email pattern and the standard US phone numbers respectively. The results also give the blank and duplicate counts. This shows that the data is not very consistent and that it needs to be corrected.