Detecting anomalies in columns (Functional Dependency Analysis) - 6.5

Talend Open Studio for MDM User Guide

EnrichVersion
6.5
EnrichProdName
Talend Open Studio for MDM
task
Data Governance
Data Quality and Preparation
Design and Development
EnrichPlatform
Talend Studio

This type of analysis helps you to detect anomalies in column dependencies through defining columns as either "determinant" or "dependent" and then analyzing values in dependant columns against those in determinant columns. This analysis supports only database tables.

This type of analysis detects to what extent a value in a determinant column functionally determines another value in a dependant column.

This can help you identify problems in your data, such as values that are not valid. For example, if you analyze the dependency between a column that contains United States Zip Codes and a column that contains states in the United States, the same Zip Code should always have the same state. Running the functional dependency analysis on these two columns will show if there are any violations of this dependency.

Prerequisite(s): At least one database connection is set in the Profiling perspective of the studio. For further information, see Connecting to a database.

Defining the analysis

In the DQ Repository tree view, expand Data Profiling.

  1. Right-click the Analyses folder and select New Analysis.

    The [Create New Analysis] wizard opens.

  2. In the filter field, start typing functional dependency analysis, select Functional Dependency Analysis from the list and click Next.

  3. In the Name field, enter a name for the current analysis.

    Note

    Avoid using special characters in the item names including:

    "~", "!", "`", "#", "^", "&", "*", "\\", "/", "?", ":", ";", "\"", ".", "(", ")", "'", "¥", "'", """, "«", "»", "<", ">".

    These characters are all replaced with "_" in the file system and you may end up creating duplicate items.

  4. Set column analysis metadata (purpose, description and author name) in the corresponding fields and click Next to proceed to the next step.

Selecting the columns as either "determinant" or "dependent"

  1. Expand DB connections, browse to the database you want to analyze, select it and then click Finish to close the [New Analysis] wizard.

    The analysis editor opens with the defined analysis metadata, and a folder for the newly created analysis is displayed under Analyses in the DQ Repository tree view.

    The Data Preview section shows a sample data of all the table columns.

  2. In the Left Column panel, click A Columns Set to open the [Column Selection] dialog box.

    Here you can select the first set of columns against which you want to analyze the values in the dependant columns. You can also drag the columns directly from the DQ Repository tree view to the left column panel.

    In this example, you want to evaluate the records present in the city column and those present in the state_province column against each other to see if state names match to the listed city names and vice versa.

  3. In the [Column Selection] dialog box, expand DB Connections and browse to the column(s) you want to define as determinant columns.

    You can filter the table or column lists by typing the desired text in the Table filter or Column filter fields respectively. The lists will show only the tables/columns that correspond to the text you type in.

  4. Select the check box(es) next to the column(s) you want to analyze and click OK to proceed to the next step.

    The selected column(s) are displayed in the Left Columns panel of the Analyzed Columns Set view. In this example, we select the city column as the determinant column.

  5. Do the same to select the dependant column(s) or drag it/them from the DQ Repository tree view to the Right Columns panel. In this example, we select the state_province column as the dependent column. This relation will show if the state names match to the listed city names.

    If you right-click any of the listed columns in the Analyzed Columns view and select Show in DQ Repository view, the selected column is automatically located under the corresponding connection in the tree view.

  6. Click the Reverse columns tab to automatically reverse the defined columns and thus evaluate the reverse relation, what city names match to the listed state names.

    You can select to connect to a different database by selecting another connection from the Connection list in the Data Preview section. This list shows all the connections created in the Studio. If the columns listed in the Analyzed Columns Set view do not exist in the new database connection you want to set, you will receive a warning message that enables you to continue or cancel the operation.

Finalizing and executing the functional dependency analysis

  1. In the Data Filter view, enter an SQL WHERE clause to filter the data on which to run the analysis, if required.

  2. In the Analysis Parameter view and in the Number of connections per analysis field, set the number of concurrent connections allowed per analysis to the selected database, if required.

    You can set this number according to the database available resources, that is the number of concurrent connections each database can support.

  3. If you have defined context variables in the Contexts view in the analysis editor:

    • use the Data Filter and Analysis Parameter views to set/select context variables to filter data and to decide the number of concurrent connections per analysis respectively.

    • In the Context Group Settings view, select from the list the context environment you want to use to run the analysis.

    For further information about contexts and variables, see Using context variables in analyses.

  4. Save the analysis and press F6 to execute it.

    An information pop-up opens to confirm that the operation is in progress and the analysis editor switches to the Analysis Results view.

    This functional dependency analysis evaluated the records present in the city column and those present in the state_province column against each other to see if the city names match to the listed state names and vice versa. The returned results, in the %Match column, indicate the functional dependency strength for each determinant column. The records that do not match are indicated in red.

    The #Match column in the result table lists the numbers of the distinct determinant values in each of the analyzed columns. The #row column in the analysis results lists the actual relations between the determinant attribute and the dependant attribute. In this example, #Match in the first row of the result table represents the number of distinct cities, and #row represents the number of distinct pairs (city, state_province). Since these two numbers are not equal, then the functional dependency relationship here is only partial and the ratio of the numbers (%Match) measures the actual dependency strength. When these numbers are equal, you have a "strict" functional dependency relationship, that is to say each city appears only once with each state.

    Note

    The presence of null values in either of the two analyzed columns will lessen the "dependency strength". The system does not ignore null values, but rather calculates them as values that violates the functional dependency.

  5. In the Analysis Results view, right-click any of the dependency lines and select:

    Option

    To...

    View valid/invalid rows

    access a list in the SQL editor of all valid/invalid rows measured according to the functional dependencies analysis

    View valid/invalid values

    access a list in the SQL editor of all valid/invalid values measured according to the functional dependencies analysis

    View detailed valid/detailed invalid values

    access a detailed list in the SQL editor of all valid/invalid values measured according to the functional dependencies analysis

    From the SQL editor, you can save the executed query and list it under the Libraries > Source Files folders in the DQ Repository tree view if you click the save icon on the editor toolbar. For more information, see Saving the queries executed on indicators.