Comparing identical columns in different tables - 6.2

Talend MDM Platform Studio User Guide

EnrichVersion
6.2
EnrichProdName
Talend MDM Platform
task
Data Governance
Data Quality and Preparation
Design and Development
EnrichPlatform
Talend Studio

From your studio, you can create an analysis that compares two identical sets of columns in two different tables. This redundancy analysis supports only database tables.

Prerequisite(s): At least one database connection is set in the Profiling perspective of the studio. For further information, see Connecting to a database.

Defining the analysis

  1. In the DQ Repository tree view, expand Data Profiling.

  2. Right-click the Analyses folder and select New Analysis.

    The [Create New Analysis] wizard opens.

  3. In the filter field, start typing redundancy analysis, select Redundancy Analysis from the list and click Next.

  4. In the Name field, enter a name for the current analysis.

    Note

    Avoid using special characters in the item names including:

    "~", "!", "`", "#", "^", "&", "*", "\\", "/", "?", ":", ";", "\"", ".", "(", ")", "'", "¥", "'", """, "«", "»", "<", ">".

    These characters are all replaced with "_" in the file system and you may end up creating duplicate items.

  5. Set the analysis metadata (purpose, description and author name) in the corresponding fields and then click Next.

Selecting the identical columns you want to compare

  1. Expand DB connections and in the desired database, browse to the columns you want to analyze, select them and then click Finish to close the wizard.

    A file for the newly created analysis is listed under the Analysis folder in the DQ Repository tree view. The analysis editor opens with the defined analysis metadata.

    The display of the analysis editor depends on the parameters you set in the [Preferences] window. For more information, see Setting preferences of analysis editors and analysis results.

  2. Click Analyzed Column Sets to open the view where you can set the columns or modify your selection.

    In this example, you want to compare identical columns in the account and account_back tables.

  3. From the Connection list, select the database connection relevant to the database to which you want to connect.

    You can find in this list all the database connections you create and centralize in the Studio repository.

  4. Click A column Set to open the [Column Selection] dialog box.

  5. Browse the catalogs/schemas in your database connection to reach the table holding the columns you want to analyze.

    You can filter the table or column lists by typing the desired text in the Table filter or Column filter fields respectively. The lists will show only the tables/columns that correspond to the text you type in.

  6. Click the table name to list all its columns in the right-hand panel of the [Column Selection] dialog box.

  7. In the list to the right, select the check boxes of the column(s) you want to analyze and click OK to proceed to the next step.

    You can drag the columns to be analyzed directly from the DQ Repository tree view to the editor.

    If you right-click any of the listed columns in the Analyzed Columns view and select Show in DQ Repository view, the selected column will be automatically located under the corresponding connection in the tree view.

  8. Click B Column B Set and follow the same steps to select the second set of columns or drag it to the right column panel.

  9. Select the Compute only number of A rows not in B check box if you want to match the data from the A set against the data from the B set and not vice versa.

Finalizing and executing the analysis

  1. In the Data Filter view, enter an SQL WHERE clause to filter the data on which to run the analysis, if required.

  2. In the Analysis Parameter view and in the Number of connections per analysis field, set the number of concurrent connections allowed per analysis to the selected database, if required.

    You can set this number according to the database available resources, that is the number of concurrent connections each database can support.

  3. If you have defined context variables in the Contexts view in the analysis editor:

    • use the Data Filter and Analysis Parameter views to set/select context variables to filter data and to decide the number of concurrent connections per analysis respectively.

    • In the Context Group Settings view, select from the list the context environment you want to use to run the analysis.

    For further information about contexts and variables, see Using context variables in analyses.

  4. Save the analysis and press F6 to execute it.

    A confirmation message is displayed.

  5. Click OK if you want to continue the operation.

    The Analysis Results view opens showing the analysis results.

    In this example, 72.73% of the data present in the columns in the account table could be matched with the same data in the columns in the account_back table.

Through this view, you can also access the actual analyzed data via the Data Explorer.

To access the analyzed data rows, right-click any of the lines in the table and select:

Option

To...

View match rows

access a list of all rows that could be matched in the two identical column sets

View not match rows

access a list of all rows that could not be matched in the two identical column sets

View rows

access a list of all rows in the two identical column sets

Warning

The data explorer does not support connections which has empty user name, such as Single sign-on of MS SQL Server. If you analyze data using such connection and you try to view data rows in the Data Explorer perspective, a warning message prompt you to set your connection credentials to the SQL Server.

The figure below illustrates the data explorer list of all rows that could be matched in the two sets, eight in this example.

From the SQL editor, you can save the executed query and list it under the Libraries > Source Files folders in the DQ Repository tree view if you click the save icon on the editor toolbar. For more information, see Saving the queries executed on indicators.

The figure below illustrates the data explorer list of all rows that could not be matched in the two sets, three in this example.

For more information about the data explorer Graphical User Interface, see Main window of the data explorer.