Skip to main content Skip to complementary content

Comparing identical columns in different tables

From Talend Studio, you can create an analysis that compares two identical sets of columns in two different tables. This redundancy analysis supports only database tables.

Before you begin

At least one database connection is set in the Profiling perspective. For further information, see Connecting to a database.

Through this view, you can also access the actual analyzed data via the Data Explorer.

To access the analyzed data rows, right-click any of the lines in the table and select an action to perform:
  • View match rows: access a list of all rows that could be matched in the two identical column sets.
  • View not match rows: access a list of all rows that could not be matched in the two identical column sets.
  • View rows: access a list of all rows in the two identical column sets.
Information noteWarning: The data explorer does not support connections that have an empty username, such as Single sign-on of MS SQL Server. If you analyze data using such connection and you try to view data rows in the Data Explorer perspective, a warning message prompt you to set your connection credentials to the SQL Server.

The figure below illustrates the data explorer list of all rows that could be matched in the two sets, eight in this example.

Number of rows that match in the two sets.

From the SQL editor, you can save the executed query and list it under the Libraries > Source Files folders in the DQ Repository tree view if you click the save icon on the editor toolbar. For more information, see Saving the queries executed on indicators.

The figure below illustrates the data explorer list of all rows that could not be matched in the two sets, three in this example.

Number of rows that match in the two sets.

Defining the redundancy analysis

Procedure

  1. In the DQ Repository tree view, expand Data Profiling.
  2. Right-click the Analyses folder and select New Analysis.
    Contextual menu of the Analyses node.
    The Create New Analysis wizard opens.
  3. In the filter field, start typing redundancy analysis, select Redundancy Analysis from the list and click Next.
  4. In the Name field, enter a name for the current analysis.
    Information noteImportant:

    Do not use the following special characters in the item names: ~ ! ` # ^ * & \\ / ? : ; \ , . ( ) ¥ ' " « » < >

    These characters are all replaced with "_" in the file system and you may end up creating duplicate items.

  5. Set the analysis metadata (purpose, description, and author name) in the corresponding fields and click Finish.

Selecting the identical columns you want to compare

Procedure

  1. Expand DB connections and in the desired database, browse to the columns you want to analyze, select them and then click Finish to close the wizard.
    A file for the newly created analysis is listed under the Analysis folder in the DQ Repository tree view. The analysis editor opens with the defined analysis metadata.
    Overview of the Analysis Metadata section in the Analysis Settings tab.
    The display of the analysis editor depends on the parameters you set in the Preferences window. For more information, see Setting preferences of analysis editors and analysis results.
  2. Click Analyzed Column Sets to open the view where you can set the columns or modify your selection.
    In this example, you want to compare identical columns in the account and account_back tables.
    Overview of the Analysis Column Sets section in the Analysis Settings tab.
  3. From the Connection list, select the database connection relevant to the database to which you want to connect.
    You can find in this list all the database connections you create and centralize in the Talend Studio repository.
  4. Click A column Set to open the Column Selection dialog box.
  5. Browse the catalogs/schemas in your database connection to reach the table holding the columns you want to analyze.
    You can filter the table or column lists by typing the desired text in the Table filter or Column filter fields respectively. The lists will show only the tables/columns that correspond to the text you type in.
  6. Click the table name to list all its columns in the right-hand panel of the Column Selection dialog box.
  7. In the list to the right, select the check boxes of the columns you want to analyze and click OK to proceed to the next step.
    Selection of the columns to be analyzed.
    You can drag the columns to be analyzed directly from the DQ Repository tree view to the editor.
    If you right-click any of the listed columns in the Analyzed Columns view and select Show in DQ Repository view, the selected column will be automatically located under the corresponding connection in the tree view.
  8. Click B Column B Set and follow the same steps to select the second set of columns or drag it to the right column panel.
  9. If you want to match the data from the A set against the data from the B set and not vice versa, select the Compute only number of A rows not in B check box.
  10. Select the Ignore Null check box if you want to ignore the NULL values when matching.

Finalizing and executing the analysis

Procedure

  1. In the Data Filter view, enter an SQL WHERE clause to filter the data on which to run the analysis, if required.
  2. In the Analysis Parameter view and in the Number of connections per analysis field, set the number of concurrent connections allowed per analysis to the selected database, if required.
    You can set this number according to the database available resources, that is the number of concurrent connections each database can support.
  3. If you have defined context variables in the Context view in the analysis editor, complete the following steps:
    1. Use the Data Filter and Analysis Parameter views to set/select context variables to filter data and to decide the number of concurrent connections per analysis respectively.
    2. In the Context Settings view, select from the list the context environment you want to use to run the analysis.
    For further information about contexts and variables, see Using context variables in analyses.
  4. Save the analysis and press F6 to execute it.
    A confirmation message is displayed.
  5. Click OK if you want to continue the operation.

Results

The Analysis Results view opens showing the analysis results.

In this example, 72.73% of the data present in the columns in the account table could be matched with the same data in the columns in the account_back table.

Results of the analysis in the Analysis Results section.

Did this page help you?

If you find any issues with this page or its content – a typo, a missing step, or a technical error – let us know how we can improve!