Creating a table analysis with an SQL business rule with a join condition - Cloud - 8.0

Talend Studio User Guide

Version
Cloud
8.0
Language
English
Product
Talend Big Data
Talend Big Data Platform
Talend Cloud
Talend Data Fabric
Talend Data Integration
Talend Data Management Platform
Talend Data Services Platform
Talend ESB
Talend MDM Platform
Talend Real-Time Big Data Platform
Module
Talend Studio
Content
Design and Development
Last publication date
2024-02-29
Available in...

Big Data Platform

Cloud API Services Platform

Cloud Big Data Platform

Cloud Data Fabric

Cloud Data Management Platform

Data Fabric

Data Management Platform

Data Services Platform

MDM Platform

Real-Time Big Data Platform

About this task

In some cases, you may need to analyze database tables or views using a SQL business rule that has a join clause that combines records from two tables in a database. This join clause will compare common values between two columns and give a result dataset. Then the data in this set will be analyzed against the business rule.

Depending on the analyzed data and the join clause itself, several different results of the join are possible, for example #match + #no match > #row count,  #match + #no match < #row count or #match + #no match = #row count.

The example below explains in detail the case where the dataset in the join result is bigger than the row count (#match + #no match > #row count) which indicates duplicates in the processed data.

Before you begin:
  • At least one SQL business rule has been created in the Profiling perspective of Talend Studio.
  • At least one database connection is set in the Profiling perspective of Talend Studio.

In this example, you want to add the SQL business rule created in Creating an SQL business rule to a Person table that contains the age and name columns. This SQL business rule will match the customer ages to define those who are older than 18. The business rule also has a join condition that compares the "name" value between the Person table and another table called Person_Ref through analyzing a common column called name.

Below is a capture of both tables:
Example of the tables used in the procedure.
Below is a capture of the result of the join condition between these two tables:
Results of the join condition between the two tables.

The result set may give duplicate rows as it is the case here. Thus the results of the analysis may become a bit harder to understand. The analysis here will not analyze the rows of the table that match the business rule but it will run on the result set given by the business rule.

Procedure

  1. Define the table analysis and select the table you want to analyze.
    The selected table is listed in the Analyzed Tables view.
    Overview of the Analyzed Tables section in the Analysis Results tab.
  2. To add the business rule with the join condition to the selected table, click Add rule icon next to the table name.
    This business rule has a join condition that compares the "name" value between two different tables through analyzing a common column.
  3. Save the analysis and press F6 to execute it.

    An information pop-up opens to confirm that the operation is in progress and the analysis editor switches to the Analysis Results view.

    Graphs showing the matching and non-matching values according to the business rule.

    All age records in the selected table are evaluated against the defined SQL business rule. The analysis results has two bar charts: the first is a row count indicator that shows the number of rows in the analyzed table, and the second is a match and non-match indicator that indicates the age records from the "analyzed result set" that do not match the criteria (age below 18).

    Note: If a join condition is used in the SQL business rule, the number of the rows of the join (#match + # no match) can be different from the number of the analyzed rows (row count).
  4. Right-click the Row Count row in the first table and select View rows.

    The SQL editor opens in Talend Studio to display a list of the analyzed rows.

    List of the analyzed rows in the SQL editor.
  5. Right-click the business rule results in the second table, or right-click the result bar in the chart itself.
  6. Select an option:
    Option Description
    View valid rows access a list in the SQL editor of all valid rows measured against the pattern used on the selected table
    View invalid rows access a list in the SQL editor of all invalid rows measured against the pattern used on the selected table
    Analyze duplicates generates a ready-to-use analysis that analyzes duplicates in the table, if any, and give the row and duplicate counts.
    Generate Job generate a Job that uses the Extract Load Transform process to write the valid or invalid rows of the analyzed table in an output file.

    This option is available for Microsoft SQL Server, MySQL, Oracle, and PostgreSQL connections.

    Below is the list of the invalid rows in the analyzed table.

    List of the invalid rows in the SQL editor.
  7. In the SQL editor, click the save icon on the toolbar to save the executed query on the SQL business rule and list it under the Libraries > Source Files folder in the DQ Repository tree view.
  8. To better understand the Business Rule Statistics bar chart in the analysis results, do the following:
    1. In the analysis editor, right-click the business rule and select View executed query.
      Contextual menu of a business rule in the Analyzed Tables section.

      The SQL editor opens in Talend Studio.

      Overview of the SQL editor.
    2. Modify the query in the top part of the editor to read as the following: SELECT * FROM `my_person_joins`.`person` Person JOIN `my_person_joins`.`Person_ref` Person_ref ON (Person.`name`=Person_ref.`Name`).
      This will list the result dataset of the join condition in the editor.
    3. In the top left corner of the editor, click Execute SQL icon to execute the query.

      The query result, that is the analyzed result set, is listed in the bottom part of the editor.

    4. Go back to the analysis editor and click the Analysis Results tab at the bottom of the editor to open a detail view of the analysis results.
      Detailed view of the analysis results for a specific table.

      The analyzed result set may contain more or fewer rows than the analyzed table. In this example, the number of match and non-match records (5 + 2 = 7) exceeds the number of analyzed records (6) because the join of the two tables generates more rows than expected.

      Here 5 rows (71.43%) match the business rule and 2 rows do not match. Because the join generates duplicate rows, this result does not mean that 5 rows of the analyzed table match the business rule. It only means that 5 rows among the 7 rows of the result set match the business rule. Actually, some rows of the analyzed tables may not be even analyzed against the business rule. This happens when the join excludes these rows. For this reason, it is advised to check for duplicates on the columns used in the join of the business rule in order to make sure that the join does not remove or add rows in the analyzed result set.  Otherwise the interpretation of the result is more complex.

      In the Analysis Results view, if the number of match and non-match records exceeds the number of analyzed records, you can generate a ready-to-use analysis that will analyze the duplicates in the selected table.