Creating a table analysis with SQL business rules - 6.5

Talend Data Fabric Studio User Guide

EnrichVersion
6.5
EnrichProdName
Talend Data Fabric
task
Data Quality and Preparation
Design and Development
EnrichPlatform
Talend Studio

You can set up SQL business rules based on WHERE clauses and add them as indicators to table analyses. You can as well define expected thresholds on the SQL business rule indicator's value. The range defined is used for measuring the quality of the data in the selected table.

Note

It is also possible to create an analysis with SQL business rules on views in a database. The procedure is exactly the same as that for tables. For more information, see How to create a table analysis with an SQL business rule with a join condition.

Note

When you use the Java engine to run a column set analysis on big sets or on data with many problems, it is advisable to define a maximum memory size threshold to execute the analysis as you may end up with a Java heap error. For more information, see Defining the maximum memory size threshold.

How to create an SQL business rule

SQL business rules can be simple rules with WHERE clauses. They can also have join conditions in them to combine common values between columns in database tables and give a result data set.

For an example of a table analysis with a simple business rule, see How to create a table analysis with a simple SQL business rule. For an example of a table analysis with a business rule that has a join condition, see How to create a table analysis with an SQL business rule with a join condition.

Prerequisite(s): You have selected the Profiling perspective of the studio.

Creating the business rule

  1. In the DQ Repository tree view, expand Libraries > Rules.

  2. Right-click SQL.

  3. From the contextual menu, select New Business Rule to open the [New Business Rule] wizard.

    Consider as an example that you want to create a business rule to match the age of all customers listed in the age column of a defined table. You want to filter all the age records to identify those that fulfill the specified criterion.

  4. In the Name field, enter a name for this new SQL business rule.

    Note

    Avoid using special characters in the item names including:

    "~", "!", "`", "#", "^", "&", "*", "\\", "/", "?", ":", ";", "\"", ".", "(", ")", "'", "¥", "'", """, "«", "»", "<", ">".

    These characters are all replaced with "_" in the file system and you may end up creating duplicate items.

  5. Set other metadata (purpose, description and author name) in the corresponding fields and then click Next.

  6. In the Where clause field, enter the WHERE clause to be used in the analysis.

    In this example, the WHERE clause is used to match the records where customer age is greater than 18.

  7. Click Finish to close the [New Business Rule] wizard.

    A sub-folder for this new SQL business rule is displayed under the Rules folder in the DQ Repository tree view. The SQL business rule editor opens with the defined metadata.

    Note

    In the SQL business rule editor, you can modify the WHERE clause or add a new one directly in the Data quality rule view.

  8. If required, set a value in the Criticality Level field.

    This will act as an indicator to measure the importance of the SQL business rule. This value is saved in the database and can be used later in the Talend Data Quality Portal.

    For more information about the Portal, see the Talend Data Quality Portal User and Administrator Guide. For more information about installing the Portal, see the Talend Installation Guide.

Creating a join condition

This step is not obligatory. You can decide to create a business rule without a join condition and use it with only the WHERE clause in the table analysis.

For an example of a table analysis with a simple business rule, see How to create a table analysis with a simple SQL business rule. For an example of a table analysis with a business rule that has a join condition, see How to create a table analysis with an SQL business rule with a join condition.

  1. In the SQL business rule editor, click Join Condition to open the corresponding view.

  2. Click the [+] button to add a row in the Join Condition table.

  3. Expand the Metadata folder in the DQ Repository tree view, and then browse to the columns in the tables on which you want to create the join condition.

    This join condition will define the relationship between a table A and a table B using a comparison operator on a specific column in both tables. In this example, the join condition will compare the "name" value in the Person and Person_Ref tables that have a common column called name.

    Note

    You must be careful when defining the join clause. In order to get an easy to understand result, it is advisable to make sure that the joined tables do not have duplicate values. For further information, see How to create a table analysis with an SQL business rule with a join condition.

  4. Drop the columns from the DQ Repository tree view to the Join Condition table.

    A dialog box is displayed prompting you to select where to place the column: in TableA or in TableB.

  5. Select a comparison condition operator between the two columns in the tables and save your modifications.

    In the analysis editor, you can now drop this newly created SQL business rule onto a table that has an "age" column. When you run the analysis, the join to the second column is done automatically.

    Warning

    The table to which to add the business rule must contain at least one of the columns used in the SQL business rule.

For more information about using SQL business rules as indicators on a table analysis, see Creating a table analysis with SQL business rules.

How to edit an SQL business rule

Prerequisite(s): You have selected the Profiling perspective of the studio.

To edit an SQL business rule, do the following:

  1. In the DQ Repository tree view, expand Libraries > Rules > SQL.

  2. Right-click the SQL business rule you want to open and select Open from the contextual menu.

    The SQL business rule editor opens displaying the rule metadata.

  3. Modify the business rule metadata or the WHERE clause as required.

  4. Click the save icon on top of the editor to save your modifications.

    The SQL business rule is modified as defined.

How to create a table analysis with a simple SQL business rule

You can create analyses on either tables or views in a database using SQL business rules. The procedure for creating such analysis is the same for a table or a view.

Prerequisite(s):

  • At least one SQL business rule has been created in the Profiling perspective of the studio. For further information about creating SQL business rules, see How to create an SQL business rule

  • At least one database connection is set in the Profiling perspective of the studio. For further information, see Connecting to a database.

In this example, you want to add the SQL business rule created in How to create an SQL business rule to a top_custom table that contains an age column. This SQL business rule will match the customer ages to define those who are older than 18.

Defining the analysis

In the DQ Repository tree view, expand Data Profiling.

  1. Right-click the Analyses folder and select New Analysis.

    The [Create New Analysis] wizard opens.

  2. In the filter field, start typing business rule analysis, select Business Rule Analysis and click Next.

  3. In the Name field, enter a name for the current analysis.

    Note

    Avoid using special characters in the item names including:

    "~", "!", "`", "#", "^", "&", "*", "\\", "/", "?", ":", ";", "\"", ".", "(", ")", "'", "¥", "'", """, "«", "»", "<", ">".

    These characters are all replaced with "_" in the file system and you may end up creating duplicate items.

  4. Set column analysis metadata (purpose, description and author name) in the corresponding fields and click Next to proceed to the next step.

Selecting the table you want to analyze

  1. Expand DB Connections, browse to the table to be analyzed and select it.

  2. Click Finish to close the [Create New Analysis] wizard.

    Note

    You can directly select the data quality rule you want to add to the current analysis by clicking the Next button in the [New Analysis] wizard or you can do that at later stage in the Analyzed Tables view as shown in the following steps.

    The analysis editor opens with the defined analysis metadata, and a folder for the newly created analysis is displayed under Analyses in the DQ Repository tree view.

  3. If required:

    • Click Select Tables to open the [Table Selection] dialog box and select new table(s) to analyze.

      You can filter the table or column lists by typing the desired text in the Table filter or Column filter fields respectively. The lists will show only the tables/columns that correspond to the text you type in.

    • Select another connection from the Connection list to connect to a different database. This list has all the connections created in the Studio. If the tables listed in the Analyzed Tables view do not exist in the new database connection you want to set, you receive a warning message that enables you to continue or cancel the operation.

  4. Right-click any of the listed columns in the Analyzed Columns view and select Show in DQ Repository view, the selected column is automatically located under the corresponding connection in the tree view.

Selecting the business rule

  1. Click the icon next to the table name where you want to add the SQL business rule.

    The [Business Rule Selector] dialog box is displayed.

  2. Expand the Rules folder and select the check box(es) of the predefined SQL business rule(s) you want to use on the corresponding table(s).

  3. Click OK.

    The selected business rule is listed below the table name in the Analyzed Tables view.

    You can also drag the business rule directly from the DQ Repository tree view to the table in the analysis editor.

  4. If required, right-click the business rule and select View executed query.

    The SQL editor opens in the Studio to display the query.

  5. If required, click Data Filter in the analysis editor to open the view where you can set a filter on the data of the analyzed table(s).

  6. Save the analysis and press F6 to execute it.

    An information pop-up opens to confirm that the operation is in progress and the analysis editor switches to the Analysis Results view.

    All age records in the selected table are evaluated against the defined SQL business rule. The analysis results has two bar charts: the first is a row count indicator that shows the number of rows in the analyzed table, and the second is a match and non-match indicator that indicates in red the age records from the "analyzed result set" that do not match the criteria (age below 18).

  7. Right-click the business rule results in the second table, or right-click the result bar in the chart itself and select:

    Option

    To...

    View valid rows

    access a list in the SQL editor of all valid rows measured against the pattern used on the selected table

    View invalid rows

    access a list in the SQL editor of all invalid rows measured against the pattern used on the selected table

    Analyze duplicates

    generates a ready-to-use analysis that analyzes duplicates in the table, if any, and give the row and duplicate counts. For further information, see How to generate an analysis on the join results to analyze duplicates.

    Generate Job

    generate a Job that uses the Extract Load Transform process to write the valid or invalid rows of the analyzed table in an output file. For further information, see Recuperating valid and invalid rows in a table analysis .

    You can carry out a table analysis in a direct and more simplified way. For further information, see How to create a table analysis with an SQL business rule in a shortcut procedure.

How to create a table analysis with an SQL business rule with a join condition

In some cases, you may need to analyze database tables or views using an SQL business rule that has a join clause that combines records from two tables in a database. This join clause will compare common values between two columns and give a result data set. Then the data in this set will be analyzed against the business rule.

Depending on the analyzed data and the join clause itself, several different results of the join are possible, for example #match + #no match > #row count,  #match + #no match < #row count or #match + #no match = #row count.

The example below explains in detail the case where the data set in the join result is bigger than the row count (#match + #no match > #row count) which indicates duplicates in the processed data.

Prerequisite(s):

  • At least one SQL business rule has been created in the Profiling perspective of the studio. For further information about creating SQL business rules, see How to create an SQL business rule

  • At least one database connection is set in the Profiling perspective of the studio. For further information, see Connecting to a database.

In this example, you want to add the SQL business rule created in How to create an SQL business rule to a Person table that contains the age and name columns. This SQL business rule will match the customer ages to define those who are older than 18. The business rule also has a join condition that compares the "name" value between the Person table and another table called Person_Ref through analyzing a common column called name.

Below is a capture of both tables:

Below is a capture of the result of the join condition between these two tables:

The result set may give duplicate rows as it is the case here. Thus the results of the analysis may become a bit harder to understand. The analysis here will not analyze the rows of the table that match the business rule but it will run on the result set given by the business rule.

  1. Define the table analysis and select the table you want to analyze as outlined in How to create a table analysis with a simple SQL business rule.

    The selected table is listed in the Analyzed Tables view.

  2. Add the business rule with the join condition to the selected table through clicking the icon next to the table name.

    This business rule has a join condition that compares the "name" value between two different tables through analyzing a common column. For further information about SQL business rules, see How to create an SQL business rule.

  3. Save the analysis and press F6 to execute it.

    An information pop-up opens to confirm that the operation is in progress and the analysis editor switches to the Analysis Results view.

    All age records in the selected table are evaluated against the defined SQL business rule. The analysis results has two bar charts: the first is a row count indicator that shows the number of rows in the analyzed table, and the second is a match and non-match indicator that indicates the age records from the "analyzed result set" that do not match the criteria (age below 18).

    Note

    If a join condition is used in the SQL business rule, the number of the rows of the join (#match + # no match) can be different from the number of the analyzed rows (row count). For further information, see How to create a table analysis with an SQL business rule with a join condition.

  4. Right-click the Row Count row in the first table and select View rows.

    The SQL editor opens in the Studio to display a list of the analyzed rows.

  5. Right-click the business rule results in the second table, or right-click the result bar in the chart itself and select:

    Option

    To...

    View valid rows

    access a list in the SQL editor of all valid rows measured against the pattern used on the selected table

    View invalid rows

    access a list in the SQL editor of all invalid rows measured against the pattern used on the selected table

    Analyze duplicates

    generates a ready-to-use analysis that analyzes duplicates in the table, if any, and give the row and duplicate counts. For further information, see How to generate an analysis on the join results to analyze duplicates.

    Generate Job

    generate a Job that uses the Extract Load Transform process to write the valid or invalid rows of the analyzed table in an output file. For further information, see Recuperating valid and invalid rows in a table analysis .

    Below is the list of the invalid rows in the analyzed table.

  6. In the SQL editor, click the save icon on the toolbar to save the executed query on the SQL business rule and list it under the Libraries > Source Files folder in the DQ Repository tree view.

    For further information, see Saving the queries executed on indicators.

To better understand the Business Rule Statistics bar chart in the analysis results, do the following:

  1. In the analysis editor, right-click the business rule and select View executed query.

    The SQL editor opens in the Studio.