Using regular expressions and SQL patterns in a column analysis - 6.3

Talend MDM Platform Studio User Guide

EnrichVersion
6.3
EnrichProdName
Talend MDM Platform
task
Data Governance
Data Quality and Preparation
Design and Development
EnrichPlatform
Talend Studio

You can use regular expressions or SQL patterns in column analyses. These expressions and patterns will help you define the content, structure and quality of the data included in the analyzed columns.

For more information on regular expressions and SQL patterns, see Patterns and indicators and Table analyses.

How to add a regular expression or an SQL pattern to a column analysis

You can add to any column analysis one or more regular expressions or SQL patterns against which you can match the content of the column to be analyzed.

Warning

If the database you are using does not support regular expressions or if the query template is not defined in the studio, you need first to declare the user defined function and define the query template before being able to add any of the specified patterns to the column analysis. For more information, see Managing User-Defined Functions in databases.

Prerequisite(s): You have selected the Profiling perspective in the studio. A column analysis is open in the analysis editor.

To add a regular expression or an SQL pattern to a column analysis, do the following:

  1. Follow the steps outlined in How to define the columns to be analyzed to create a column analysis.

  2. In the Analyze Columns view in the analysis editor, click the icon next to the column name to which you want to add a regular expression or an SQL pattern, the email column in this example.

    The [Pattern Selector] dialog box opens.

  3. Expand Patterns and browse to the regular expression or/and the SQL patterns you want to add to the column analysis.

  4. Select the check box(es) of the expression(s) or pattern(s) you want to add to the selected column.

  5. Click OK to proceed to the next step.

    The added regular expression(s) or SQL pattern(s) are displayed under the analyzed column in the Analyzed Column list.

    You can add a regular expression or an SQL pattern to a column simply by a drag and drop operation from the DQ Repository tree view onto the analyzed column.

  6. Press F6 to execute the column analysis.

    The editor switches to the Analysis result view. The results of the column analysis include those for pattern matching.

If the regular expression you add to the column analysis is defined for a database, you will be able to generate ELT Jobs to recuperate valid and invalid rows.

If the regular expression you add to the column analysis is defined for the Java or the Default language, you will be able to generate an ETL Job to handle rows.

For further information on how to generate a Job to recuperate valid, invalid or both types of rows, see Recuperating valid and invalid rows in a column analysis . For further information on defining regular expressions, see How to create a new regular expression or SQL pattern

How to edit a pattern in the column analysis

Prerequisite(s): You have selected the Profiling perspective in the studio. A column analysis is open in the analysis editor.

To edit a pattern added to an analyzed column:

  1. In the Analyze Columns view in the analysis editor, right-click the pattern you want to edit and select Edit pattern from the contextual menu.

    The pattern editor opens showing the selected pattern metadata.

  2. In the Pattern Definition view, edit the pattern definition, or change the selected database, or add other patterns specific to available databases using the [+] button.

    If the regular pattern is simple enough to be used in all databases, select Default in the list.

    When you edit a pattern through the analysis editor, you modify the pattern in the studio repository. Make sure that your modifications are suitable for all other analyses that may be using the modified pattern.

  3. Save your changes.

How to view the data analyzed against patterns

When you add one or more patterns to an analyzed column, you check all existing data in the column against the specified pattern(s). After the execution of the column analysis, using the java or the SQL engine you can access a list of all the valid/invalid data in the analyzed column.

When you use the Java engine to run the analysis, the view of the actual data will open in the Profiling perspective of the studio. While if you use the SQL engine to execute the analysis, the view of the actual data will open in the Data Explorer perspective.

Prerequisite(s):

  • You have selected the Profiling perspective in the studio.

  • You have installed in the studio the SQL explorer libraries that are required for data quality.

If you do not install these libraries, the Data Explorer perspective will be missing from the studio and many features will not be available. For further information about identifying and installing external modules, see the Talend Installation Guide.

To view the actual data in the column analyzed against a specific pattern, do the following:

  1. Follow the steps outlined in How to define the columns to be analyzed and How to add a regular expression or an SQL pattern to a column analysis to create a column analysis that uses a pattern.

  2. Execute the column analysis.

    The editor switches to the Analysis Results view.

  3. Browse to Pattern Matching under the name of the analyzed column.

    The generated graphic for the pattern matching is displayed accompanied with a table that details the matching results.

  4. Right-click the pattern line in the Pattern Matching table and select:

    Option

    To...

    View valid/invalid values

    open a view of all valid/invalid values measured against the pattern used on the selected column

    View valid/invalid rows

    open a view of all valid/invalid rows measured against the pattern used on the selected column

    Generate Jobs

    generate ready-to-use Jobs that will recuperate valid/invalid rows or both types of rows in the selected column and write them in output files or databases.

    For more information, see Recuperating matching and non-matching rows

When using the SQL engine, the view opens in the Data Explorer perspective listing valid/invalid rows or values of the analyzed data according to the limits set in the data explorer.

This explorer view will also give some basic information about the analysis itself. Such information is of great help when working with multiple analysis at the same time.

The data explorer does not support connections which has empty user name, such as Single sign-on of MS SQL Server. If you analyze data using such connection and you try to view data rows and values in the Data Explorer perspective, a warning message prompt you to set your connection credentials to the SQL Server.

When using the Java engine, the view opens in the Profiling perspective of the studio listing the number of valid/invalid data according to the row limit you set in the Analysis parameters view of the analysis editor. For more information, see Using the Java or the SQL engine.

You can save the executed query and list it under the Libraries > Source Files folders in the DQ Repository tree view if you click the save icon on the SQL editor toolbar. For more information, see Saving the queries executed on indicators.

For more information about the data explorer Graphical User Interface, see Main window of the data explorer.

How to recuperate valid and /or invalid rows

When you add one or more patterns to an analyzed column, you check all existing data in the column against the specified pattern(s). After the execution of the column analysis, you can generate a ready-to-use Job that recuperates the valid, invalid or both types of rows and write them in output files or databases.

For further information, see Recuperating valid and invalid rows in a column analysis .