Creating a simple table analysis (Column Set Analysis) - 6.5

Talend Data Fabric Studio User Guide

EnrichVersion
6.5
EnrichProdName
Talend Data Fabric
task
Data Quality and Preparation
Design and Development
EnrichPlatform
Talend Studio

You can analyze the content of a set of columns. This set can represent only some of the columns in the defined table or the table as a whole.

The analysis of a set of columns focuses on a column set (full records) and not on separate columns as it is the case with the column analysis. The statistics presented in the analysis results (row count, distinct count, unique count and duplicate count) are measured against the values across all the data set and thus do not analyze the values separately within each column.

With the Java engine, you may also apply patterns on each column and the result of the analysis will give the number of records matching all the selected patterns together. For further information, see How to add patterns to the analyzed columns.

Note

When you use the Java engine to run a column set analysis on big sets or on data with many problems, it is advisable to define a maximum memory size threshold to execute the analysis as you may end up with a Java heap error. For more information, see Defining the maximum memory size threshold.

How to create an analysis of a set of columns using patterns

This type of analysis provides simple statistics on the full records of the analyzed column set and not on the values within each column separately. For more information about simple statistic indicators, see Simple statistics.

With this analysis, you can use patterns to validate the full records against all patterns and have a single-bar result chart that shows the number of the rows that match "all" the patterns..

How to define the set of columns to be analyzed

Prerequisite(s): At least one database connection is set in the Profiling perspective of the studio. For further information, see Connecting to a database.

Defining the analysis

  1. In the DQ Repository tree view, expand Data Profiling.

  2. Right-click the Analyses folder and select New Analysis.

    The [Create New Analysis] wizard opens.

  3. In the filter field, start typing column set analysis, select Column Set Analysis from the list and click Next.

  4. In the Name field, enter a name for the current analysis.

    Note

    Avoid using special characters in the item names including:

    "~", "!", "`", "#", "^", "&", "*", "\\", "/", "?", ":", ";", "\"", ".", "(", ")", "'", "¥", "'", """, "«", "»", "<", ">".

    These characters are all replaced with "_" in the file system and you may end up creating duplicate items.

  5. Set column analysis metadata (purpose, description and author name) in the corresponding fields and click Next to proceed to the next step.

Selecting the set of columns you want to analyze

  1. Expand DB connections.

  2. In the desired database, browse to the columns you want to analyze, select them and then click Finish to close this [New analysis] wizard.

    The analysis editor opens with the defined analysis metadata, and a folder for the newly created analysis is displayed under Analyses in the DQ Repository tree view.

    A sample data is displayed in the Data Preview section and the selected columns are displayed in the Analyzed Column section of the analysis editor.

    In this example, you want to analyze a set of six columns in the customer table: account number (account_num), education (education), email (email), first name (fname), second name (Iname) and gender (gender). The statistics presented in the analysis results are the row count, distinct count, unique count and duplicate count which all apply on records (values of a set of columns).

  3. In the Data preview section, select:

    Option

    To...

    New Connection

    open a wizard and create or change the connection to the data source from within the editor.

    For further information about how to create a connection to data sources, see Connecting to a database and Connecting to a file.

    The Connection field on top of this section lists all the connections created in the Studio.

    Select Columns

    open the Column Selection dialog box where you can select the columns to analyze or change the selection of the columns listed in the table.

    From the open dialog box, you can filter the table or column lists by using the Table filter or Column filter fields respectively.

    n first rows

    or

    n random rows

    list in the table N first data records from the selected columns or list N random records from the selected columns.

    Refresh Data

    display the data in the selected columns according to the criteria you set.

    Run with sample data

    run the analysis only on the sample data set in the Limit field.

  4. In the Limit field, set the number for the data records you want to display in the table and use as sample data.

  5. If required, right-click any of the listed columns in the Analyzed Columns view and select Show in DQ Repository view. The selected column is automatically located under the corresponding connection in the tree view.

  6. Use the delete, move up or move down buttons to manage the analyzed columns when necessary.

How to add patterns to the analyzed columns

You can add patterns to one or more of the analyzed columns to validate the full record (all columns) against all the patterns, and not to validate each column against a specific pattern as it is the case with the column analysis. The results chart is a single bar chart for the totality of the used patterns. This chart shows the number of the rows that match "all" the patterns.

Warning

Before being able to use a specific pattern with a set of columns analysis, you must manually set in the patterns settings the pattern definition for Java, if it does not already exist. Otherwise, a warning message will display prompting you to set the definition of the Java regular expression.

Prerequisite(s): An analysis of a set of columns is open in the analysis editor in the Profiling perspective of the studio. For more information, see How to define the set of columns to be analyzed.

  1. Click the icon next to each of the columns you want to validate against a specific pattern.

    The [Pattern Selector] dialog box is displayed.

    Note

    You can add only regular expressions to the analyzed columns.

    You can drop the regular expression directly from the Patterns folder in the DQ Repository tree view directly to the column name in the column analysis editor.

    Warning

    If no Java expression exists for the pattern you want to add, a warning message will display prompting you to add the pattern definition for Java. Click Yes to open the pattern editor and add the Java regular expression, then proceed to add the pattern to the analyzed columns.

    In this example, you want to add a corresponding pattern to each of the analyzed columns to validate data in these columns against the selected patterns. The result chart will show the percentage of the matching/non-matching values, the values that respect the totality of the used patterns.

  2. In the [Pattern Selector] dialog box, expand Patterns and browse to the regular expression you want to add to the selected column.

  3. Select the check box(es) of the expression(s) you want to add to the selected column, then Click OK.

    The added regular expression(s) display(s) under the analyzed column(s) in the Analyzed Columns view and the All Match indicator is displayed in the Indicators list in the Indicators view.

How to finalize and execute the analysis of a set of columns

What is left before executing this set of columns analysis is to define the indicator settings, data filter and analysis parameters.

Prerequisite(s): A column set analysis has already been defined in the Profiling perspective of the studio. For further information, see How to define the set of columns to be analyzed and How to add patterns to the analyzed columns.

  1. Click Indicators in the analysis editor to open the corresponding view.

    The indicators representing the simple statistics are by-default attached to this type of analysis. For further information about the indicators for simple statistics, see section Simple statistics.

  2. Click the option icon to open a dialog box where you can set options for each indicator. For more information about indicators management, see Indicators.

  3. In the Data Filter view, enter an SQL WHERE clause to filter the data on which to run the analysis, if required.

  4. In the Analysis Parameters view:

    • In the Number of connections per analysis field, set the number of concurrent connections allowed per analysis to the selected database connection.

      You can set this number according to the database available resources, that is the number of concurrent connections each database can support.

    • From the Execution engine list, select the engine, Java or SQL, you want to use to execute the analysis.

      • If you select the Java engine, the Store data check box is selected by default and cannot be unselected. Once the analysis is executed, the profiling results are always available locally to drill down through the Analysis Results > Data view. For further information, see How to filter data against patterns.

        Executing the analysis with the Java engine uses disk space as all data is retrieved and stored locally. If you want to free up some space, you may delete the data stored in the main studio directory, at Talend-Studio>workspace>project_name>Work_MapDB.

      • If you select the SQL engine, you can use the Store data check box to decide whether to store locally the analyzed data and access it in the Analysis Results > Data view.

        Note

        If the data you are analyzing is very big, it is advisable to leave the Store data check box unselected in order not to store the results at the end of the analysis computation.

  5. If you have defined context variables in the Contexts view in the analysis editor:

    • use the Data Filter and Analysis Parameter views to set/select context variables to filter data and to decide the number of concurrent connections per analysis respectively.

    • In the Context Group Settings view, select from the list the context environment you want to use to run the analysis.

    For further information about contexts and variables, see Using context variables in analyses.

  6. Save the analysis and press F6 to execute it.

    The analysis editor switches to the Analysis Results view where you can read the analysis results in tables and graphics. The graphical result provides the simple statistics on the full records of the analyzed column set and not on the values within each column separately.

    When you use patterns to match the content of the set of columns, another graphic is displayed to illustrate the match and non-match results against the totality of the used patterns.

  7. In the Simple Statistics table, right-click an indicator result and select View Rows or View Values.

    • If you run the analysis with the Java engine, a list of the analyzed data is opened in the Profiling perspective.

    • If you run the analysis with the SQL engine, a list of the analyzed data is opened in the Data Explorer perspective.

  8. In the All Match table, right-click the result row and select Generate an ETL job to handle rows.

    A ready-to-use Job is generated and opened in the Integration perspective. This Job will group the valid/invalid rows and write them in two separate files. For further information, see Recuperating matching and non-matching rows.

    Note

    The All Match table is available only when you run the analysis with the Java engine.

  9. In the Data view, click Filter Data to filter the valid/invalid data according to the used patterns.

    You can filter data only when you run the analysis with the Java engine. For further information, see How to filter data against patterns.

How to filter data against patterns

After analyzing a set of columns against a group of patterns and having the results of the rows that match or do not match "all" the patterns, you can filter the valid/invalid data according to the used patterns.

Prerequisite(s): An analysis of a set of columns is open in the analysis editor in the Profiling perspective of the studio. You have used the Java engine to execute the analysis.

For more information, see How to define the set of columns to be analyzed and How to add patterns to the analyzed columns.

To filter data resulted from the analysis of a set of columns, do the following:

  1. In the analysis editor, click the Analysis Results tab at the bottom of the editor to open the detailed result view.

  2. Click Data to open the corresponding view.

    A table lists the actual analyzed data in the analyzed columns.

  3. Click Filter Data on top of the table.

    A dialog box is displayed listing all the patterns used in the column set analysis.

  4. Select the check box(es) of the pattern(s) according to which you want to filter data.

  5. Select a display option as the following:

    Select

    To...

    All data

    show all analyzed data.

    matches

    show only the data that matches the selected pattern.

    non-matches

    show the data that does not match the selected pattern(s).

  6. Click Finish to close the dialog box.

    In this example, data is filtered against the Email Address pattern, and only the data that does not match is displayed.

    All email addresses that do not match the selected pattern appear in red. Any data row that has a missing value appear with a red background.

    The Previous and Next buttons under the table helps you to navigate back and forth through pages.

    Numbered buttons are displayed under the table to access pages directly:

    • when you open the Data view for the first time after running the analysis,

    • if you did not select a pattern in the Filter Data dialog box, or

    • if you selected All data as the display option in the Filter Data dialog box.

How to recuperate matching and/or non-matching rows

When you add patterns to the analysis of a set of columns, the result chart will show the percentage of the values in all the columns that match all used patterns and not only one of them. After the execution of the analysis, you can generate ready-to-use Jobs that will recuperate the matching/non-matching rows and write them in output files or databases.

For further information on how to generate such a Job, see Recuperating matching and non-matching rows

How to create a column analysis from a simple table analysis

You can create a column analysis on one or more columns defined in a simple table analysis (column set analysis).

Prerequisite(s): A simple table analysis is defined in the analysis editor in the Profiling perspective of the studio.

  1. Open the analysis.

  2. In the Analyzed Columns view, right-click the column(s) you want to create a column analysis on.

  3. Select Column analysis from the contextual menu.

    The [New Analysis] wizard opens.

  4. In the Name field, enter a name for the new column analysis and then click Next to proceed to the next step.

    The analysis editor opens with the defined metadata and a folder for the newly created analysis is listed under the Analyses folder in the DQ Repository tree view.

  5. Follow the steps outlined in Creating a basic analysis on a database column to continue creating the column analysis.