Creating a column set analysis on a delimited file using patterns - 6.4

Talend Big Data Platform Studio User Guide

EnrichVersion
6.4
EnrichProdName
Talend Big Data Platform
task
Data Quality and Preparation
Design and Development
EnrichPlatform
Talend Studio

This type of analysis provide simple statistics on the number of records falling in certain categories, including the number of rows, the number of null values, the number of distinct and unique values, the number of duplicates, or the number of blank fields. For more information about these indicators, see Simple statistics.

It is also possible to add patterns to this type of analysis and have a single-bar result chart that shows the number of the rows that match "all" the patterns.

How to define the set of columns to be analyzed in a delimited file

You can analyze the content of a set of columns in a delimited file. This set can represent only some of the columns in the defined table or the table as a whole.

You can then execute the created analysis using the Java engine.

Prerequisite(s): At least one connection to a delimited file is set in the Profiling perspective of the studio. For further information, see Connecting to a database.

Warning

When carrying out this type of analysis, the set of columns to be analyzed must not include a primary key column.

To define the set of columns to analyzed, do the following:

  1. In the DQ Repository tree view, expand the Data Profiling folder.

  2. Right-click the Analyses folder and select New Analysis.

    The [Create New Analysis] wizard opens.

  3. In the filter field, start typing Column Set Analysis, select Column Set Analysis and then click Next.

  4. In the Name field, enter a name for the current analysis.

    Note

    Avoid using special characters in the item names including:

    "~", "!", "`", "#", "^", "&", "*", "\\", "/", "?", ":", ";", "\"", ".", "(", ")", "'", "¥", "'", """, "«", "»", "<", ">".

    These characters are all replaced with "_" in the file system and you may end up creating duplicate items.

  5. If required, set column analysis metadata (purpose, description and author name) in the corresponding fields and click Next to proceed to the next step.

  6. Expand the FileDelimited connection and browse to the set of columns you want to analyze.

  7. Select the columns to be analyzed, and then click Finish to close this [New analysis] wizard.

    The analysis editor opens with the defined analysis metadata, and a folder for the newly created analysis is displayed under Analysis in the DQ Repository tree view.

  8. If required, select another connection from the Connection box in the Analyzed Columns view. This box lists all the connections created in the Studio with the corresponding database names.

    By default, the delimited file connection you have selected in the previous step is displayed in the Connection box.

  9. If required, click the Select columns to analyze link to open a dialog box where you can modify your column selection.

    Note

    You can filter the table or column lists by typing the desired text in the Table filter or Column filter fields respectively. The lists will show only the tables/columns that correspond to the text you type in.

  10. In the column list, select the check boxes of the column(s) you want to analyze and click OK to proceed to the next step.

    In this example, you want to analyze a set of six columns in the delimited file: account number (account_num), education (education), email (email), first name (fname), last name (lname) and gender (gender). You want to identify the number of rows, the number of distinct and unique values and the number of duplicates.

  11. Use the delete, move up or move down buttons to manage the analyzed columns.

If you right-click any of the listed columns in the Analyzed Columns view and select Show in DQ Repository view, the selected column will be automatically located under the corresponding connection in the tree view.

How to add patterns to the analyzed columns in the delimited file

Now, you can add patterns to one or more of the analyzed columns to validate the full record (all columns) against all the patterns, and not to validate each column against a specific pattern as it is the case with the column analysis. The results chart is a single bar chart for the totality of the used patterns. This chart shows the number of the rows that match "all" the patterns.

Warning

Before being able to use a specific pattern with a set of columns analysis, you must manually set in the patterns settings the pattern definition for Java, if it does not already exist. Otherwise, a warning message will display prompting you to set the definition of the Java regular expression.

Prerequisite(s): An analysis of a set of columns is open in the analysis editor in the Profiling perspective of the studio. For more information, see How to define the set of columns to be analyzed.

To add patterns to the analysis of a set of columns, do the following:

  1. Click the icon next to each of the columns you want to validate against a specific pattern.

    The [Pattern Selector] dialog box is displayed.

    Note

    You can add only regular expressions to the analyzed columns.

    You can drop the regular expression directly from the Patterns folder in the DQ Repository tree view directly to the column name in the column analysis editor.

    Warning

    If no Java expression exists for the pattern you want to add, a warning message will display prompting you to add the pattern definition for Java. Click Yes to open the pattern editor and add the Java regular expression, then proceed to add the pattern to the analyzed columns.

    In this example, you want to add a corresponding pattern to each of the analyzed columns to validate data in these columns against the selected patterns. The result chart will show the percentage of the matching/non-matching values, the values that respect the totality of the used patterns.

  2. In the [Pattern Selector] dialog box, expand Patterns and browse to the regular expression you want to add to the selected column.

  3. Select the check box(es) of the expression(s) you want to add to the selected column.

  4. Click OK to proceed to the next step.

    The added regular expression(s) display(s) under the analyzed column(s) in the Analyzed Columns view and the All Match indicator is displayed in the Indicators list in the Indicators view.

How to finalize and execute the column set analysis on a delimited file

What is left before executing this set of columns analysis is to define indicators, data filter and analysis parameters.

Prerequisite(s):A column set analysis is defined in the Profiling perspective of the studio. For further information, see How to define the set of columns to be analyzed in a delimited file and How to add patterns to the analyzed columns in the delimited file.

  1. Click Indicators in the analysis editor to open the corresponding view.

    Note

    The indicators representing the simple statistics are by-default attached to this type of analysis. For further information about the indicators for simple statistics, see section Simple statistics.

  2. If required, click the option icon to open a dialog box where you can set options for each indicator. For more information about indicators management, see Indicators.

  3. If required, click Data Filter in the analysis editor to display its view and filter data through SQL "WHERE" clauses.

  4. In the Analysis Parameters view, select the Allow drill down check box to store locally the data that will be analyzed by the current analysis.

  5. In the Max number of rows kept per indicator field enter the number of the data rows you want to make accessible.

    Note

    The Allow drill down check box is selected by default, and the maximum analyzed data rows to be shown per indicator is set to 50.

  6. Press F6 to execute the analysis.

    The editor switches to the Analysis Results view and displays the graphical result corresponding to the Simple Statistics indicators used to analyze the defined set of columns.

    When you use patterns to match the content of the columns to be analyzed, another graphic is displayed to illustrates the match results against the totality of the used patterns.

How to filter analysis data against patterns

The procedure to filter the data of the analysis of a delimited file is the same as that for the database analysis. For further information, see How to filter data against patterns.