Defining the columns to be analyzed and setting indicators - 6.2

Talend Data Services Platform Studio User Guide

EnrichVersion
6.2
EnrichProdName
Talend Data Services Platform
task
Data Quality and Preparation
Design and Development
EnrichPlatform
Talend Studio

How to define the columns to be analyzed

The first step in analyzing the content of one or multiple columns is to define the column(s) to be analyzed. The analysis results provides statistics about the values within each column.

Prerequisite(s): At least one database connection is set in the Profiling perspective of the studio. For further information, see Connecting to a database.

To analyze one or more columns, do the following:

Defining the analysis

  1. In the DQ Repository tree view, expand Data Profiling.

  2. Right-click the Analysis folder and select New Analysis.

    The [Create New Analysis] wizard opens.

  3. In the filter field, start typing basic column analysis, select Basic Column Analysis and click Next.

  4. In the Name field, enter a name for the current column analysis.

    Note

    Avoid using special characters in the item names including:

    "~", "!", "`", "#", "^", "&", "*", "\\", "/", "?", ":", ";", "\"", ".", "(", ")", "'", "¥", "'", """, "«", "»", "<", ">".

    These characters are all replaced with "_" in the file system and you may end up creating duplicate items.

  5. Set column analysis metadata (purpose, description and author name) in the corresponding fields and click Next to proceed to the next step.

Selecting the database columns and setting sample data

  1. Expand DB connections and in the desired database, browse to the columns you want to analyze.

    Note

    When profiling a DB2 database, if double quotes exist in the column names of a table, the double quotation marks cannot be retrieved when retrieving the column. Therefore, it is recommended not to use double quotes in column names in a DB2 database table.

  2. Select the columns and then click Finish to close the wizard.

    A file for the newly created column analysis is listed under the Analysis node in the DQ Repository tree view, and the analysis editor opens with the analysis metadata.

    This example analyzes full names, email addresses and sales figures.

  3. In the Data preview view, click Refresh Data.

    The data in the selected columns is displayed in the table.

  4. In the Data preview view, select:

    Option

    To...

    New Connection

    open a wizard and create a connection to the data source from within the editor.

    For further information about how to create a connection to data sources, see Connecting to a database and Connecting to a file.

    The Connection field on top of this section lists all the connections created in the Studio.

    Select Columns

    open the Column Selection dialog box where you can select the columns to analyze or change the selection of the columns listed in the table.

    From the open dialog box, you can filter the table or column lists by using the Table filter or Column filter fields respectively.

    Select Indicators

    open the [Indicator Selection] dialog box where you can select the indicators to use for profiling columns.

    For further information, see How to set indicators on columns.

    n first rows

    or

    n random rows

    list in the table N first data records from the selected columns or list N random records from the selected columns.

    Refresh Data

    display the data in the selected columns according to the criteria you set.

  5. In the Limit field, set the number for the data records you want to display in the table and use as sample data.

  6. In the Analyzed Columns view, use the arrows in the top right corner to open different pages in the view if you analyze large number of columns.

    You can drag the columns to be analyzed directly from the DQ Repository tree view to the Analyzed Columns list in this view.

    If one of the columns you want to analyze is a primary or a foreign key, its data mining type becomes automatically Nominal when you list it in the Analyzed Columns view. For more information on data mining types, see Data mining types.

  7. If required, right-click any of the listed columns in the Analyzed Columns view and select Show in DQ Repository view to locate it in the database connection in the DQ Repository tree view.

When you select to analyze Date columns and run the analysis with the Java engine, the date information is stored in the studio and in the datamart as regular date/time of format YYYY-MM-DD HH:mm:ss.SSS for date/timestamp and of format HH:mm:ss.SSS for time. The date and time formats are slightly different when you run the analysis with the SQL engine.

How to set indicators on columns

The second step after defining the columns to be analyzed is to set either system or user-defined indicators for each of the defined columns.

How to set system or user-defined indicators

Prerequisite(s): A column analysis is open in the analysis editor in the Profiling perspective of the studio. For more information, see How to define the columns to be analyzed.

To set system indicators for the columns to be analyzed, do the following:

  1. From the Data preview view in the analysis editor, click Select indicators to open the [Indicator Selection] dialog box.

  2. From the [Indicator Selection] dialog box:

    • In the Data preview section, place the cursor on a row to display the complete data. This section lists the sample data you define in the analysis editor.

    • Click in the cells next to indicators names to set indicator parameters for the analyzed columns as needed. You can assign system or user-defined indicators to the columns.

    • Select the Hide non applicable indicators check box to hide the system and user-defined indicators that are not compatible with the engine you select to execute the analysis.

    • If required, change the order of columns by dropping them with the cursor.

      The order of the columns will be changed accordingly in the analysis editor.

    Note

    - It is not very useful to use Pattern Frequency Statistics on a column of a Date type in databases when executing the analysis with the SQL engine. No data quality issues are returned by this indicator as all dates will be displayed using one single format. For further information, check the article Why do we get 99-AAA-99 when profiling Date columns in Oracle.

    - If you attach the Date Pattern Frequency to a date column in your analysis, you can generate a date regular expression from the analysis results. For more information, see How to generate a regular expression from the Date Pattern Frequency.

  3. Click OK.

    Indicators are accordingly attached to the analyzed columns in the Analyzed Columns view.

    The analysis in this example provides/computes the following:

    • simple statistics on all columns. For further information about these indicators, see Simple statistics,

    • the characteristics of textual fields and the number of most frequent values for each distinct record in the fullname column . For further information, see Text statistics and Advanced statistics respectively,

    • patterns in the email column to show frequent and rare patterns so that you can identify quality issues more easily. For further information about these indicators, see Pattern frequency statistics,

    • the range, the inter quartile range and the mean and median values of the numeric data in the total_sales column. For further information about these indicators, see Summary statistics.

    • the frequency of the digits 1 through 9 in the sales figures to detect fraud. For further information, see Fraud Detection.

How to set options for system or user-defined indicators

You can define expected thresholds on the indicator's value. The threshold you define is used for measuring the quality of data. If the indicator's value is outside the defined threshold, then the data is of bad quality. You can define only one threshold or no threshold at all. You may set these thresholds either by value or by percentage with respect to the row count.

Prerequisite(s): A column analysis is open in the analysis editor in the Profiling perspective of the studio. For more information, see How to define the columns to be analyzed.

For more information about setting indicators, see How to set system or user-defined indicators.

To set options for system or user-defined indicators, do the following:

  1. In the Analyzed Columns view in the analysis editor, click the option icon next to the indicator.

  2. In the open dialog box, set the parameters for the given indicator.

    For example, if you want to flag if there are null values in the column you analyze, you can set 0 in the Upper threshold field for the Null Count indicator.

    Indicators settings dialog boxes differ according to the parameters specific for each indicator. For more information about different indicator parameters, see Indicator parameters.

  3. Click Finish to close the dialog box.

  4. Save the analysis.

How to set user-defined indicators from the analysis editor

Prerequisite(s):

To set user-defined indicators from the analysis editor for the columns to be analyzed, do the following:

  1. Either:

    1. In the analysis editor and from the Analyzed Columns view, click the icon next to the column name to which you want to define the indicator.

      The [UDI selector] dialog box opens.

    2. Select the user-defined indicators and then click OK.

    Or:

    1. In the DQ Repository tree view, expand Libraries > Indicators.

    2. From the User Defined Indicator folder, drop the user-defined indicator(s) against which you want to analyze the column content to the column name in the Analyzed Columns view.

      The user-defined indicator is listed under the column name.

    3. If required, set a threshold for the user-defined indicator.

      For further information, see How to set options for system or user-defined indicators.

  2. Save the analysis.