Creating a time correlation analysis

Talend Platform for Enterprise Integration Studio User Guide

Version: 5.6
Product: Talend Platform for Enterprise Integration
Task: Design and Development; Data Quality and Preparation
Platform: Talend Studio

In the example below, you create a time correlation analysis to compute the minimal and maximal birth dates for each country listed in the selected nominal column. Two columns are used for the analysis: birthdate and country.
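To make the computation concrete, the sketch below reproduces the same idea outside the studio: one minimal and one maximal birth date per country. It is only an illustration. The `customer` table, the sample rows and the SQL statement are assumptions made for this example and do not show the query the studio actually generates.

```python
import sqlite3

# Hypothetical sample data; the table name and the rows are assumptions.
rows = [
    ("Mexico", "1910-03-14"),
    ("Mexico", "2000-11-02"),
    ("Mexico", None),          # a record with no birth date
    ("France", "1955-06-30"),
    ("France", "1987-01-09"),
]

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customer (country TEXT, birthdate TEXT)")
conn.executemany("INSERT INTO customer VALUES (?, ?)", rows)

# MIN and MAX skip NULL values; ISO 8601 date strings sort chronologically.
query = """
    SELECT country, MIN(birthdate), MAX(birthdate)
    FROM customer
    GROUP BY country
    ORDER BY country
"""
date_ranges = conn.execute(query).fetchall()

for country, earliest, latest in date_ranges:
    print(country, earliest, latest)
```

Each returned row corresponds to one range bar in the chart produced by the analysis: the nominal value (country) and the two ends of its date range.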

Note

The time correlation analysis is currently possible only on database columns. You cannot use this analysis on file connections.

Prerequisite(s): At least one database connection is set in the Profiling perspective of the studio. For further information, see Connecting to a database.

Defining the analysis

  1. In the DQ Repository tree view, expand the Data Profiling folder.

  2. Right-click the Analyses folder and select New Analysis.

    The [Create New Analysis] wizard opens.

  3. Start typing time in the filter field, select Time Correlation Analysis and click Next.

  4. In the Name field, enter a name for the current analysis.

    Note

    Avoid using special characters in the item names including:

    "~", "!", "`", "#", "^", "&", "*", "\\", "/", "?", ":", ";", "\"", ".", "(", ")", "'", "¥", "‘", "”", "«", "»", "<", ">".

    These characters are all replaced with "_" in the file system and you may end up creating duplicate items.

  5. Set the analysis metadata (purpose, description and author name) in the corresponding fields and then click Finish.

    A folder for the newly created analysis is listed under Analyses in the DQ Repository tree view, and the analysis editor opens on the analysis metadata.
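The note in step 4 explains that special characters are replaced with "_" in the file system. The sketch below shows why this can produce duplicate items: several distinct names collapse to the same file name. The `to_file_name` function and the sample names are hypothetical, and the character set covers only part of the list in the note; the studio's own replacement logic is not shown here.

```python
# A subset of the special characters listed in the note (assumption:
# each one is replaced individually with an underscore).
SPECIAL_CHARS = set("~!`#^&*\\/?:;\".()'<>")


def to_file_name(item_name: str) -> str:
    """Replace every special character in item_name with '_'."""
    return "".join("_" if ch in SPECIAL_CHARS else ch for ch in item_name)


# Three different analysis names chosen for illustration.
names = ["birth.dates", "birth:dates", "birth/dates"]
file_names = [to_file_name(name) for name in names]

for name, file_name in zip(names, file_names):
    print(name, "->", file_name)
```

All three names map to `birth_dates`, so they would collide once stored. Using only letters, digits and underscores in item names avoids the problem.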

Selecting the columns you want to analyze and setting analysis parameters

  1. In the analysis editor and from the Connection list, select the database connection on which to run the analysis.

    You can change your database connection by selecting another connection from the Connection list. If the analyzed columns do not exist in the new database connection, a warning message lets you continue or cancel the operation.

  2. Click Select columns to analyze to open the [Column Selection] dialog box and select the columns, or drag them directly from the DQ Repository tree view into the Analyzed Columns view.

    If you right-click any of the listed columns in the Analyzed Columns view and select Show in DQ Repository view, the selected column will be automatically located under the corresponding connection in the tree view.

  3. If required, click in the Indicators view to open a dialog box where you can set thresholds for each indicator.

    The indicators representing the simple statistics are attached by default to this type of analysis.

  4. In the Data Filter view, enter an SQL WHERE clause to filter the data on which to run the analysis, if required.

  5. In the Analysis Parameter view, in the Number of connections per analysis field, set the number of concurrent connections allowed per analysis to the selected database connection, if required.

    You can set this number according to the resources available on the database, that is, the number of concurrent connections the database can support.

  6. If you have defined context variables in the analysis editor:

    • Use the Data Filter and Analysis Parameter views to set or select context variables that filter the data and set the number of concurrent connections per analysis, respectively.

    • In the Context Group Settings view, select from the list the context environment you want to use to run the analysis.

    For further information about contexts and variables, see Using context variables in analyses.

  7. Click the save icon on top of the editor and press F6 to execute the time correlation analysis.

    The graphical result is displayed in the Graphics panel to the right.

    This Gantt chart displays a range showing the minimal and maximal birth dates for each country listed in the selected nominal column. It also highlights the range bars that contain null values for birth dates.

    For example, in this chart, the minimal birth date for Mexico is 1910 and the maximal is 2000. Of all the data records where the country is Mexico, 41 have a null birth date.
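The effect of the Data Filter view (step 4) and of the null highlighting can be sketched in the same way. The example below counts null birth dates per country and applies an optional WHERE clause before aggregating. The `summarize` helper, the `customer` table and the sample rows are assumptions for illustration only; they are not the studio's implementation.

```python
import sqlite3


def summarize(conn, where_clause="1=1"):
    """Return (country, min date, max date, null count) per country.

    where_clause is pasted into the statement as-is, which is acceptable
    for this illustration but not for untrusted input.
    """
    query = f"""
        SELECT country,
               MIN(birthdate),
               MAX(birthdate),
               SUM(CASE WHEN birthdate IS NULL THEN 1 ELSE 0 END)
        FROM customer
        WHERE {where_clause}
        GROUP BY country
        ORDER BY country
    """
    return conn.execute(query).fetchall()


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customer (country TEXT, birthdate TEXT)")
conn.executemany("INSERT INTO customer VALUES (?, ?)", [
    ("Mexico", "1910-03-14"),
    ("Mexico", "2000-11-02"),
    ("Mexico", None),
    ("France", "1955-06-30"),
    ("France", "1987-01-09"),
])

unfiltered = summarize(conn)
filtered = summarize(conn, "birthdate >= '1950-01-01'")
print(unfiltered)
print(filtered)
```

In this sketch, a filter on the date column also removes the rows with a null date, because a comparison with NULL is never true in SQL. Keep this in mind when a filter is expected to preserve the null count.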

From the generated graphic, you can:

  • place the pointer on any of the range bars to display the correlated data values at that position,

  • place the pointer on a specific birth date and drag it to another birth date to change the chart and show only the minimal and maximal birth dates in your selection,

  • right-click any of the range bars and select:

    Option                 To...

    Show in full screen    open the generated graphic in full screen

    View rows              access a list of all analyzed rows in the selected nominal column

The figure below illustrates an example of the SQL editor listing the correlated data values at the selected range bar.

Note

From the SQL editor, you can save the executed query by clicking the save icon on the editor toolbar. The query is then listed under the Libraries > Source Files folders in the DQ Repository tree view. For more information, see Saving the queries executed on indicators.

For more information on the Gantt chart, see Accessing the detailed view of the analysis results.