Finalizing and executing the column analysis - 6.2

Talend Real-time Big Data Platform Studio User Guide

After defining the columns to be analyzed and setting indicators, you can filter the data to analyze and choose the engine used to execute the column analysis.

Prerequisite(s):

  • The column analysis is open in the analysis editor in the Profiling perspective of the studio. For more information, see How to define the columns to be analyzed.

  • You have set system or predefined indicators for the column analysis. For more information, see How to set indicators on columns.

  • You have installed in the studio the SQL explorer libraries that are required for data quality.

To finalize the column analysis defined in Defining the columns to be analyzed and setting indicators, do the following:

  1. In the Data Filter view, enter an SQL WHERE clause to filter the data on which to run the analysis, if required.

  2. In the Analysis Parameters view:

    • In the Number of connections per analysis field, set the number of concurrent connections allowed per analysis to the selected database connection.

      Set this number according to the resources available to the database, that is, the number of concurrent connections the database can support.

    • From the Execution engine list, select the engine you want to use to execute the analysis: Java or SQL.

      If you select the Java engine:

      • Select the Allow drill down check box to be able to drill down, in the Analysis Results view, into the results of all indicators except Row Count.

      • In the Max number of rows kept per indicator field, set the number of data rows you want to be able to drill down into.

    For further information about these engines, see Using the Java or the SQL engine.

  3. If you have defined context variables in the Contexts view in the analysis editor:

    • Use the Data Filter and Analysis Parameters views to set or select context variables that filter the data and set the number of concurrent connections per analysis, respectively.

    • In the Context Group Settings view, select from the list the context environment you want to use to run the analysis.

    For further information about contexts and variables, see Using context variables in analyses.

  4. Save the analysis and press F6 to execute it.

    The editor switches to the Analysis Results view.

    When you use the SQL engine, the analysis runs multiple indicators in parallel and results are refreshed in the charts while the analysis is still in progress.

    Below are the graphics representing the Frequency and Text Statistics for the fullname column.

    For further information about the Frequency and Text Statistics, see Advanced statistics and Text statistics respectively.

    Below are the graphics representing the Pattern Frequency and Pattern Low Frequency statistics for the email column.

    The patterns in the table use a and A to represent the email values. Each pattern can contain up to 30 characters. If the total number of characters exceeds 30, the pattern is displayed as follows: aaaaaAAAAAaaaaaAAAAAaaaaaAAAAA...<total number of characters>. Place your pointer on the pattern in the table to see the original value.

    For further information about these indicators, see Pattern frequency statistics.

    Below are the graphics representing the Summary Statistics for the total_sales column.

    For further information about these indicators, see Summary statistics.

    Below are the graphics representing the order of magnitude and the Benford's law statistics for the total_sales column.

    For further information about the Benford's law statistics, which are commonly used as an indicator of accounting and expense fraud in lists or tables, see Fraud Detection.
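The pattern notation described for the email column in step 4 can be pictured with a short Python sketch. This is an illustration only, not the studio's implementation: the function names are hypothetical, the replacement of digits with 9 is an assumption not stated on this page, and the truncated form simply follows the layout quoted above (the first 30 symbols, three dots, then the total number of characters).

```python
from collections import Counter

MAX_PATTERN_LENGTH = 30  # limit stated in the text above


def to_pattern(value: str) -> str:
    """Map each character of a value to a pattern symbol."""
    symbols = []
    for ch in value:
        if ch.isdigit():
            symbols.append("9")   # assumption: digits are shown as 9
        elif ch.isalpha() and ch.isupper():
            symbols.append("A")   # uppercase letter
        elif ch.isalpha():
            symbols.append("a")   # lowercase (or caseless) letter
        else:
            symbols.append(ch)    # other characters such as @ and . are kept
    return "".join(symbols)


def display_pattern(value: str) -> str:
    """Return the pattern, shortened when it exceeds 30 symbols."""
    pattern = to_pattern(value)
    if len(pattern) <= MAX_PATTERN_LENGTH:
        return pattern
    # Keep the first 30 symbols and append the total number of characters.
    return f"{pattern[:MAX_PATTERN_LENGTH]}...{len(pattern)}"


def pattern_frequencies(values):
    """Count how often each displayed pattern occurs in a column."""
    return Counter(display_pattern(v) for v in values)
```

For example, `pattern_frequencies(["ann@x.org", "bob@y.net", "Al@z.io"])` groups the first two values under the same pattern, which is the kind of grouping the Pattern Frequency table presents.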

If you execute this analysis using the Java engine with the Allow drill down check box selected in the Analysis Parameters view, the analyzed data is stored locally and you can access it in the Analysis Results > Data view. Use the Max number of rows kept per indicator field to set the number of data rows you want to make accessible.

When you select the Java engine, the system looks for Java regular expressions first. If none is found, it looks for SQL regular expressions.
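The lookup order described above can be sketched as follows. This is a minimal illustration under stated assumptions: the dictionary layout, the function name, and the two expressions are hypothetical placeholders, not the studio's internal data model or its predefined patterns.

```python
from typing import Optional

# Hypothetical pattern definition holding one expression per language.
# Both expressions are placeholders chosen for illustration.
email_pattern = {
    "Java": r"^[\w.-]+@[\w.-]+\.\w+$",
    "SQL": r"^[[:alnum:]._-]+@[[:alnum:]._-]+$",
}


def expression_for_java_engine(definitions: dict) -> Optional[str]:
    """Return the Java expression if defined, otherwise the SQL one."""
    for language in ("Java", "SQL"):      # Java first, SQL as fallback
        expression = definitions.get(language)
        if expression:
            return expression
    return None                           # no usable expression defined


# A pattern with only an SQL expression falls back to that expression.
sql_only_pattern = {"SQL": email_pattern["SQL"]}
chosen = expression_for_java_engine(sql_only_pattern)
```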

Note

If you connect to a database that is not supported in the studio (using the ODBC or JDBC methods), it is recommended to use the Java engine to execute the column analyses created on that database. For more information on the Java engine, see Using the Java or the SQL engine.

If you execute this analysis using the SQL engine, you can view the executed query for each of the attached indicators: right-click an indicator and select View executed query from the list. With the Java engine, however, SQL queries are not accessible, and selecting this option opens a warning message.
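To picture how a Data Filter WHERE clause can shape the query executed for an indicator, here is a minimal Python sketch that runs against an in-memory SQLite database. The query shape, the `customers` table name, the sample rows, and the `row_count_query` helper are all assumptions made for illustration. The SQL that the studio actually generates depends on the database and the indicator, and may differ.

```python
import sqlite3


def row_count_query(table: str, data_filter: str = "") -> str:
    """Build a row-count query, adding the filter as a WHERE clause if set."""
    query = f"SELECT COUNT(*) FROM {table}"
    if data_filter.strip():
        # The filter text is inserted as-is, so it must be trusted input.
        query += f" WHERE ({data_filter})"
    return query


# Hypothetical sample data using the column names from this section.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE customers (fullname TEXT, email TEXT, total_sales REAL)"
)
conn.executemany(
    "INSERT INTO customers VALUES (?, ?, ?)",
    [
        ("Ann Lee", "ann@x.org", 50.0),
        ("Bob Ray", "bob@y.net", 150.0),
        ("Al Kim", "Al@z.io", 300.0),
    ],
)

unfiltered = row_count_query("customers")
filtered = row_count_query("customers", "total_sales > 100")

total_rows = conn.execute(unfiltered).fetchone()[0]
filtered_rows = conn.execute(filtered).fetchone()[0]
conn.close()
```

Here the filter `total_sales > 100` restricts the count to the two rows above that threshold, while the unfiltered query counts all three.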

For more information on the Java and the SQL engines, see Using the Java or the SQL engine.