Open Studio for Data Quality
Before you begin
- The column analysis is open in the analysis editor.
- You have set system or predefined indicators for the column analysis.
- You have installed in the Studio the SQL explorer libraries that are required for data quality.
- In the Data Filter view, enter an SQL WHERE clause to filter the data on which to run the analysis, if required.
In the Analysis Parameters View, do the following:
In the Number of connections per analysis field,
set the number of concurrent connections allowed per analysis to the
selected database connection.
You can set this number according to the database available resources, that is the number of concurrent connections each database can support.Note: Connection concurrency is not supported when using a connection to a SQLite database or a Hive database on Spark. Connection concurrency is supported when using a connection to a Hive2 server.
From the Execution engine list, select the engine,
Java, or SQL, you want to use to execute the analysis.
If you select the Java engine:
- Select the Allow drill down check box to be able to drill down the results of all indicators in the Analysis Results view. It does not apply to the Row Count indicator.
- In the Max number of rows kept per indicator field, set the number of the data rows you want to drill down.
- In the Number of connections per analysis field, set the number of concurrent connections allowed per analysis to the selected database connection.
If you have defined context variables in the Contexts view in the analysis editor, do the following:
- use the Data Filter and Analysis Parameter views to set/select context variables to filter data and to decide the number of concurrent connections per analysis respectively.
- In the Context Group Settings view, select from the list the context environment you want to use to run the analysis.
Save the analysis and press F6 to execute it.
The editor switches to the Analysis Results view.When you use the SQL engine, the analysis runs multiple indicators in parallel and results are refreshed in the charts while the analysis is still in progress.Below are the graphics representing the Frequency and Text Statistics for the fullname column.For further information about the Frequency and Text Statistics, see Advanced statistics and Text statistics respectively.Below are the graphics representing the Pattern Frequency and Pattern Low Frequency statistics for the email column.The patterns in the table use a and A to represent the email values. Each pattern can have till 30 characters. If the total number of characters exceeds 30, the pattern is represented as the following: aaaaaAAAAAaaaaaAAAAAaaaaaAAAAA...<total number of characters>, and you can place your pointer on the pattern in the table to get the original value.For further information about these indicators, see Pattern frequency statistics.Below are the graphics representing the Summary Statistics for the total_sales column.For further information about these indicators, see Summary statistics.And below are the graphics representing the order of magnitude and the Benford's law statistics for the total_sales column.For further information about the Benford's law statistics usually used as an indicator of accounting and expenses fraud in lists or tables, see Fraud Detection.
When you select the Java engine, the system will look for Java regular expressions first, if none is found, it looks for SQL regular expressions.
If you execute this analysis using the SQL engine, you can view the executed query for each of the attached indicators if you right-click an indicator and then select the View executed query option from the list. However, when you use the Java engine, SQL queries will not be accessible and thus clicking this option will open a warning message.
For more information on the Java engine, see Using the Java or the SQL engine.