Creating a pre-defined table analysis - 6.2

Talend Data Services Platform Studio User Guide

From the Studio, you can use the Semantic-aware approach to create table analyses preconfigured with the indicators and patterns that best suit your data.

Prerequisite(s):

  • You have installed Talend Log Server using the Installer.

  • You have created a connection to a data source in the Studio, whether it is a database, a delimited file or Hive.

Launching the server and setting preferences

  1. Launch the Elasticsearch server installed by the Installer and stored in the logserv folder in the root directory.

  2. On the menu bar of the Studio, select Window > Preferences to display the [Preferences] window.

  3. Start typing Semantic in the filter field.

    The Semantic-aware Analysis view is displayed.

    The connection information to the semantic repository on the server is set by default depending on your installation.

    If you change the port or cluster name, you must update them in this view.

  4. Click Check Connection to verify that the connection is successful, then click OK.

    An error message is displayed if the connection information to the log server is not correctly set or if the log server is not up and running.
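
If Check Connection fails, it can help to confirm from outside the Studio that the server is listening at all. The following is a minimal sketch, not part of the product: the helper name is_port_open is hypothetical, and the host and port in the usage comment are assumptions (9300 is the default Elasticsearch transport port). Use the values shown in the Semantic-aware Analysis view for your installation.

```python
import socket


def is_port_open(host: str, port: int, timeout: float = 2.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        # create_connection raises OSError (refused, unreachable, timed out)
        # when nothing is listening at the given address.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


# Example usage (assumed values, adjust to your installation):
#   is_port_open("localhost", 9300)
```

A successful TCP connection only shows that something is listening on the port; it does not check the cluster name, which the Studio also needs.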

Exploring semantic categories of data columns

The example below uses a database table which holds customer information.

  1. In the DQ Repository tree view, expand Metadata and browse to the table you want to analyze.

  2. Right-click the table and select Semantic-aware Analysis, or right-click a set of columns in the table and select Semantic-aware Analysis.

    The semantic wizard opens and lists either all the columns of the table or only the selected columns, depending on whether you started the analysis on a table or on a set of columns. The Category line in the wizard shows the semantic categories assigned to the matched columns.

  3. In the Sampling Options section, select or click:

    • First N Rows to list in the data preview the first N data records from the selected columns. You set the number of records in the Number of rows field.

    • Reservoir Sampling to list in the data preview N random records from the selected columns. You set the number of records in the Number of rows field.

    • Threshold for category discovery to decide the minimum threshold for the matches to show in the Category lists of the analyzed columns. This threshold filters out the less probable categories of the analyzed columns.

    • Refresh to refresh the data preview after any change in the configuration.

  4. From the Category field of each of the matched columns, either:

    • select a category of data from the Category list that best suits the column, or

    • enter a meaningful name for the column that best represents its content.

      To do this, click in the field twice, type the name and press Enter on your keyboard to save the changes. The names you enter are displayed in a different color. This step stores the categories and the semantic names of the columns locally. If no semantic names are found, categories are stored anyway.

    This is not mandatory but will help you better match table metadata with the concepts stored in the Ontology repository on the log server.

    The percentages of the proposed categories are calculated by analyzing the data in the columns using the following methods: regex, data dictionary and keyword dictionary. The dictionary indexes and regex categories are embedded in the Studio and are used to decide which category the data falls into. For further information about dictionary indexes and regex categories, see the Knowledge Base article Indexes and regex categories used in the Semantic-aware analysis.

  5. Click Next to open a page in the wizard where you can see the results of matching column metadata and semantic concepts with the concepts in the Ontology repository.
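
The sampling options and the category threshold described above can be pictured with a short sketch. It is an illustration under stated assumptions, not the Studio's implementation: the function names, the two regex patterns and the category names are all hypothetical, and only the regex method is shown (the dictionary methods are left out). Reservoir sampling is implemented here with the classic Algorithm R.

```python
import itertools
import random
import re

# Hypothetical categories and patterns, for illustration only; they are not
# the regex categories embedded in the Studio.
CATEGORY_PATTERNS = {
    "EMAIL": re.compile(r"[^@\s]+@[^@\s]+\.[^@\s]+"),
    "US_ZIP": re.compile(r"\d{5}(-\d{4})?"),
}


def first_n_rows(rows, n):
    """Return the first n records of an iterable."""
    return list(itertools.islice(rows, n))


def reservoir_sample(rows, n, rng=random):
    """Algorithm R: uniform random sample of up to n records from a stream."""
    sample = []
    for i, row in enumerate(rows):
        if i < n:
            sample.append(row)       # fill the reservoir first
        else:
            j = rng.randint(0, i)    # inclusive bounds
            if j < n:
                sample[j] = row      # replace with probability n / (i + 1)
    return sample


def category_percentages(values, threshold=0.0):
    """Percentage of values fully matching each pattern, dropping categories
    that score below the threshold (or that match nothing)."""
    values = list(values)
    if not values:
        return {}
    scores = {}
    for name, pattern in CATEGORY_PATTERNS.items():
        hits = sum(1 for v in values if pattern.fullmatch(v))
        pct = 100.0 * hits / len(values)
        if hits and pct >= threshold:
            scores[name] = pct
    return scores
```

For example, with the values "a@b.com", "c@d.org", "12345" and "n/a", the sketch scores EMAIL at 50% and US_ZIP at 25%, so a threshold of 30 keeps only EMAIL in the proposed list.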

Matching column metadata and semantic categories with the concepts in the Ontology repository

After exploring the semantic categories of data as outlined in Exploring semantic categories of data columns, the wizard opens on a chart which represents the results of matching column metadata and new semantic concepts with concepts from the Ontology repository.

A new Semantics line is added to the table. This line corresponds to the attributes found in the Ontology repository as a result of the match operation.

The most relevant concepts are selected by default and all columns associated with the concept are highlighted in the table.

  1. If required, select another concept in the chart.

    The generated analysis will be based on this selection.

  2. Click Next to open a new page in the wizard where you can configure what to enrich the Ontology repository with.

Enriching the Ontology repository

This page of the wizard shows the selected concept and its related attributes. Also, a new line is added to the table: Enrich Action.

Everything you define on the columns in this page is used to enrich the Ontology repository on the log server.

  1. From the Semantic lists for each column, select a new attribute.

    Defining concepts and attributes for columns is important for the choice of indicators to be used on the columns, even if you do not enrich the Ontology repository with them.

  2. From the Action lists for each column, select whether and how you want to add the new attributes to the repository on the log server.

    Concepts in the repository will be enriched with synonyms and new attributes.

    The semantic list may differ from one column to another depending on the content of the Category and Semantics fields.

  3. Click Next and, in the new window, check the Validated status column to make sure the actions you want to perform on the Ontology repository are valid.

    The status is represented as follows:

    • When the concept to add can be matched with concepts from the Ontology but does not already exist in the Ontology repository, the status is valid.

    • When the concept to add cannot be matched with concepts from the Ontology, the status is invalid with a warning icon.

    • When the concept to add can be matched with concepts from the Ontology and already exists in the Ontology repository, the status is invalid with a red warning icon.

    You can always change your selection in the previous window or clear the check box of the action you want to cancel.

    Click Run enrichment to enrich the Ontology repository with the selected attributes.

    The result view at the bottom of the wizard displays a message to confirm what has been added to the Ontology repository.

  4. Click:

    • Finish to create the table analysis with a default name.

    • Next to open a page in the wizard where you can set analysis metadata.
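
The three Validated status cases listed in step 3 reduce to a small decision function. The sketch below only restates those rules; the enum and function names are hypothetical and do not correspond to anything in the Studio.

```python
from enum import Enum


class Status(Enum):
    VALID = "valid"
    INVALID_NO_MATCH = "invalid (warning icon)"
    INVALID_DUPLICATE = "invalid (red warning icon)"


def validation_status(matches_ontology: bool, already_in_repository: bool) -> Status:
    """Map the two conditions described in the guide to a status."""
    if not matches_ontology:
        # The concept cannot be matched with concepts from the Ontology.
        return Status.INVALID_NO_MATCH
    if already_in_repository:
        # The concept matches and already exists in the repository.
        return Status.INVALID_DUPLICATE
    # The concept matches and is not yet in the repository.
    return Status.VALID
```

Only the last case is valid, which is why you may need to return to the previous window or clear the check box of an action before running the enrichment.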

Defining the recommended table analysis

In the analysis metadata page of the wizard:

  1. Set the analysis metadata (name, purpose and description) and click Finish.

    The analysis editor opens with the recommended indicators already assigned to the columns.

  2. If required, click Select Indicators to open a dialog box and modify the indicators assigned to the columns.

    You can also add patterns to the columns from this dialog box.

  3. Run the analysis.

    Analysis results are displayed in the Studio and are also registered in the Ontology repository on the log server.

    The Ontology repository is enriched with the information about what indicators are used on each type of column. It is also enriched with results such as the minimum and maximum values used on indicators and the thresholds used on patterns.

    Results such as min and max values are important to define a range on numeric columns. This range is updated in the Ontology repository according to the following rules:

    • If you do not define a threshold on the min or max indicator in the Studio, the attribute value in the Ontology repository is updated only when the indicator widens the range: the attribute min value is replaced when the min indicator value is lower, and the attribute max value is replaced when the max indicator value is greater.

    • If you set thresholds on the indicator in the Studio, the min and/or max threshold replaces the attribute min and/or max value in the Ontology repository every time the analysis is run.

    When you try to create a table analysis with similar columns, all the registered indicators and patterns are used on the columns by default.
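
The two range-update rules above can be restated as a short function. This is a sketch of one reading of those rules, in which the min and max bounds are handled independently; the function and parameter names are hypothetical.

```python
def update_range(attr_min, attr_max, ind_min, ind_max,
                 thr_min=None, thr_max=None):
    """Return the (min, max) range stored for an attribute after a run.

    attr_*: values currently stored in the repository
    ind_*:  min/max indicator values computed by the analysis
    thr_*:  thresholds set on the indicators, or None if not defined
    """
    if thr_min is not None:
        new_min = thr_min                  # threshold set: applied on every run
    else:
        new_min = min(attr_min, ind_min)   # no threshold: only widen downward

    if thr_max is not None:
        new_max = thr_max                  # threshold set: applied on every run
    else:
        new_max = max(attr_max, ind_max)   # no threshold: only widen upward

    return new_min, new_max
```

For example, with a stored range of 10 to 100 and indicator values of 5 and 120, the range becomes 5 to 120 when no thresholds are set, and 0 to 50 when thresholds of 0 and 50 are set.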

For further information about the content of the Ontology repository, see the Knowledge Base article Accessing semantic concepts stored in the Ontology repository.