Creating a pre-defined table analysis - 6.1

Talend Real-time Big Data Platform Studio User Guide

From the Studio, you can use the Semantic Discovery feature to create table analyses preconfigured with the indicators and patterns that best suit your data.

Prerequisite(s):

  • You have installed Talend Log Server using the Installer.

  • You have created a connection to a data source in the Studio, whether it is a database, a delimited file or Hive.

Launching the server and setting preferences

  1. Launch the Elasticsearch server, which the Installer stores in the logserv folder in the root directory.

  2. On the menu bar of the Studio, select Window > Preferences to display the [Preferences] window.

  3. Start typing Semantic Discovery in the filter field.

    The Semantic Discovery view is displayed.

    The connection information to the semantic repository on the server is set by default depending on your installation.

    If you have modified the port or cluster name, you must update them in this view.

  4. Click Check Connection to verify that the connection is successful, then click OK.
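The same kind of check can be approximated outside the Studio. The sketch below is only an illustration and does not show how Check Connection works internally: it queries the standard Elasticsearch cluster health endpoint over HTTP and compares the reported cluster name with the one you expect. The host, port and cluster name passed to `check_connection` are assumptions that depend on your installation, and both function names are hypothetical.

```python
import json
from urllib.request import urlopen


def cluster_matches(health_json: str, expected_cluster: str) -> bool:
    """Return True if the health payload names the expected cluster and is not red."""
    health = json.loads(health_json)
    return (
        health.get("cluster_name") == expected_cluster
        and health.get("status") in ("green", "yellow")
    )


def check_connection(host: str, port: int, expected_cluster: str,
                     timeout: float = 5.0) -> bool:
    """Query the cluster health endpoint; return False on any network or parse error."""
    # Host, port and cluster name are installation-specific assumptions.
    url = f"http://{host}:{port}/_cluster/health"
    try:
        with urlopen(url, timeout=timeout) as response:
            return cluster_matches(response.read().decode("utf-8"), expected_cluster)
    except (OSError, ValueError):
        # URLError is an OSError; malformed JSON raises a ValueError subclass.
        return False
```

For example, `check_connection("localhost", 9200, "my-cluster")` returns `True` only if a server answers on that HTTP port and reports that cluster name; 9200 is the Elasticsearch default HTTP port and may differ from the port shown in the Semantic Discovery view.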

Exploring semantic categories of data columns

The example below uses a database table which holds customer information.

  1. In the DQ Repository tree view, expand Metadata and browse to the table you want to analyze.

  2. Right-click the table and select Semantic Discovery, or right-click a set of columns in the table and select Semantic Discovery.

    The semantic discovery wizard opens and lists either all the columns of the table or the selected set of columns, depending on whether you started the analysis on a table or on a set of columns. The Category line in the wizard assigns semantic categories to the matched columns.

  3. In the [Semantic Category Inference] page:

    Select or click one of the following:

    • First N Rows, to list in the data preview the first N data records from the selected columns. You set the number of records in the Number of rows field.

    • Reservoir Sampling, to list in the data preview N random records from the selected columns. You set the number of records in the Number of rows field.

    • a percentage in Threshold for meaning discovery, to set the minimum threshold for the matches to show in the Meaning lists of the analyzed columns. This threshold filters out the less probable categories of the analyzed columns.

    • Refresh, to refresh the data preview after any change in the configuration.

  4. From the Category field of each of the matched columns, either:

    • select a category of data from the Category list that best suits the column, or

    • enter a meaningful name for the column that best represents the content.

      To do this, click in the field twice, type the name and press Enter on your keyboard to save the changes. The names you enter are displayed in a different color. This step stores the categories and the semantic names of the columns locally. If no semantic names are found, the categories are stored anyway.

    This is not mandatory but will help you better match table metadata with the concepts stored in the Ontology repository on the log server.

    The percentages of the proposed categories are calculated by analyzing the data in the columns using the following methods: regex, data dictionary and keyword dictionary. The dictionary indexes and regex categories are embedded in the Studio and are used to decide which category the data falls in. For further information about dictionary indexes and regex categories, see the Knowledge Base article Dictionary indexes used in the Semantic Discovery analysis.

  5. Click Next to open a page in the wizard where you can see the results of matching column metadata and semantic concepts with the concepts in the Ontology repository.
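To make the sampling options, the category percentages and the threshold more concrete, here is a minimal sketch of the general technique. It is not the Studio's implementation: the category names, patterns and dictionaries are hypothetical placeholders (the real ones are embedded in the Studio), the keyword rule is an assumed reading of "keyword dictionary", and reservoir sampling is shown as the textbook Algorithm R.

```python
import itertools
import random
import re

# Hypothetical category definitions, for illustration only.
REGEX_CATEGORIES = {
    "EMAIL": re.compile(r"^[^@\s]+@[^@\s]+\.[^@\s]+$"),
    "US_ZIP": re.compile(r"^\d{5}$"),
}
DICTIONARY_CATEGORIES = {"COUNTRY": {"france", "germany", "spain"}}
KEYWORD_CATEGORIES = {"ADDRESS_LINE": {"street", "avenue", "road"}}


def first_n_rows(rows, n):
    """First N Rows: keep the first n records."""
    return list(itertools.islice(rows, n))


def reservoir_sample(rows, n, seed=None):
    """Reservoir sampling (Algorithm R): uniform sample of n records in one pass."""
    rng = random.Random(seed)
    sample = []
    for index, row in enumerate(rows):
        if index < n:
            sample.append(row)
        else:
            slot = rng.randint(0, index)  # both bounds inclusive
            if slot < n:
                sample[slot] = row
    return sample


def category_percentages(values):
    """Percentage of values that match each category."""
    values = list(values)
    if not values:
        return {}
    counts = {}
    for value in values:
        text = value.strip()
        lowered = text.lower()
        tokens = set(re.findall(r"[a-z]+", lowered))
        matched = set()
        # Regex: the whole value must match the pattern.
        matched.update(n for n, p in REGEX_CATEGORIES.items() if p.match(text))
        # Data dictionary: the whole value must be a dictionary entry.
        matched.update(n for n, d in DICTIONARY_CATEGORIES.items() if lowered in d)
        # Keyword dictionary (assumed rule): any word of the value is a keyword.
        matched.update(n for n, k in KEYWORD_CATEGORIES.items() if tokens & k)
        for name in matched:
            counts[name] = counts.get(name, 0) + 1
    return {name: 100.0 * count / len(values) for name, count in counts.items()}


def apply_threshold(percentages, threshold):
    """Keep only the categories whose percentage reaches the threshold."""
    return {name: pct for name, pct in percentages.items() if pct >= threshold}
```

With the column values `["a@b.com", "c@d.org", "12345", "France"]`, `category_percentages` reports EMAIL at 50%, and US_ZIP and COUNTRY at 25% each; a threshold of 30 then leaves only EMAIL as a proposed category.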

Matching column metadata and semantic categories with the concepts in the Ontology repository

After exploring the semantic categories of data as outlined in Exploring semantic categories of data columns, the wizard opens on a chart which represents the results of matching column metadata and new semantic concepts with concepts from the Ontology repository.

A new Semantics line is added to the table. This line corresponds to the attributes found in the Ontology repository as a result of the match operation.

The most relevant concepts are selected by default and all columns associated with the concept are highlighted in the table.

  1. If required, select another concept in the chart.

    The generated analysis will be based on this selection.

  2. Click Next to open a new page in the wizard where you can configure what to enrich the Ontology repository with.

Enriching the Ontology repository

This page of the wizard shows the selected concept and its related attributes. Also, a new line is added to the table: Enrich Action.

Everything you define on the columns in this page is used to enrich the Ontology repository on the log server.

  1. From the Semantic lists for each column, select a new attribute.

    Defining concepts and attributes for columns is important for the choice of indicators to be used on the columns, even if you do not enrich the Ontology repository with them.

  2. From the Action lists for each column, select whether you want to add the new attributes to the repository on the log server and how you want to add them.

    Concepts in the repository will be enriched with synonyms and new attributes.

    The Semantic list may differ from one column to another depending on the content of the Category and Semantics fields.

  3. Click Next and, in the new window, check the Validated status column to make sure the actions you want to perform on the Ontology repository are valid.

    The validation status is as follows:

    • When the concept to add can be matched with concepts from the Ontology but does not already exist in the Ontology repository, the status is valid.

    • When the concept to add cannot be matched with concepts from the Ontology, the status is invalid with a warning icon.

    • When the concept to add can be matched with concepts from the Ontology and already exists in the Ontology repository, the status is invalid with a red warning icon.

    You can always change your selection in the previous window or clear the check box of the action you want to cancel.

    Click Run enrichment to enrich the Ontology repository with the selected attributes.

    The result view at the bottom of the wizard displays a message to confirm what has been added to the Ontology repository.

  4. Click:

    • Finish to create the table analysis with a default name.

    • Next to open a page in the wizard where you can set analysis metadata.
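The three validation outcomes described in step 3 amount to a small decision rule. The sketch below restates that rule in code as a reading aid; the enum and function names are hypothetical and are not part of any Talend API.

```python
from enum import Enum


class ValidationStatus(Enum):
    VALID = "valid"
    INVALID_NO_MATCH = "invalid (warning icon)"
    INVALID_ALREADY_EXISTS = "invalid (red warning icon)"


def validation_status(matches_ontology: bool,
                      exists_in_repository: bool) -> ValidationStatus:
    """Map the two conditions from the wizard to a validation status."""
    if not matches_ontology:
        # The concept cannot be matched with concepts from the Ontology.
        return ValidationStatus.INVALID_NO_MATCH
    if exists_in_repository:
        # The concept matches and is already stored in the repository.
        return ValidationStatus.INVALID_ALREADY_EXISTS
    # The concept matches and is not stored yet: it can be added.
    return ValidationStatus.VALID
```

Only the combination "matched and not yet in the repository" yields a valid status, which is why an action on an attribute that already exists has to be changed or cleared before running the enrichment.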

Defining the recommended table analysis

In the analysis metadata page of the wizard:

  1. Set the analysis metadata (name, purpose and description) and click Finish.

    The analysis editor opens with the recommended indicators already assigned to the columns.

  2. If required, click Select Indicators to open a dialog box and modify the indicators assigned to the columns.