Exploring semantic categories of data columns - Cloud - 7.3

Talend Studio User Guide

Version: Cloud 7.3
Language: English
Product: Talend Big Data, Talend Big Data Platform, Talend Cloud, Talend Data Fabric, Talend Data Integration, Talend Data Management Platform, Talend Data Services Platform, Talend ESB, Talend MDM Platform, Talend Real-Time Big Data Platform
Module: Talend Studio
Content: Design and Development
Last publication date: 2024-02-13
Available in...

Big Data Platform

Cloud API Services Platform

Cloud Big Data Platform

Cloud Data Fabric

Cloud Data Management Platform

Data Fabric

Data Management Platform

Data Services Platform

MDM Platform

Real-Time Big Data Platform

About this task

The example below uses a database table which holds customer information.

Procedure

  1. In the DQ Repository tree view, expand Metadata and browse to the table you want to analyze.
  2. Right-click the table and select Semantic-aware Analysis, or right-click a set of columns in the table and select Semantic-aware Analysis.

    The semantic wizard opens and lists either all the columns of the table or the selected set of columns, depending on whether you started the analysis on a table or on a set of columns. The Category line in the wizard assigns semantic categories to the matched columns.

  3. Configure the Sampling Options in the related section. Select or click:
    - First N Rows to list in the data preview the first N data records from the selected columns. You set the number of records in the Number of rows field.
    - Reservoir Sampling to list in the data preview N random records from the selected columns. You set the number of records in the Number of rows field.
    - Threshold for category discovery to set the minimum threshold for the matches to show in the Category lists of the analyzed columns. This threshold filters out the less probable categories of the analyzed columns.
    - Refresh to refresh the data preview after any change in the configuration.
  4. From the Category field of each of the matched columns, either:
    • Select the category of data from the Category list that best suits the column, or
    • Enter a meaningful name for the column that best represents its content.
  5. To edit the name of a column, click in the field twice, type the name and press Enter on your keyboard to save the changes.
    The names you enter are displayed in a different color. This step stores the categories and the semantic names of the columns locally. If no semantic names are found, the categories are stored anyway.
    This step is not mandatory, but it helps you better match table metadata with the concepts stored in the ontology repository on the log server.

    The percentages of the proposed categories are calculated by analyzing the data in the columns with the following methods: regex, data dictionary and keyword dictionary. The dictionary indexes and regex categories are embedded in the Studio and are used to decide which category the data falls into.

  6. Click Next to open a page in the wizard where you can see the results of matching column metadata and semantic concepts with the concepts in the ontology repository.
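
The sampling options and the category percentages described in the procedure can be illustrated with a small sketch. This is not the Studio's implementation: the category names, the regex patterns and the function names below are hypothetical, the reservoir sampler uses the classic single-pass "Algorithm R", only the regex method is shown (the data and keyword dictionaries are omitted), and the threshold is assumed to be expressed as a percentage of matching values.

```python
import random
import re
from itertools import islice

# Hypothetical regex categories; the real definitions are embedded in the Studio.
CATEGORY_PATTERNS = {
    "EMAIL": re.compile(r"^[^@\s]+@[^@\s]+\.[^@\s]+$"),
    "US_ZIP_CODE": re.compile(r"^\d{5}(-\d{4})?$"),
}


def first_n_rows(rows, n):
    """Return the first n records, like the First N Rows option."""
    return list(islice(rows, n))


def reservoir_sample(rows, n, rng=None):
    """Return n records chosen uniformly at random in a single pass."""
    rng = rng or random.Random()
    sample = []
    for i, row in enumerate(rows):
        if i < n:
            # Fill the reservoir with the first n records.
            sample.append(row)
        else:
            # Replace a reservoir slot with probability n / (i + 1).
            j = rng.randint(0, i)  # randint is inclusive on both ends
            if j < n:
                sample[j] = row
    return sample


def category_percentages(values):
    """Percentage of non-empty values that match each category pattern."""
    values = [v for v in values if v]
    if not values:
        return {}
    return {
        name: 100.0 * sum(1 for v in values if pattern.match(v)) / len(values)
        for name, pattern in CATEGORY_PATTERNS.items()
    }


def discover_categories(values, threshold):
    """Keep categories whose percentage reaches the threshold, best first."""
    scores = category_percentages(values)
    kept = [(name, pct) for name, pct in scores.items() if pct >= threshold]
    return sorted(kept, key=lambda item: item[1], reverse=True)
```

For example, with the column values `["a@example.com", "b@example.org", "12345", "c@example.net"]` and a threshold of 50, `discover_categories` returns `[("EMAIL", 75.0)]`: three of the four values match the email pattern, while the zip code pattern matches only 25% of the values and is filtered out.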