Extracting the stems of English words from a specific DB column - Cloud - 8.0

Text standardization

Version: Cloud, 8.0
Language: English
Product: Talend Big Data Platform, Talend Data Fabric, Talend Data Management Platform, Talend Data Services Platform, Talend MDM Platform, Talend Real-Time Big Data Platform
Module: Talend Studio
Last publication date: 2024-02-20

This scenario applies only to Talend Data Management Platform, Talend Big Data Platform, Talend Real-Time Big Data Platform, Talend MDM Platform, Talend Data Services Platform, and Talend Data Fabric.

For more technologies supported by Talend, see Talend components.

This scenario describes a six-component Job that carries out linguistic normalization on the data in the translation column and extracts the base part (word stem) of every English word.

The aim of this Job is to build a dictionary of the stems of the English words listed in the translation column. This dictionary can be used at a later stage to check new words before they are added to the selected table. The extracted English stems are written to an output file along with the number of times each one occurs in the translation column.
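To make the expected result more concrete, the following is a minimal Python sketch of the idea, not of the Talend Job itself. It assumes a deliberately naive suffix-stripping rule (the hypothetical `naive_stem` function and `SUFFIXES` list), made-up sample values standing in for the translation column, and a semicolon as the output delimiter. The stemming component used in the Job applies a far more complete algorithm.

```python
import re
from collections import Counter

# Hypothetical, deliberately tiny suffix list for illustration only.
# A real stemmer (for example Porter or Snowball) applies many more rules.
SUFFIXES = ("ing", "ed", "es", "s")


def naive_stem(word: str) -> str:
    """Lowercase the word and strip the first matching suffix,
    keeping at least three characters of stem."""
    word = word.lower()
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def count_stems(translations):
    """Split each value into words, stem each word,
    and count how often each stem occurs."""
    counts = Counter()
    for text in translations:
        for word in re.findall(r"[A-Za-z]+", text):
            counts[naive_stem(word)] += 1
    return counts


# Made-up sample values standing in for the translation column.
sample = ["walking", "walked", "walks", "talk", "talks"]
stem_counts = count_stems(sample)

# One "stem;occurrences" line per stem (the delimiter is an assumption).
for stem, n in sorted(stem_counts.items()):
    print(f"{stem};{n}")
```

With this sample, the three inflected forms of "walk" collapse into one stem with a count of 3, which is the kind of stem-plus-occurrences record the Job writes to its output file.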

In this scenario, you have already stored the main input schema in the Repository. For more information about storing schema metadata in the Repository, see Managing metadata in Talend Studio.

The main input table contains eight columns: id_key, id_lang, translation, id_status, id_user_trans, id_user_validate, id_editor and date. The goal is to extract the stems of the English words in the translation column.