Extracting the stems of English words from a specific DB column

This scenario applies only to Talend Data Management Platform, Talend Big Data Platform, Talend Real-Time Big Data Platform, Talend MDM Platform, Talend Data Services Platform and Talend Data Fabric.

For more technologies supported by Talend, see Talend components.

This scenario describes a six-component Job that carries out linguistic normalization on the data in the translation column and extracts the base part (word stem) of all English words.

The aim of this Job is to create a kind of dictionary of the stems of the English words listed in the translation column. This dictionary can be used at a later stage to check new words before they are added to the selected table. The extracted English stems are written to an output file along with the number of times each one occurs in the translation column.
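To make the idea of a stem dictionary with occurrence counts concrete, here is a minimal Python sketch. It is not how the Job is built (the Job uses Talend components), and it does not reproduce the stemming algorithm Talend applies. The suffix list, the function names naive_stem and count_stems, and the sample values are all hypothetical and chosen only for illustration.

```python
import re
from collections import Counter

# Hypothetical, deliberately naive suffix list. A real English stemmer
# applies many more rules than this stand-in does.
SUFFIXES = ("ing", "ed", "s")


def naive_stem(word: str) -> str:
    """Lowercase the word and strip the first matching suffix,
    keeping at least three characters of stem."""
    word = word.lower()
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def count_stems(translations):
    """Split each value into alphabetic tokens and count how often
    each stem occurs across all values."""
    counts = Counter()
    for text in translations:
        for token in re.findall(r"[A-Za-z]+", text):
            counts[naive_stem(token)] += 1
    return counts


# Hypothetical sample values standing in for the translation column.
sample = ["Connecting databases", "connected database", "Connect"]
stem_counts = count_stems(sample)

# Print one "stem;count" line per stem, most frequent first.
for stem, count in stem_counts.most_common():
    print(f"{stem};{count}")
```

With these sample values, the three forms of "connect" collapse into one stem and the two forms of "database" into another, which is the kind of stem-plus-count pairing the output file of the Job is meant to hold.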

In this scenario, you have already stored the main input schema in the Repository. For more information about storing schema metadata in the Repository, see Managing metadata in Talend Studio.

The main input table contains eight columns: id_key, id_lang, translation, id_status, id_user_trans, id_user_validate, id_editor and date. The goal is to extract the stems of the English words in the translation column.
