For more technologies supported by Talend, see Talend components.
This scenario describes a six-component Job that carries out linguistic normalization on the data in the translation column and extracts the base part (word stem) of every English word.
The aim of this Job is to build a dictionary of the stems of the English words listed in the translation column. This dictionary can be used later to check new words that are to be added to the selected table. The extracted English stems are written to an output file along with the number of times each stem occurs in the translation column.
In this scenario, the main input schema is already stored in the Repository. For more information about storing schema metadata in the Repository, see the Talend Studio User Guide.
The main input table contains eight columns: id_key, id_lang, translation, id_status, id_user_trans, id_user_validate, id_editor and date. We want to extract the stem of the English words in the translation column.
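To make the expected result concrete, the following Python sketch reproduces the idea outside Talend Studio: it reads the translation column, reduces each English word to a stem, and counts how often each stem occurs. It is only an illustration, not the logic the Job's components actually run. The `naive_stem` function, its suffix list, and the sample rows are hypothetical, and the suffix stripping is far cruder than a real stemming algorithm.

```python
from collections import Counter
import re

# Hypothetical, deliberately naive suffix list (an assumption for illustration).
# A real stemmer applies many more rules than this.
SUFFIXES = ("ing", "ed", "es", "s")


def naive_stem(word):
    """Lower-case a word and strip the first matching suffix."""
    word = word.lower()
    for suffix in SUFFIXES:
        # Keep at least three characters so short words are not over-trimmed.
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def count_stems(rows):
    """Count stem occurrences across the 'translation' column of all rows."""
    counts = Counter()
    for row in rows:
        for word in re.findall(r"[A-Za-z]+", row["translation"]):
            counts[naive_stem(word)] += 1
    return counts


# Made-up sample rows; only two of the eight columns are shown.
rows = [
    {"id_key": 1, "translation": "Connecting to the server"},
    {"id_key": 2, "translation": "Connected servers"},
]

# Write one "stem;count" line per stem, as the output file would.
for stem, n in sorted(count_stems(rows).items()):
    print(f"{stem};{n}")
```

With these sample rows, "Connecting" and "Connected" both reduce to the stem connect, so that stem is reported with a count of 2.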