Transforming messages to words

Machine Learning

Version

7.3

Language

English

Product

Talend Big Data

Talend Big Data Platform

Talend Data Fabric

Talend Real-Time Big Data Platform

Module

Talend Studio

Last publication date

2024-02-21

Deprecated

Procedure

Double-click the tModelEncoder component labelled Tokenize to open its Component view. This component tokenizes the SMS messages into words.
Click the Sync columns button to retrieve the schema from the preceding component.
Click the [...] button next to Edit schema to open the schema editor.
On the output side, click the [+] button to add one row and, in the Column column, name the new row sms_tokenizer_words. This column is used to carry the tokenized messages.
In the Type column, select Object for this sms_tokenizer_words row.
Click OK to validate these changes.
In the Transformations table, add one row by clicking the [+] button and then proceed as follows:
1. In the Input column column, select the column that provides the data to be transformed into features. In this scenario, it is sms_contents.
2. In the Output column column, select the column that carries the features. In this scenario, it is sms_tokenizer_words.
3. In the Transformation column, select the algorithm to be used for the transformation. In this scenario, it is Regex tokenizer.
4. In the Parameters column, enter the parameters you want to customize for use in the algorithm you have selected. In this scenario, enter pattern=\\W;minTokenLength=3.
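
The parameter string entered in the last step is a list of key=value pairs separated by semicolons. The following Python sketch only illustrates how such a string can be read; it is not how Talend Studio parses it. The helper name parse_parameters is hypothetical, and the code assumes that the doubled backslash in \\W is Java-style string escaping for the regular expression \W.

```python
def parse_parameters(spec: str) -> dict:
    """Split a 'key=value;key=value' string into a dict (hypothetical helper)."""
    params = {}
    for item in spec.split(";"):
        item = item.strip()
        if not item:
            continue  # ignore empty segments such as a trailing semicolon
        key, _, value = item.partition("=")
        params[key.strip()] = value.strip()
    return params


# Raw string, so both backslashes are kept exactly as typed in the Parameters column.
params = parse_parameters(r"pattern=\\W;minTokenLength=3")

# Assumption: the doubled backslash is an escaped form of the regex \W.
pattern = params["pattern"].replace("\\\\", "\\")
min_token_length = int(params["minTokenLength"])

print(pattern, min_token_length)
```

Read this way, pattern is the regular expression \W, which matches any non-word character, and minTokenLength is the minimum number of characters a token must have to be kept.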

Results

Using this transformation, tModelEncoder splits each input message on non-word characters (the \W pattern), keeps only the tokens that contain at least 3 characters, and puts the result of the transformation in the sms_tokenizer_words column. Thus currency symbols, punctuation marks, numeric values of fewer than three digits, and short words such as a, an, or to are excluded from this column.
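
The effect of these two parameters can be approximated outside the Studio with Python's standard re module. This is a minimal sketch, not the tModelEncoder implementation: the function name regex_tokenize and the sample message are made up for illustration, and details such as case handling in the real component are not modelled.

```python
import re


def regex_tokenize(text, pattern=r"\W", min_token_length=3):
    """Split text on every match of pattern and drop short tokens (illustrative only)."""
    tokens = re.split(pattern, text)
    # Empty strings appear between consecutive separators; the length filter drops them too.
    return [token for token in tokens if len(token) >= min_token_length]


# Hypothetical SMS message, not taken from the scenario's data set.
sample = "Win a £5 prize, txt GO to 80 now!"
print(regex_tokenize(sample))  # ['Win', 'prize', 'txt', 'now']
```

In this sketch digits count as word characters, so a number is dropped only because it is shorter than the minimum token length.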