Transforming messages to words

Procedure

Double-click the tModelEncoder component labeled Tokenize to open its Component view. This component tokenizes the SMS messages into words.
Click the Sync columns button to retrieve the schema from the preceding component.
Click the [...] button next to Edit schema to open the schema editor.
On the output side, click the [+] button to add one row and in the Column column, rename it to sms_tokenizer_words. This column is used to carry the tokenized messages.
In the Type column, select Object for this sms_tokenizer_words row.
Click OK to validate these changes.
In the Transformations table, add one row by clicking the [+] button and then proceed as follows:
1. In the Input column column, select the column that provides data to be transformed to features. In this scenario, it is sms_contents.
2. In the Output column column, select the column that carries the features. In this scenario, it is sms_tokenizer_words.
3. In the Transformation column, select the algorithm to be used for the transformation. In this scenario, it is Regex tokenizer.
4. In the Parameters column, enter the parameters you want to customize for use in the algorithm you have selected. In this scenario, enter pattern=\\W;minTokenLength=3.

Results

Using this transformation, tModelEncoder splits each input message on non-word characters (the \\W pattern), keeps only the words that contain at least 3 characters, and puts the result in the sms_tokenizer_words column. As a result, currency symbols, numeric values, punctuation marks, and short words such as a, an, or to are excluded from this column.
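
The effect of the pattern and minTokenLength parameters can be approximated outside the Studio with a short Python sketch. It is an illustration only, not the component's implementation: the function name regex_tokenize and the sample message are hypothetical, and the lowercasing step assumes that the Regex tokenizer behaves like the Spark ML RegexTokenizer defaults.

```python
import re

def regex_tokenize(message, pattern=r"\W", min_token_length=3):
    """Split a message on a regex and drop tokens shorter than min_token_length."""
    # Assumption: tokens are lowercased before splitting, as Spark ML's
    # RegexTokenizer does by default.
    tokens = re.split(pattern, message.lower())
    # Splitting on every non-word character yields empty strings between
    # adjacent separators; the length filter removes them along with short words.
    return [token for token in tokens if len(token) >= min_token_length]

# Hypothetical SMS message used only for illustration.
sample = "Win a $5 prize, txt GO to claim!"
print(regex_tokenize(sample))
```

For the sample message, the sketch keeps win, prize, txt and claim, and drops a, 5, go and to together with the punctuation and the currency symbol. Note that in this sketch the length filter is the only thing that removes numbers, so a number with three or more digits would be kept.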
