Transforming messages to words

Machine Learning

author
Talend Documentation Team
EnrichVersion
6.5
EnrichProdName
Talend Big Data Platform
Talend Data Fabric
Talend Real-Time Big Data Platform
Talend Big Data
task
Data Quality and Preparation > Third-party systems > Machine Learning components
Data Governance > Third-party systems > Machine Learning components
Design and Development > Third-party systems > Machine Learning components
EnrichPlatform
Talend Studio

Procedure

  1. Double-click the tModelEncoder component labelled Tokenize to open its Component view. This component tokenizes the SMS messages into words.
  2. Click the Sync columns button to retrieve the schema from the preceding component.
  3. Click the [...] button next to Edit schema to open the schema editor.
  4. On the output side, click the [+] button to add one row and, in the Column column, rename it to sms_tokenizer_words. This column is used to carry the tokenized messages.
  5. In the Type column, select Object for this sms_tokenizer_words row.
  6. Click OK to validate these changes.
  7. In the Transformations table, add one row by clicking the [+] button and then proceed as follows:
    1. In the Input column column, select the column that provides the data to be transformed into features. In this scenario, it is sms_contents.
    2. In the Output column column, select the column that carries the features. In this scenario, it is sms_tokenizer_words.
    3. In the Transformation column, select the algorithm to be used for the transformation. In this scenario, it is Regex tokenizer.
    4. In the Parameters column, enter the parameters you want to customize for use in the algorithm you have selected. In this scenario, enter pattern=\\W;minTokenLength=3.
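
The value entered in the Parameters column is a semicolon-separated list of key=value pairs. As a minimal sketch in plain Python of how such a string breaks down, the helper parse_params below is hypothetical and is not Talend's own parser. It also assumes that the doubled backslash is Java-style escaping for a single backslash, so that the effective regular expression is \W.

```python
def parse_params(raw: str) -> dict:
    """Split 'key=value;key=value' into a dict of strings (hypothetical helper)."""
    params = {}
    for pair in raw.split(";"):
        if not pair.strip():
            continue  # ignore empty segments such as a trailing ';'
        key, _, value = pair.partition("=")
        params[key.strip()] = value.strip()
    return params


# The string exactly as typed in the Parameters column.
raw = r"pattern=\\W;minTokenLength=3"
params = parse_params(raw)

# Assumption: '\\' in the field stands for one backslash, giving the regex \W.
pattern = params["pattern"].replace("\\\\", "\\")
min_token_length = int(params["minTokenLength"])
```

Under these assumptions, pattern holds the regular expression \W (any non-word character) and min_token_length holds the integer 3.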

Results

Using this transformation, tModelEncoder splits each input message on non-word characters (the \\W pattern), keeps only the tokens that contain at least 3 characters, and puts the result in the sms_tokenizer_words column. Thus currency symbols, punctuation, short numeric values, and words such as a, an, or to are excluded from this column.
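
As a rough illustration of the filtering described above, the following plain-Python sketch imitates a regex tokenizer with pattern \W and a minimum token length of 3. It is not the tModelEncoder implementation: the function regex_tokenize and the sample message are made up for this example, and any case handling the component may apply is left out.

```python
import re


def regex_tokenize(text, pattern=r"\W", min_token_length=3):
    """Split text wherever the pattern matches, then drop short tokens."""
    # The pattern marks the gaps between tokens; consecutive separators
    # produce empty strings, which the length filter removes.
    tokens = re.split(pattern, text)
    return [t for t in tokens if len(t) >= min_token_length]


# Hypothetical SMS message, not taken from the scenario's data set.
sample = "Win a £100 prize, reply to 80 now!"
words = regex_tokenize(sample)
print(words)  # ['Win', '100', 'prize', 'reply', 'now']
```

In this sketch the currency symbol, the punctuation, the words a and to, and the two-digit number 80 are dropped, while 100 is kept because it is three characters long.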