Converting the tokenized text to the CoNLL format - 7.0

Natural Language Processing

author
Talend Documentation Team
EnrichVersion
7.0
EnrichProdName
Talend Big Data Platform
Talend Data Fabric
Talend Real-Time Big Data Platform
task
Data Governance > Third-party systems > Natural Language Processing
Data Quality and Preparation > Third-party systems > Natural Language Processing
Design and Development > Third-party systems > Natural Language Processing
EnrichPlatform
Talend Studio
To learn a classification model from a text, you must first divide the text into tokens and then convert it to the CoNLL format using tNormalize.

Procedure

  1. Double-click the tNLPPreprocessing component to open its Basic settings view and define its properties.
    1. Click Sync columns to retrieve the schema from the previous component connected in the Job.
    2. From the NLP Library list, select the library to be used for tokenization. In this example, ScalaNLP is used.
    3. From the Column to preprocess list, select the column that holds the text to be divided into tokens, which is message in this example.
  2. Double-click the tFilterColumns component to open its Basic settings view and define its properties.
    1. Click Edit schema, add the tokens column to the output schema because this is the column to be normalized, and click OK to validate.
  3. Double-click the tNormalize component to open its Basic settings view and define its properties.
    1. Click Sync columns to retrieve the schema from the previous component connected in the Job.
    2. From the Column to normalize list, select tokens.
    3. In the Item separator field, enter "\t" so that the tab-separated tokens are split into one token per row.
  4. Double-click the tFileOutputDelimited component to open its Basic settings view and define its properties.
    1. Click Sync columns to retrieve the schema from the previous component connected in the Job.
    2. In the Folder field, specify the path to the folder where the CoNLL files will be stored.
    3. In the Row Separator field, enter "\n".
    4. In the Field Separator field, enter "\t" to separate fields with a tab.
  5. Press F6 to save and execute the Job.
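
The data flow configured in this procedure can be approximated outside Talend Studio. The following Python sketch is only an illustration under stated assumptions, not the code that the Job generates: the regex tokenizer is a stand-in for the selected NLP library, and the sample message, the function names, and the output file name are hypothetical.

```python
import re
import tempfile
from pathlib import Path


def tokenize(text):
    """Split text into word and punctuation tokens.

    Assumption: this regex is only a stand-in for the tokenizer of the
    NLP library selected in the Job; real tokenizers differ in detail.
    """
    return re.findall(r"\w+|[^\w\s]", text)


def normalize(field, item_separator="\t"):
    """Turn one delimited field into one item per row, skipping empty items."""
    return [item for item in field.split(item_separator) if item]


def write_rows(rows, path, row_separator="\n"):
    """Write one value per row, ending each row with the row separator."""
    # newline="" keeps "\n" from being translated on Windows.
    with open(path, "w", encoding="utf-8", newline="") as handle:
        for row in rows:
            handle.write(row + row_separator)


if __name__ == "__main__":
    message = "John Smith lives in Paris."       # hypothetical input row
    tokens_field = "\t".join(tokenize(message))  # all tokens held in one column
    rows = normalize(tokens_field)               # one token per row
    out_file = Path(tempfile.mkdtemp()) / "tokens.txt"  # hypothetical file name
    write_rows(rows, out_file)
    print(out_file.read_text(encoding="utf-8"))
```

Here `normalize` plays the role of the Column to normalize and Item separator settings, and `write_rows` plays the role of the Row Separator setting.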

Results

The output files are created in the specified folder. The files contain a single column with one token per row.

Before you can learn a classification model from this text data, manually label person names with PER and all other tokens with O.
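
As a rough illustration of what the labeled data might look like, the Python sketch below pairs each token with PER or O. It rests on several assumptions that the procedure does not confirm: labeling is done by hand in practice, so the lookup set of person names is purely hypothetical, as are the function names and the two-column layout (token, tab, label).

```python
def label_tokens(tokens, person_names):
    """Pair each token with PER if it is in the lookup set, otherwise O."""
    return [(token, "PER" if token in person_names else "O") for token in tokens]


def to_labeled_rows(labeled, field_separator="\t", row_separator="\n"):
    """Render (token, label) pairs as one tab-separated row per token.

    Assumption: a two-column token/label layout; check the format that
    your model-training component actually expects.
    """
    return "".join(
        f"{token}{field_separator}{label}{row_separator}"
        for token, label in labeled
    )


if __name__ == "__main__":
    tokens = ["John", "Smith", "lives", "in", "Paris", "."]  # hypothetical tokens
    person_names = {"John", "Smith"}                         # hypothetical lookup set
    print(to_labeled_rows(label_tokens(tokens, person_names)), end="")
```

A lookup set like this can only pre-fill labels. Ambiguous tokens still need manual review before the data is used for training.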