Natural Language Processing using Talend Studio - 7.0

Natural Language Processing

author
Talend Documentation Team
EnrichVersion
7.0
EnrichProdName
Talend Big Data Platform
Talend Data Fabric
Talend Real-Time Big Data Platform
task
Data Governance > Third-party systems > Natural Language Processing
Data Quality and Preparation > Third-party systems > Natural Language Processing
Design and Development > Third-party systems > Natural Language Processing
EnrichPlatform
Talend Studio
Using Talend Studio and machine learning on Spark, you can teach computers to understand how humans learn and use natural language.

What is natural language processing?

Natural language processing tasks include:
  • text tokenization, which divides a text into basic units such as words or punctuation marks;

  • sentence splitting, which divides the input into sentences, based on ending characters, such as periods or question marks; and

  • named entity recognition, which finds and classifies person names, dates, locations and organizations in a text.
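
The first two tasks above can be illustrated with a minimal Python sketch. It is not how Talend Studio implements them: the function names `split_sentences` and `tokenize` and the regular expressions are assumptions chosen for illustration, and the naive rules would mis-split abbreviations such as "Dr.". Named entity recognition is left out because it requires a trained model.

```python
import re

def split_sentences(text):
    """Split after '.', '?' or '!' when followed by whitespace (naive rule)."""
    parts = re.split(r"(?<=[.?!])\s+", text.strip())
    return [p for p in parts if p]

def tokenize(sentence):
    """Return word tokens and single punctuation marks as separate units."""
    return re.findall(r"\w+|[^\w\s]", sentence)

text = "Alice works at Talend. Does Bob work there too?"
sentences = split_sentences(text)
tokens = [tokenize(s) for s in sentences]

print(sentences)
print(tokens)
```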

Natural language processing is useful to:
  • extract person names or company names from textual resources;

  • group forum discussions together by topics;

  • find discussions where people are mentioned but do not participate in the discussion; or

  • link entities.

Natural language processing can help you create links between user profiles and mentions in the text, between persons and organizations, or between persons and any other information that may be used for re-identification.

Workflow

Machine learning with Spark usually involves two phases: the first phase computes a model based on historical data and mathematical heuristics, and the second phase applies the model to text data. In Talend Studio, the first phase is implemented by two Jobs:
  • the first one with the tNLPPreprocessing and the tNormalize components; and

  • the second one with the tNLPModel component.

The second phase is implemented by a third Job with the tNLPPredict component.

In this workflow, tNLPPreprocessing:
  • divides a text sample into tokens; and

  • cleans the text sample by removing all HTML tags.

Then, tNormalize converts tokens to the CoNLL format.
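
To make the preprocessing steps concrete, here is a rough Python sketch of the same idea outside Talend Studio: remove HTML tags, split the text into tokens, then write one token per line in a CoNLL-style layout. The helper names, the regular expressions, the tab separator and the placeholder `O` ("outside any entity") label column are assumptions for illustration, not the exact output of tNLPPreprocessing or tNormalize.

```python
import re

def strip_html(text):
    """Replace every HTML tag with a space (naive, regex-based)."""
    return re.sub(r"<[^>]+>", " ", text)

def tokenize(text):
    """Return word tokens and single punctuation marks as separate units."""
    return re.findall(r"\w+|[^\w\s]", text)

def to_conll(tokens, default_label="O"):
    """One token per line, followed by a tab and a placeholder label."""
    return "\n".join(f"{tok}\t{default_label}" for tok in tokens)

sample = "<p>John Smith joined <b>Talend</b> in Paris.</p>"
tokens = tokenize(strip_html(sample))
conll = to_conll(tokens)

print(conll)
```

Each line of the result holds one token and its label column, which is the part you would edit by hand when labeling.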

You can then manually label the tokens and add optional features by editing the files. For example, you can label person names with PER.

Next, in the second Job, you can pass the labeled, tokenized sample text to tNLPModel, which:
  • generates features for each token; and

  • trains a classification model.
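
As an illustration of what per-token feature generation can look like, the sketch below reads a small hand-labeled CoNLL-style sample and builds a feature dictionary for each token. The sample data, the tab-separated layout and the feature set (lowercase form, capitalization, punctuation flag, neighboring tokens) are all assumptions; the features tNLPModel actually computes are not described here, and the training step is not shown.

```python
# Hypothetical labeled sample: only person names carry the PER label.
labeled = "John\tPER\nSmith\tPER\njoined\tO\nTalend\tO\n.\tO"

def parse_conll(block):
    """Read 'token<TAB>label' lines into (token, label) pairs."""
    rows = []
    for line in block.splitlines():
        if line.strip():
            token, label = line.split("\t")
            rows.append((token, label))
    return rows

def token_features(tokens, i):
    """Build a small, illustrative feature dictionary for token i."""
    tok = tokens[i]
    return {
        "lower": tok.lower(),
        "is_capitalized": tok[:1].isupper(),
        "is_punct": not any(ch.isalnum() for ch in tok),
        "prev": tokens[i - 1].lower() if i > 0 else "<START>",
        "next": tokens[i + 1].lower() if i < len(tokens) - 1 else "<END>",
    }

rows = parse_conll(labeled)
tokens = [tok for tok, _ in rows]
labels = [label for _, label in rows]
features = [token_features(tokens, i) for i in range(len(tokens))]

for feats, label in zip(features, labels):
    print(label, feats)
```

A classifier would then be trained on the (features, label) pairs; context features such as `prev` and `next` are a common choice because a token's neighbors often indicate whether it is part of a name.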

tNLPPredict labels text data automatically using the classification model generated by tNLPModel.

For example, you can extract named entities with <PER> labels:
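
The following Python sketch shows one way predicted labels could be turned into extracted entities and `<PER>`-tagged text. The `predicted` pairs are made-up sample data standing in for model output, and the function name `group_spans` and the tag layout are assumptions; the actual output format of tNLPPredict may differ.

```python
from itertools import groupby

# Hypothetical (token, label) pairs standing in for model predictions.
predicted = [
    ("John", "PER"), ("Smith", "PER"), ("met", "O"),
    ("Mary", "PER"), ("in", "O"), ("Paris", "O"), (".", "O"),
]

def group_spans(pairs):
    """Merge consecutive tokens that share a label into (text, label) spans."""
    return [
        (" ".join(tok for tok, _ in group), label)
        for label, group in groupby(pairs, key=lambda pair: pair[1])
    ]

spans = group_spans(predicted)
persons = [text for text, label in spans if label == "PER"]
tagged = " ".join(
    f"<PER>{text}</PER>" if label == "PER" else text
    for text, label in spans
)

print(persons)
print(tagged)
```

Note that this simple grouping merges two different person names into one span when they appear next to each other; labeling schemes that mark the beginning of each entity avoid that ambiguity.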