Natural Language Processing using Talend Studio - 7.3

Natural Language Processing

Version
7.3
Language
English
Product
Talend Big Data Platform
Talend Data Fabric
Talend Real-Time Big Data Platform
Module
Talend Studio
Content
Data Governance > Third-party systems > Natural Language Processing
Data Quality and Preparation > Third-party systems > Natural Language Processing
Design and Development > Third-party systems > Natural Language Processing
Last publication date
2024-02-21
Using Talend Studio and machine learning on Spark, you can teach computers to understand natural language as humans learn and use it.

What is natural language processing?

Natural language processing tasks include:
  • text tokenization, which divides a text into basic units such as words or punctuation marks;

  • sentence splitting, which divides the input into sentences based on sentence-ending characters such as periods or question marks; and

  • named entity recognition, which finds and classifies person names, dates, locations, and organizations in a text.
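
As an illustration, applying these tasks to the sentence "Emma Thompson joined Talend in 2016." could produce something like the following (the sentence and the results shown here are examples only; actual output depends on the components and models you use):

  Tokens:          Emma | Thompson | joined | Talend | in | 2016 | .
  Sentences:       one sentence, ending at the final period
  Named entities:  Emma Thompson (person), Talend (organization), 2016 (date)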

Natural language processing is useful to:
  • extract person names or company names from textual resources;

  • group forum discussions together by topics;

  • find discussions where people are mentioned but do not participate in the discussion; or

  • link entities.

Natural language processing can help you create links between user profiles and mentions in the text, between persons and organizations, or between persons and any other information that may be used for re-identification.

Workflow

Machine learning with Spark usually involves two phases: the first phase computes a model from historical data and mathematical heuristics, and the second phase applies the model to text data. In Talend Studio, the first phase is implemented by two Jobs:
  • the first one with the tNLPPreprocessing and the tNormalize components; and

  • the second one with the tNLPModel component.

The second phase is implemented by a third Job with the tNLPPredict component.

In this workflow, tNLPPreprocessing:
  • divides a text sample into tokens; and

  • cleans the text sample by removing all HTML tags.

Then, tNormalize converts tokens to the CoNLL format.

You can then manually label the tokens and add optional features by editing the files. For example, you can label person names with PER:
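
In this illustrative CoNLL-style sample (the sentence and the exact column layout are examples only), each token is on its own line with its label in the last column, and O marks tokens that are not part of a named entity:

  Emma        PER
  Thompson    PER
  joined      O
  Talend      O
  in          O
  2016        O
  .           O
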
Next, you can use the labeled, tokenized sample text with tNLPModel in the second Job, where tNLPModel:
  • generates features for each token, as illustrated below; and

  • trains a classification model.
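
For a token such as Emma, the generated features could include, for example, the token itself, its lowercase form, whether it starts with a capital letter, whether it contains digits, and the neighboring tokens; the exact feature set depends on how you configure tNLPModel, and the features listed here are only an illustration.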

tNLPPredict labels text data automatically using the classification model generated by tNLPModel.

For example, you can extract named entities with <PER> labels:
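
An illustrative result could look like the following, where the tokens predicted to be part of a person name are enclosed in <PER> tags (the sentence and the exact output format shown here are examples only and depend on how you configure the Job):

  <PER>Emma Thompson</PER> joined Talend in 2016.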