Scenario 4: Using two parsing levels to extract information from unstructured data - 6.1

Talend Components Reference Guide

This scenario describes how to build a set of rules to extract information from unstructured data. It explains how to use a basic ANTLR rule to tokenize the data, and then how to use an advanced rule to check each token created by ANTLR against a regular expression.

This scenario uses:

  • a tFixedFlowInput component to create the unstructured data strings.

  • a tStandardizeRow component to define the rules necessary to extract the liquid amounts from the data strings.

  • a tLogRow component to display the output data.

Setting up the Job

  1. Drop the following components from the Palette onto the design workspace: tFixedFlowInput, tStandardizeRow and tLogRow.

  2. Connect the three components together using the Main links.

Creating the unstructured data

  1. Double-click the tFixedFlowInput component to display its Basic settings view.

  2. Click the [...] button to open the [Schema] dialog box, click the [+] button to add a column, name the column product and finally click OK to validate and close the box.

  3. In the Mode area, select Use Inline Content (delimited file).

  4. In the Content field, enter the following three strings:

    3M PROJECT LAMP 7 LUMENS 32ML
    A 5 LUMINES 5 LOW VANILLA 5L 5LIGHT 5 L DULUX L
    54MLP FAC 32 ML

Creating the parsing rules

  1. Double-click the tStandardizeRow component to display its Basic settings view.

  2. From the Column to parse list, select product.

  3. In the Conversion rules table, define a basic rule and an advanced rule as follows:

    • Click the [+] button twice to add two rows. Name the first rule "Amount" and the second rule "LiquidAmount".

    • Select Format as the type for the basic rule, and define it to read "INT WHITESPACE* WORD".

    • Select RegExp as the type for the advanced rule, and define it to read "\\d+\\s*(L|ML)\\b".

      The advanced rule is executed after the basic ANTLR rule. The "Amount" rule tokenizes the amounts in the three strings: it matches any word with a number in front of it. The RegExp rule then checks each token created by ANTLR against the regular expression.

  4. Click the Generate parser code in Routines button in order to generate the code under the Routines folder in the DQ Repository tree view of the Profiling perspective.

    This step is mandatory; otherwise, the Job cannot be executed.

  5. In the Advanced settings view, leave the options selected by default in the Output format area as they are.

    The Max edits for fuzzy match is set to 1 by default.

  6. Double-click the tLogRow component and select the Table (print values in cells of a table) option in the Mode area.
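
The two parsing levels defined in step 3 can be approximated outside the Studio to see which tokens survive the second check. The following Python sketch is not the parser code that tStandardizeRow generates. It assumes that INT is a run of digits and WORD is a run of ASCII letters, and that the RegExp rule is tested from the start of each token. The names AMOUNT, LIQUID_AMOUNT and extract_liquid_amounts are hypothetical. The regular expression uses single backslashes because a Python raw string does not need the doubled escaping entered in the Conversion rules table.

```python
import re

# Level 1: rough stand-in for the basic rule "INT WHITESPACE* WORD".
# Assumption: INT is a run of digits and WORD is a run of ASCII letters;
# the token definitions used by the generated ANTLR parser may differ.
AMOUNT = re.compile(r"\d+\s*[A-Za-z]+")

# Level 2: the RegExp rule from the scenario, "\\d+\\s*(L|ML)\\b",
# written here as a raw string so the backslashes are not doubled.
LIQUID_AMOUNT = re.compile(r"\d+\s*(L|ML)\b")


def extract_liquid_amounts(text):
    """Tokenize the amounts, then keep the tokens that pass the regex check."""
    # First pass: every number followed by a word becomes a token.
    tokens = [m.group(0) for m in AMOUNT.finditer(text)]
    # Second pass: check each token against the regular expression.
    # Assumption: the check is anchored at the start of the token.
    return [t for t in tokens if LIQUID_AMOUNT.match(t)]


rows = [
    "3M PROJECT LAMP 7 LUMENS 32ML",
    "A 5 LUMINES 5 LOW VANILLA 5L 5LIGHT 5 L DULUX L",
    "54MLP FAC 32 ML",
]

for row in rows:
    print(extract_liquid_amounts(row))
```

In this sketch, the first pass turns "5 LUMINES", "5LIGHT" and "54MLP" into tokens as well, but the trailing \b rejects them in the second pass because the letter after L or ML is still part of the same word. The output of the Job itself depends on the generated parser and on the Output format options.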

Executing the Job

  • Save your Job and press F6 to execute it.

    The liquid amounts are extracted from the unstructured data in two passes: a basic ANTLR rule tokenizes the amounts, and an advanced rule then checks each token created by ANTLR against the regular expression.

    Each instance of the XML data is written on a separate row because the Pretty print check box is selected in the Advanced settings of the tStandardizeRow component.