Different rule types for different parsing levels

Standardization

Version: 7.3
Language: English
Products:
Talend Big Data
Talend Big Data Platform
Talend Data Fabric
Talend Data Integration
Talend Data Management Platform
Talend Data Services Platform
Talend ESB
Talend MDM Platform
Talend Real-Time Big Data Platform
Module: Talend Studio
Last publication date: 2024-02-21

The tStandardizeRow component uses two types of rules: basic rules, which are based on the ANTLR grammar, and advanced rules, which are defined by Talend and are not based on ANTLR.

Using ANTLR rules alone cannot always meet your needs when normalizing and standardizing data. Suppose, for example, that you want to extract the liquid amount from the following three records:
3M PROJECT LAMP 7 LUMENS 32ML
A 5 LUMINES 5 LOW VANILLA 5L 5LIGHT 5 L DULUX L
54MLP FAC 32 ML

You may start by defining a liquid unit and a liquid amount as basic parser rules, as follows:
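For illustration, these rules might be defined as follows in the Name/Type/Value rule table of tStandardizeRow; the values shown here are assumptions based on the surrounding text, not the original definitions:

Name          Type          Value
LiquidUnit    Enumeration   'L' | 'ML'
LiquidAmount  Format        INT ' '? LiquidUnit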

If you test these rules in the Profiling perspective of the Studio, you can see that they extract 7 L from 7 LUMENS, which is not what you expect: you do not want the word LUMENS to be split into two tokens.

The basic rules you have defined above are ANTLR lexer rules, and lexer rules are used for tokenizing the input string. ANTLR does not provide a word boundary symbol like the \b used in regular expressions. You must therefore choose lexer rules carefully, because they define how the input strings are split into tokens.
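For example, with the sketch above, nothing stops a match at a letter boundary, so the lexer can match LiquidAmount inside a longer word (token breakdown assumed):

Input:     7 LUMENS
Matched:   LiquidAmount = "7 L"
Leftover:  "UMENS"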

You can solve such a problem using two approaches:

The first approach is to define another basic rule that matches any word preceded by a numeric value, the Amount rule in this example:
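For illustration, the Amount rule might look as follows in the rule table; the value is an assumption, with WORD standing for an ANTLR fragment that matches a run of letters:

Name    Type    Value
Amount  Format  INT ' '? WORD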

This basic rule is a lexer rule: a Format rule whose name starts with an uppercase letter. If you test this rule in the Profiling perspective of the Studio, you can see that non-liquid amounts are matched by the Amount rule, while the LiquidAmount rule matches only the expected sequence of characters.

The second approach is to use an advanced rule, such as a regular expression, and define a word boundary with \b. You first use a lexer rule to tokenize amounts, matching any word preceded by a numeric value. You then use a regular expression that matches liquid amounts, as follows: a digit, optionally followed by a space, followed by L or ML, and terminated by a word boundary.
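A plausible definition of this advanced rule, where the RegExp type and the exact value are assumptions based on the description above:

Name          Type    Value
LiquidAmount  RegExp  "\\d+\\s?(ML|L)\\b"

With this expression, 32ML and 5 L match, but 7 LUMENS does not, because \b requires a word boundary immediately after L or ML.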

Note that the regular expression is applied to the tokens created by the basic lexer rule.

You cannot check the results of the advanced rule by testing it in the Profiling perspective of the Studio as you do with basic rules. The only way to see the results of advanced rules is to use them in a Job. The results look like the following:
3M PROJECT LAMP 7 LUMENS 32ML
<record>
	<Amount>3M</Amount> 
	<Amount>7 LUMENS</Amount>
	<LiquidAmount>32ML</LiquidAmount> 
	<UNMATCHED> 
		<CAPWORD>PROJECT</CAPWORD> 
		<CAPWORD>LAMP</CAPWORD> 
	</UNMATCHED> 
</record>

For a Job example about the use of the above rules, see Using two parsing levels to extract information from unstructured data.

Comparing the two approaches: the first uses only the ANTLR grammar and may be more efficient than the second, which requires a second parsing pass to check each token against the regular expression. However, regular expressions let users experienced with them create advanced rules that would be hard to build using ANTLR alone.