Rule types - 7.1

Standardization

author
Talend Documentation Team
EnrichVersion
Cloud
7.1
EnrichProdName
Talend Big Data
Talend Big Data Platform
Talend Data Fabric
Talend Data Integration
Talend Data Management Platform
Talend Data Services Platform
Talend ESB
Talend MDM Platform
Talend Open Studio for Big Data
Talend Open Studio for Data Integration
Talend Open Studio for ESB
Talend Open Studio for MDM
Talend Real-Time Big Data Platform
EnrichPlatform
Talend Studio

Two groups of rule types are provided: the basic rule types and the advanced rule types.

  • Basic rule types: Enumeration, Format and Combination. Rules of these types are composed using a given set of ANTLR symbols.

  • Advanced rule types: Regex, Index and Shape. Rules of these types match the tokenized data and standardize it when needed.

Rules of the advanced types are always executed after the ANTLR-specific rules, regardless of the order in which the rules are listed. For further information about basic and advanced rules, see Different rule types for different parsing levels and Using two parsing levels to extract information from unstructured data.
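The ordering behavior described above can be pictured with a minimal Python sketch. This is only an illustration, not Talend's implementation: the `Rule` class and the `execution_order` function are hypothetical names, and keeping the table order within each group is an assumption.

```python
from dataclasses import dataclass

BASIC_TYPES = {"Enumeration", "Format", "Combination"}
ADVANCED_TYPES = {"Regex", "Index", "Shape"}


@dataclass
class Rule:
    """Hypothetical stand-in for one row of the Conversion rules table."""
    name: str
    rule_type: str
    value: str


def execution_order(rules):
    """Return basic rules first, then advanced rules.

    Assumption: within each group, the order of the table is preserved.
    """
    basic = [r for r in rules if r.rule_type in BASIC_TYPES]
    advanced = [r for r in rules if r.rule_type in ADVANCED_TYPES]
    return basic + advanced


# The Regex rule is listed first, but it still runs after the basic rules.
table = [
    Rule("ZipCode", "Regex", r"\d{5}"),
    Rule("LengthUnit", "Enumeration", "'inch' | 'cm'"),
    Rule("Length", "Format", "DECIMAL WHITESPACE LengthUnit"),
]
ordered_names = [r.name for r in execution_order(table)]
print(ordered_names)
```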

To create rules of any type, Talend provides the following pre-defined, case-sensitive elements (ANTLR tokens) for defining the composition of a string to be matched:

  • INT: integer;

  • WORD: word;

  • WORD+: literals of several words;

  • CAPWORD: capitalized word;

  • DECIMAL: decimal float;

  • FRACTION: fraction float;

  • CURRENCY: currencies;

  • ROMAN_NUMERAL: Roman numerals;

  • ALPHANUM: combination of alphabetic and numeric characters;

  • WHITESPACE: whitespace;

  • UNDEFINED: unexpected strings, such as ASCII codes, that no other token can recognize.
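To give a rough idea of how a string breaks down into such elements, here is a small Python sketch. It is not the ANTLR lexer used by the component: the regular expressions are hypothetical approximations of only a few of the tokens listed above, and the real token boundaries may differ (for example, for ALPHANUM strings).

```python
import re

# Hypothetical approximations of a few pre-defined elements.
# Order matters: DECIMAL must be tried before INT, UNDEFINED is the fallback.
TOKEN_PATTERNS = [
    ("DECIMAL", r"\d+\.\d+"),
    ("INT", r"\d+"),
    ("WORD", r"[A-Za-z]+"),
    ("WHITESPACE", r"\s+"),
    ("UNDEFINED", r"."),
]
MASTER = re.compile("|".join(f"(?P<{name}>{pat})" for name, pat in TOKEN_PATTERNS))


def tokenize(text):
    """Split text into (token type, matched text) pairs."""
    return [(m.lastgroup, m.group()) for m in MASTER.finditer(text)]


print(tokenize("12 main street"))
print(tokenize("1.4 cm"))
```

The token names are case-sensitive in rule values, which is why the sketch spells them exactly as in the list above.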

The following three tables successively present detailed information about the basic types, the advanced types and the ANTLR symbols used by the basic rule types. These three tables help you to complete the Conversion rules table in the Basic settings of this component.

For basic rule types:

Basic rule type: Enumeration

Usage: A rule of this type provides a list of possible matches.

Example:

RuleName: LengthUnit

RuleValue: " 'inch' | 'cm' "

Conditions of rule composition:

- Each option must be enclosed in single quotation marks, unless the option is a pre-defined element.

- Options must be separated by the | symbol.

Basic rule type: Format (rule name starts with upper case)

Usage: A rule of this type uses the pre-defined elements along with any user-defined Enumeration, Format or Combination rules to define the composition of a string.

Example:

RuleName: Length

RuleValue: "DECIMAL WHITESPACE LengthUnit"

This rule requires a whitespace between the decimal and the length unit, so it matches strings like 1.4 cm but does not match strings like 1.4cm. To match both cases, define the rule as, for example, "DECIMAL WHITESPACE* LengthUnit".

LengthUnit is an Enumeration rule defining " 'inch' | 'cm' ".

Conditions of rule composition:

- When the name of a Format rule starts with upper case, the rule requires an exact match. This means that you must define every element of the string, even a whitespace.

Basic rule type: Format (rule name starts with lower case)

Usage: A rule of this type is almost the same as a Format rule whose name starts with upper case. The difference is that a Format rule with a lower-case initial does not require an exact match.

Example:

RuleName: length

RuleValue: "DECIMAL LengthUnit"

The rule matches strings like 1.4 cm or 1.4cm, where DECIMAL is one of the pre-defined elements and LengthUnit is an Enumeration rule defining " 'inch' | 'cm' ".

Conditions of rule composition: n/a

Basic rule type: Combination

Usage: A rule of this type is used when you need to create several rules of the same name.

Example:

RuleName: Size (or size)

RuleValue: "length BY length"

The rule matches strings like 1.4 cm by 1.4 cm, where length is a Format rule (starting with lower case) and BY is an Enumeration rule defining " 'By' | 'by' | 'x' | 'X' ".

Conditions of rule composition:

- Literal text or characters are not accepted as part of the rule value. When literal text or characters are needed, you must create an Enumeration rule that defines them and then use this Enumeration rule instead.

- When several Combination rules use the same rule name, they are executed in top-down order in the Conversion rules table of the Basic settings of tStandardizeRow, so arrange them carefully to obtain the best result. For an example, see the following scenario.

Warning:

Any characters or string literals accepted by a rule type must be enclosed in single quotation marks when used. Otherwise, they are treated as ANTLR grammar symbols or variables and generate errors or unexpected results at runtime.
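The matching behavior of the LengthUnit, Length, length and Size examples can be imitated with ordinary regular expressions. The Python sketch below is only an approximation under stated assumptions: the component compiles these rules with ANTLR, not with regular expressions, and the "no exact match required" behavior of lower-case Format rules is modeled here simply as optional whitespace between elements. All constant names are hypothetical.

```python
import re

DECIMAL = r"\d+\.\d+"         # stand-in for the pre-defined DECIMAL element
WHITESPACE = r"\s"            # stand-in for the pre-defined WHITESPACE element
LENGTH_UNIT = r"(?:inch|cm)"  # Enumeration rule LengthUnit: 'inch' | 'cm'
BY = r"(?:By|by|x|X)"         # Enumeration rule BY: 'By' | 'by' | 'x' | 'X'

# Format rule "Length" (upper-case initial): every element, including the
# whitespace, must be present.
LENGTH_EXACT = re.compile(DECIMAL + WHITESPACE + LENGTH_UNIT)

# Format rule "length" (lower-case initial): assumption, whitespace between
# the elements is tolerated but not required.
LENGTH_LOOSE_SRC = DECIMAL + r"\s*" + LENGTH_UNIT
LENGTH_LOOSE = re.compile(LENGTH_LOOSE_SRC)

# Combination rule "Size": length BY length.
SIZE = re.compile(LENGTH_LOOSE_SRC + r"\s*" + BY + r"\s*" + LENGTH_LOOSE_SRC)


def matches(pattern, text):
    """True when the whole string is matched by the pattern."""
    return pattern.fullmatch(text) is not None


print(matches(LENGTH_EXACT, "1.4 cm"), matches(LENGTH_EXACT, "1.4cm"))
print(matches(LENGTH_LOOSE, "1.4 cm"), matches(LENGTH_LOOSE, "1.4cm"))
print(matches(SIZE, "1.4 cm by 1.4 cm"))
```

Note that the literal separators (by, x and so on) are provided through the BY enumeration rather than written directly into the Size pattern, mirroring the condition that Combination rules do not accept literal text.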

For advanced rule types:

Advanced rule type: Regex

Usage: A rule of this type uses regular expressions to match the incoming data tokenized by ANTLR.

Example:

RuleName: ZipCode

RuleValue: "\\d{5}"

The rule matches strings like "92150".

Conditions:

- Regular expressions must be Java compliant.

Advanced rule type: Index

Usage: A rule of this type uses a synonym index as a reference to search for matching incoming data.

For further information about available synonym indexes, see the appendix about data synonym dictionaries in the Talend Studio User Guide.

Example: A scenario is available in Standardizing addresses from unstructured data.

Conditions:

- In Windows, backslashes (\) must be doubled or replaced by slashes (/) if the path is copied from the file system.

- Before the full path to the index, you need to enter the protocol: file://, even if you run the Job in local mode, or hdfs:// if the index is on a cluster.

- When processing a record, a given Index rule matches only the first string identified as matchable.

- In a Talend Map/Reduce Job, you need to compress each synonym index to be used as a zip file. If you use the Talend Oozie scheduler to run that Job, you must also place the zip file in the Hadoop distribution where the Job is run.

Advanced rule type: Shape

Usage: A rule of this type uses pre-defined elements along with the established Regex rules, Index rules, or both, to match the incoming data.

Example:

RuleName: Address

RuleValue: "<INT><WORD><StreetType>"

This rule matches addresses like 12 main street, where INT and WORD are pre-defined tokens (rule elements) and StreetType is an Index rule that you define along with this example rule in the Basic settings view of this component.

For further information about the Shape rule type, see Standardizing addresses from unstructured data.

Conditions:

- Only content enclosed in < > is recognized. Any other content is treated as an error or is omitted.
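The ZipCode and Address examples can be approximated in a few lines of Python. This is a sketch under several assumptions, not the component's logic: a small in-memory set (`STREET_TYPES`) stands in for a real synonym index, tokens are obtained by splitting on whitespace, the Index lookup is tried before the generic WORD test, and text outside < > is silently omitted. The function names are hypothetical. In a Java string literal the same regular expression is written "\\d{5}", as in the rule value above.

```python
import re

# Regex rule ZipCode: exactly five digits.
ZIP_CODE = re.compile(r"\d{5}")

# Hypothetical stand-in for the synonym index behind the StreetType rule.
STREET_TYPES = {"street", "avenue", "road"}


def classify(token):
    """Assign one element name to a whitespace-separated token."""
    if token.isdigit():
        return "INT"
    if token.lower() in STREET_TYPES:  # assumption: Index lookup before WORD
        return "StreetType"
    if token.isalpha():
        return "WORD"
    return "UNDEFINED"


def parse_shape(rule_value):
    """Extract element names from a Shape rule value; only <...> content counts."""
    return re.findall(r"<([^<>]+)>", rule_value)


def matches_shape(rule_value, text):
    """True when the token types of text equal the sequence in the Shape rule."""
    return [classify(t) for t in text.split()] == parse_shape(rule_value)


print(ZIP_CODE.fullmatch("92150") is not None)
print(matches_shape("<INT><WORD><StreetType>", "12 main street"))
```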

For the given ANTLR symbols:

Symbol    Meaning

|         alternative
's'       char or string literal
+         1 or more
*         0 or more
?         optional or semantic predicate
~         match not
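For readers more familiar with regular expressions, most of these symbols have a close regex analog. The Python sketch below pairs each ANTLR form (shown in the comments) with a comparable regular expression. The pairing is an analogy only: ANTLR grammar rules are not regular expressions, and the semantic predicate meaning of ? has no counterpart here. The helper `full` is a hypothetical name.

```python
import re


def full(pattern, text):
    """True when the whole string is matched by the pattern."""
    return re.fullmatch(pattern, text) is not None


# ANTLR: 'a' | 'b'   alternative
print(full(r"a|b", "b"))
# ANTLR: 'a'+        1 or more
print(full(r"a+", "aaa"), full(r"a+", ""))
# ANTLR: 'a'*        0 or more
print(full(r"a*", ""))
# ANTLR: 'a'?        optional
print(full(r"a?", ""), full(r"a?", "aa"))
# ANTLR: ~'a'        match not (any single character except a)
print(full(r"[^a]", "b"), full(r"[^a]", "a"))
```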

Examples of using these symbols are presented in the following scenarios, but you can also find more examples on the following site:

https://theantlrguy.atlassian.net/wiki/display/ANTLR3/ANTLR+Cheat+Sheet.