- Basic rule types: Enumeration, Format and Combination. Rules of these types are composed with some given ANTLR symbols.
- Advanced rule types: Regex, Index and Shape. Rules of these types match the tokenized data and standardize them when needed.
The advanced rule types are always executed after the ANTLR-specific (basic) rules, regardless of the rule order. For further information about basic and advanced rules, see Different rule types for different parsing levels and Using two parsing levels to extract information from unstructured data.
- INT: integer;
- WORD: word;
- WORD+: literals of several words;
- CAPWORD: capitalized word;
- DECIMAL: decimal float;
- FRACTION: fraction float;
- CURRENCY: currencies;
- ROMAN_NUMERAL: Roman numerals;
- ALPHANUM: combination of alphabetic and numeric characters;
- WHITESPACE: whitespace;
- UNDEFINED: unexpected strings, such as ASCII codes, that no other token can recognize.
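To make the token types listed above more concrete, the following Python sketch classifies an input string into a few of them. It is only an illustration: the regular expressions are hypothetical approximations, not the ANTLR grammar used by the component, and CURRENCY, ROMAN_NUMERAL and WORD+ are left out for brevity. The names `TOKEN_PATTERNS` and `tokenize` are invented for this example.

```python
import re

# Hypothetical approximations of some pre-defined elements. The real
# definitions live in the component's ANTLR grammar and may differ.
# Order matters: the first alternative that matches at a position wins.
TOKEN_PATTERNS = [
    ("FRACTION", r"\d+/\d+"),
    ("DECIMAL", r"\d+\.\d+"),
    ("INT", r"\d+"),
    ("CAPWORD", r"[A-Z][a-z]+"),
    ("WORD", r"[a-z]+"),
    ("ALPHANUM", r"[A-Za-z0-9]+"),
    ("WHITESPACE", r"\s+"),
    ("UNDEFINED", r"."),  # catch-all for anything no other token recognizes
]

TOKENIZER = re.compile(
    "|".join(f"(?P<{name}>{pattern})" for name, pattern in TOKEN_PATTERNS)
)


def tokenize(text):
    """Return a list of (token_type, matched_text) pairs."""
    return [(m.lastgroup, m.group()) for m in TOKENIZER.finditer(text)]


if __name__ == "__main__":
    for token in tokenize("12 Main street"):
        print(token)
```

Under these assumed patterns, an address such as 12 Main street is split into an INT, a CAPWORD and a WORD, separated by WHITESPACE tokens.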
The following three tables successively present detailed information about the basic types, the advanced types and the ANTLR symbols used by the basic rule types. These three tables help you to complete the Conversion rules table in the Basic settings of this component.
For basic rule types:
|Basic Rule Type||Usage||Example||Conditions of rule composition|
|Enumeration||A rule of this type provides a list of possible matches.|||Each option must be enclosed in a pair of single quotation marks unless this option is a pre-defined element. Defined options must be separated by the | symbol.|
|Format (Rule name starts with upper case)||A rule of this type uses the pre-defined elements along with any user-defined Enumeration, Format or Combination rules to define the composition of a string.||This rule means that a whitespace between decimal and lengthunit is required, so it matches strings like 1.4 cm but does not match a string like 1.4cm. To match both of these cases, you need to define this rule as, for example,||When the name of a Format rule starts with upper case, this rule requires an exact match. This means that you need to define every single element of a string exactly, even a whitespace.|
|Format (Rule name starts with lower case)||A rule of this type is almost the same as a Format rule whose name starts with upper case. The difference is that a Format rule whose name starts with lower case does not require an exact match.||
The rule matches strings like 1.4 cm or 1.4cm etc. where the
|Combination||A rule of this type is used when you need to create several rules of the same name.||
The rule matches strings like 1.4 cm by 1.4 cm, where
|Literal texts or characters are not accepted as part of the rule value. When literal texts or characters are needed, you must create an Enumeration rule to define these texts or characters and then use this Enumeration rule instead. When several Combination rules use an identical rule name, they are executed in top-down order in the Conversion rules table of the Basic settings of tStandardizeRow, so arrange them in the proper order to obtain the best result. For an example, see the following scenario.|
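The difference between the two kinds of Format rules described above can be sketched in Python as follows. This is a simplified model under stated assumptions, not the component's implementation: a rule is represented as a hypothetical list of token types, WORD stands in for the lengthunit element, and the lenient (lower-case) behavior is modeled by ignoring WHITESPACE tokens on both sides. The names `token_types` and `matches_format` are invented for this example.

```python
import re

# Minimal, assumed tokenizer: just enough token types for the 1.4 cm example.
TOKEN = re.compile(
    r"(?P<DECIMAL>\d+\.\d+)|(?P<WORD>[a-z]+)|(?P<WHITESPACE>\s+)|(?P<UNDEFINED>.)"
)


def token_types(text):
    """Return the sequence of token types found in the text."""
    return [m.lastgroup for m in TOKEN.finditer(text)]


def matches_format(text, rule, exact):
    """Compare the token types of text against a rule.

    exact=True models a Format rule whose name starts with upper case:
    every element, including whitespace, must be defined in the rule.
    exact=False models the lower-case variant by dropping WHITESPACE
    tokens before comparing (an assumption made for illustration).
    """
    types = token_types(text)
    if not exact:
        types = [t for t in types if t != "WHITESPACE"]
        rule = [t for t in rule if t != "WHITESPACE"]
    return types == rule


if __name__ == "__main__":
    length_rule = ["DECIMAL", "WHITESPACE", "WORD"]  # hypothetical rule
    for sample in ("1.4 cm", "1.4cm"):
        print(sample, matches_format(sample, length_rule, exact=True),
              matches_format(sample, length_rule, exact=False))
```

In this model, the exact rule accepts 1.4 cm and rejects 1.4cm, while the lenient rule accepts both, which mirrors the behavior described in the table.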
For advanced rule types:
|Advanced Rule Type||Usage||Example||Conditions|
|Regex||A rule of this type uses regular expressions to match the incoming data tokenized by ANTLR.||The rule matches strings like "92150"||Regular expressions must be Java compliant.|
|Index||A rule of this type uses a synonym index as a reference to search for the matched incoming data. For further information about available synonym indexes, see the appendix about data synonym dictionaries in the Talend Studio User Guide.|
|A scenario is available in Standardizing addresses from unstructured data.||On Windows, the backslashes
If you run the Job using Spark Local mode or if you run the Job locally, the path to the index folder must start with
When processing a record, a given Index rule matches only the first string identified as matchable.
In a Talend Map/Reduce Job, you need to compress each synonym index to be used into a zip file.
|Shape||A rule of this type uses pre-defined elements along with the established Regex or Index rules or both to match the incoming data.||
This rule matches addresses such as 12 main street, where INT and WORD are pre-defined tokens (rule elements) and StreetType is an Index rule that you define along with this example rule in the Basic settings view of this component.
For further information about the Shape rule type, see Standardizing addresses from unstructured data.
|Only the contents put in
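As an illustration of the Regex rule type described in the table above, the following Python sketch applies a regular expression to data that has already been tokenized. The five-digit pattern is a hypothetical example chosen because it matches a string like "92150"; it is not necessarily the rule shown in the original table. The pattern uses only syntax that behaves the same way in Java and Python, since the component itself requires Java-compliant expressions. The names `FIVE_DIGITS` and `regex_rule_matches` are invented for this example.

```python
import re

# Hypothetical rule value: exactly five ASCII digits.
# [0-9] is used instead of \d so the pattern behaves identically in Java.
FIVE_DIGITS = re.compile(r"[0-9]{5}")


def regex_rule_matches(tokens, pattern):
    """Return the tokens that the pattern matches in full."""
    return [token for token in tokens if pattern.fullmatch(token)]


if __name__ == "__main__":
    tokens = ["92150", "Suresnes", "9215"]
    print(regex_rule_matches(tokens, FIVE_DIGITS))
```

Matching whole tokens rather than substrings is a design choice made for this sketch; it keeps a four-digit or six-digit number from being reported as a match.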
For the given ANTLR symbols:
|'...'||char or string literal|
|+||1 or more|
|*||0 or more|
|?||optional or semantic predicate|
For more information about ANTLR symbols, see: https://theantlrguy.atlassian.net/wiki/display/ANTLR3/ANTLR+Cheat+Sheet.