tStandardizeRow - 6.3

Talend Components Reference Guide

EnrichVersion
6.3
EnrichProdName
Talend Big Data
Talend Big Data Platform
Talend Data Fabric
Talend Data Integration
Talend Data Management Platform
Talend Data Services Platform
Talend ESB
Talend MDM Platform
Talend Open Studio for Big Data
Talend Open Studio for Data Integration
Talend Open Studio for Data Quality
Talend Open Studio for ESB
Talend Open Studio for MDM
Talend Real-Time Big Data Platform
task
Data Governance
Data Quality and Preparation
Design and Development
EnrichPlatform
Talend Studio

Function

tStandardizeRow tokenizes the data flow it has received from the preceding component and applies user-defined parser rules to analyze the data. Based on this analysis, this component normalizes and writes analyzed data in a separate data flow and tags them using the user-defined rule names.

The standardization option adds a supplementary column to the output flow where the normalized data are then standardized.

The Java library ANTLR is used to parse and tokenize the incoming data. For further information about ANTLR, see the site http://www.antlr.org/

Note

This component does not make any changes to your raw data.

Purpose

tStandardizeRow normalizes the incoming data in a separate XML or JSON data flow. This allows you to separate the rule-compliant data from the non-compliant data, and even to standardize the compliant data if need be.

Depending on the Talend solution you are using, this component can be used in one, some or all of the following Job frameworks:

Rule types

Two groups of rule types are provided: the basic rule types and the advanced rule types.

  • Basic rule types: Enumeration, Format and Combination. Rules of these types are composed using given ANTLR symbols.

  • Advanced rule types: Regex, Index and Shape. Rules of these types match the tokenized data and standardize it when needed.

The advanced rule types are always executed after the ANTLR-specific rules, regardless of rule order. For further information about basic and advanced rules, see Different rule types for different parsing levels and Scenario 4: Using two parsing levels to extract information from unstructured data.

To create rules of any type, Talend provides the following pre-defined, case-sensitive elements (ANTLR tokens) for defining the composition of a string to be matched:

  • INT: integer;

  • WORD: word;

  • WORD+: literals of several words;

  • CAPWORD: capitalized word;

  • DECIMAL: decimal float;

  • FRACTION: fraction float;

  • CURRENCY: currencies;

  • ROMAN_NUMERAL: Roman numerals;

  • ALPHANUM: combination of alphabetic and numeric characters;

  • WHITESPACE: whitespace;

  • UNDEFINED: unexpected strings such as ASCII codes that any other token cannot recognize.
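For example, with these tokens the string 26 lbs can be described as INT WHITESPACE WORD, and 4.5 inch as DECIMAL WHITESPACE WORD.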

The following three tables successively present detailed information about the basic types, the advanced types and the ANTLR symbols used by the basic rule types. These three tables help you to complete the Conversion rules table in the Basic settings of this component.

For basic rule types:

Enumeration

Usage: A rule of this type provides a list of possible matches.

Example:

RuleName: LengthUnit

RuleValue: " 'inch' | 'cm' "

Conditions of rule composition:

- Each option must be put in a pair of single quotation marks unless this option is a pre-defined element.

- Defined options must be separated by the | symbol.

Format (rule name starts with upper case)

Usage: A rule of this type uses the pre-defined elements, along with any user-defined Enumeration, Format or Combination rules, to define the composition of a string.

Example:

RuleName: Length

RuleValue: "DECIMAL WHITESPACE LengthUnit"

This rule requires a whitespace between the decimal and the length unit, so it matches strings like 1.4 cm but does not match strings like 1.4cm. To match both of these cases, you need to define the rule as, for example, "DECIMAL WHITESPACE* LengthUnit".

LengthUnit is an Enumeration rule defined as " 'inch' | 'cm' ".

Conditions of rule composition:

- When the name of a Format rule starts with upper case, the rule requires an exact match: you must define every single element of the string, even a whitespace.

Format (rule name starts with lower case)

Usage: A rule of this type is almost the same as a Format rule whose name starts with upper case. The difference is that a Format rule with a lower-case initial does not require an exact match.

Example:

RuleName: length

RuleValue: "DECIMAL LengthUnit"

The rule matches strings like 1.4 cm or 1.4cm, where DECIMAL is one of the pre-defined element types and LengthUnit is an Enumeration rule defined as " 'inch' | 'cm' ".

Conditions of rule composition: n/a

Combination

Usage: A rule of this type is used when you need to create several rules of the same name.

Example:

RuleName: Size (or size)

RuleValue: "length BY length"

The rule matches strings like 1.4 cm by 1.4 cm, where length is a Format rule (starting with lower case) and BY is an Enumeration rule defined as " 'By' | 'by' | 'x' | 'X' ".

Conditions of rule composition:

- Literal texts or characters are not accepted as part of the rule value. When literal texts or characters are needed, you must create an Enumeration rule to define them and then use that Enumeration rule instead.

- When several Combination rules use the identical rule name, they are executed in top-down order in the Conversion rules table of the Basic settings of tStandardizeRow, so arrange them properly to obtain the best result. For an example, see the following scenario.

Warning

Any characters or string literals, if accepted by a rule type, must be put in single quotation marks when used, otherwise they will be treated as ANTLR grammar symbols or variables and generate errors or unexpected results at runtime.
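For instance, here is a minimal sketch of a correct and an incorrect Enumeration definition (the rule name BY is only an illustration):

    "BY"    "Enumeration"    " 'x' | 'by' "     correct: the literals are quoted
    "BY"    "Enumeration"    " x | by "         incorrect: x and by are treated as ANTLR variables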

For advanced rule types:

Regex

Usage: A rule of this type uses regular expressions to match the incoming data tokenized by ANTLR.

Example:

RuleName: ZipCode

RuleValue: "\\d{5}"

The rule matches strings like "92150".

Conditions: Regular expressions must be Java compliant.

Index

Usage: A rule of this type uses a synonym index as a reference to search for matching incoming data.

For further information about available synonym indexes, see the appendix about data synonym dictionaries in the Talend Studio User Guide. A scenario is available in Scenario 2: Standardizing addresses from unstructured data.

Conditions:

- On Windows, the backslashes \ in the index path need to be doubled or replaced by slashes / if the path is copied from the file system.

- Before the full path to the index, you need to enter the protocol: file:// even if you run the Job in local mode, or hdfs:// if the index is on a cluster.

- When processing a record, a given Index rule matches only the first string identified as matchable.

- In a Talend Map/Reduce Job, you need to compress each synonym index to be used as a zip file; moreover, if you use the Talend Oozie scheduler to run that Job, you have to place the zip file in the Hadoop distribution where the Job is run.

Shape

Usage: A rule of this type uses pre-defined elements along with established Regex or Index rules, or both, to match the incoming data.

Example:

RuleName: Address

RuleValue: "<INT><WORD><StreetType>"

This rule matches addresses like 12 main street, where INT and WORD are pre-defined tokens (rule elements) and StreetType is an Index rule that you define along with this example rule in the Basic settings view of this component.

For further information about the Shape rule type, see Scenario 2: Standardizing addresses from unstructured data.

Conditions: Only the contents put between < > are recognized. Any other contents are considered errors or are omitted.

For the given ANTLR symbols:

    Symbol    Meaning
    |         alternative
    's'       char or string literal
    +         1 or more
    *         0 or more
    ?         optional or semantic predicate
    ~         match not

Examples of using these symbols are presented in the following scenarios, but you can also find more examples on the following site:

https://theantlrguy.atlassian.net/wiki/display/ANTLR3/ANTLR+Cheat+Sheet.

Search modes for Index rules

Index rules are one type of advanced rule used by the tStandardizeRow component. They use synonym indexes as a reference to search for matching data.

Without the ability to specify what type of match (exact, partial, fuzzy, etc.) to apply to the input flow, an Index rule may not standardize and output the data you expect. tStandardizeRow therefore allows you to select one of the following search modes for each Index rule you define in the component:

Match all: each word of the input string must exist in the index string, but the index string may contain other words too.

Match all fuzzy: each word of the input string must match similar words of the index string.

Match any: the input string should have at least one word that matches a word of the index string.

Match any fuzzy: the input string should have at least one word that is similar to a word in the index string.

Match exact: the exact input string should match the exact index string.

Match partial: each word of the input string must exist in the index string, but the input string may contain other words too, up to a given limit (1 by default). This means that one word of the input string may not match any word of the index string.

Suppose, for example, that you have the below record in the input flow:

DULUX PAINTPOD EXTRA REACH HANDLE

And you have created a color index that has the Extra Deep Base string.

If you define an Index rule in tStandardizeRow and set its search mode to Match any, the component will return Extra Deep Base as a color for the above record, because the word EXTRA matches a word of the index string. But if you want the component to return a match only when the exact search string is found in the index, set the search mode of the rule to Match exact, and the component will not return a color for the record.
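For illustration, such an Index rule could be configured in the Conversion rules table roughly as follows (the rule name and index path are hypothetical):

    Name       Type       Value                                     Search mode
    "Color"    "Index"    "file:///C:/talend/indexes/colorIndex"    Match exact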

For a Job example, see Scenario 3: Extracting exact match by using Index rules.

Different rule types for different parsing levels

The tStandardizeRow component uses basic rules, which are based on ANTLR grammar, and advanced rules, which are defined by Talend and are not based on ANTLR.

Sometimes, ANTLR rules alone cannot meet all your expectations when normalizing and standardizing data. Suppose, for example, that you want to extract the liquid amount in the following three records:

3M PROJECT LAMP 7 LUMENS 32ML
A 5 LUMINES 5 LOW VANILLA 5L 5LIGHT 5 L DULUX L
54MLP FAC 32 ML

You may start by defining a liquid unit and a liquid amount in basic parser rules, as follows:
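A minimal sketch of such rules, with illustrative values, could be:

    Name              Type             Value
    "LiquidUnit"      "Enumeration"    " 'L' | 'ML' "
    "LiquidAmount"    "Format"         "INT WHITESPACE* LiquidUnit"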

If you test these rules in the Profiling perspective of the Studio, you can see that they extract 7 L from 7 LUMENS, which is not what you expect: you do not want the word LUMENS to be split into two tokens.

The basic rules you have defined above are ANTLR lexer rules, and lexer rules are used for tokenizing the input string. ANTLR does not provide a word boundary symbol like the \b used in regular expressions. You must therefore be careful when choosing lexer rules, because they define how the input strings are split into tokens.

You can solve such a problem using two approaches:

The first approach is to define another basic rule that matches a word with a numeric value in front of it, the Amount rule in this example:
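A sketch of such a rule, again with an illustrative value, could be:

    Name        Type        Value
    "Amount"    "Format"    "INT WHITESPACE* WORD"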

This basic rule is a lexer rule, a Format rule whose name starts with an upper-case letter. If you test this rule in the Profiling perspective of the Studio, you can see that non-liquid amounts are matched by this rule, while the LiquidAmount rule matches only the expected sequence of characters.

The second approach is to use an advanced rule, such as a regular expression, and define a word boundary with \b. You can use a lexer rule to tokenize amounts, matching any word with a numeric value in front of it, and then use a regular expression that matches liquid amounts as follows: digits optionally followed by a space, followed by L or ML, and terminated by a word boundary.
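A sketch of such a Regex rule, with an illustrative value, could be:

    Name              Type       Value
    "LiquidAmount"    "Regex"    "\\d+\\s?(L|ML)\\b"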

Note that the regular expression will be applied on the tokens created by the basic lexer rule.

You cannot check the results of advanced rules by testing them in the Profiling perspective of the Studio as you do with basic rules; the only way to see the results of advanced rules is to use them in a Job. The results will look like the following:

3M PROJECT LAMP 7 LUMENS 32ML
<record>
	<Amount>3M</Amount> 
	<Amount>7 LUMENS</Amount>
	<LiquidAmount>32ML</LiquidAmount> 
	<UNMATCHED> 
		<CAPWORD>PROJECT</CAPWORD> 
		<CAPWORD>LAMP</CAPWORD> 
	</UNMATCHED> 
</record>

For a Job example about the use of the above rules, see Scenario 4: Using two parsing levels to extract information from unstructured data.

Comparing these two approaches, the first uses only ANTLR grammar and may be more efficient than the second, which requires a second parsing pass to check each token against the regular expression. But regular expressions let users experienced in them create more advanced rules that could hardly be created using ANTLR alone.

tStandardizeRow properties

Component family

Data quality

Basic settings

Schema and Edit schema

A schema is a row description. It defines the number of fields (columns) to be processed and passed on to the next component. The schema is either Built-In or stored remotely in the Repository.

Built-In: You create and store the schema locally for this component only. Related topic: see Talend Studio User Guide.

Repository: You have already created the schema and stored it in the Repository. You can reuse it in various projects and Job designs. Related topic: see Talend Studio User Guide.

Column to parse

Select the column to be parsed from the received data flow.

Standardize this field

Select this check box to standardize the rule-compliant data identified, that is, to replace the duplicates of the identified data with the corresponding standardized data from a given index.

For further information about this index providing standardized data, see tSynonymOutput.

Every time you select or clear this check box, the schema of this component is changed automatically, so in a given Job, you need to click the activated Sync columns button to fix the inconsistencies in the schema.

Generate parser code in Routines

Click this button to enable the data parser of your Studio to recognize the rules defined in the Conversion rules table.

In a given Job, this operation is required whenever a rule is created; for an existing rule that you have modified, it is required only when the modified rule is of type Enumeration, Format or Combination. For further information about all of the rule types, see Rule types.

Import / export

Click the import or export button to exchange a given standardization rule set with the DQ Repository.

- When you click the export button, the Studio switches to the Profiling perspective and the Parser rule Settings view opens in the workspace with the relevant contents filled in automatically. If need be, you can then edit the exported rule set and save it to the Libraries > Rules > Parser folder in the DQ Repository tree view.

- When you click the import button, an import wizard opens to help you import the standardization rule of interest.

For further information, see Talend Studio User Guide.

Conversion rules

Define the rules you need to apply as follows:

- In the Name column, type in the name of the rule you want to use. This name is used as the XML tag or the JSON attribute name and as the token name to label the incoming data identified by this rule.

- In the Type column, select the type of the rule you need to apply. For further information about available rule types, see Rule types.

- In the Value column, type in the syntax of the rule.

- In the Search mode column, select a search mode from the list. The search modes can be used only with the Index rule type. For further information about available search modes, see Search modes for Index rules.

A test view is provided to help you create the parser rules of interest. For further information, see Talend Studio User Guide.

Advanced settings

Advanced options for INDEX rules

- Search UNDEFINED fields: select this check box if you want the component to search for undefined tokens in the index run results.

- Word distance for partial match (available for the Match partial mode): set the maximum number of words allowed to come inside a sequence of words that may be found in the index; the default value is 1.

- Max edits for fuzzy match (based on the Levenshtein algorithm and available for the fuzzy modes): select an edit distance, 1 or 2, from the list. Any terms within the edit distance from the input data are matched. With a maximum edit distance of 2, for example, you can have up to two insertions, deletions or substitutions. The score for each match is based on the edit distance of that term.

Fuzzy matching performance improves considerably with Max edits for fuzzy match.
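For example, with a maximum edit distance of 1, the input term Kgs matches the index term Kg (one deletion), whereas pnds does not match pounds (two insertions would be needed).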

Note

Jobs migrated in the Studio from older releases run correctly, but results might be slightly different because Max edits for fuzzy match is now used in place of Minimum similarity for fuzzy match.

Output format

- XML: this option is selected by default. It outputs normalized data in XML format.

- JSON: select this option to output normalized data in JSON format.

- Pretty print: this option is selected by default. It writes the output on several rows. If you clear this check box, the output is written on one row.

tStatCatcher Statistics

Select this check box to collect log data at the component level.

Global Variables

ERROR_MESSAGE: the error message generated by the component when an error occurs. This is an After variable and it returns a string. This variable functions only if the Die on error check box is cleared, if the component has this check box.

A Flow variable functions during the execution of a component while an After variable functions after the execution of the component.

To fill in a field or expression with a variable, press Ctrl + Space to access the variable list and choose the variable to use from it.

For further information about variables, see Talend Studio User Guide.

Usage

This component is an intermediary step. It requires an input flow as well as an output flow.

This component generates Java code in the Routines node of the Repository tree view. This code implements the grammar defined in rules. It is automatically compiled to be used by the component.

When you modify, add, or delete a rule, the code must be generated again so that the Job takes the modifications into account and remains runnable.

To generate it, click the Generate parser code in Routines button.

Connections

Outgoing links (from this component to another):

Row: Main; Reject

Trigger: Run if; On Component Ok; On Component Error.

Incoming links (from one component to this one):

Row: Main; Reject

For further information regarding connections, see Talend Studio User Guide.

Limitation

n/a

Scenario 1: Normalizing data using rules of basic types

In this scenario, two steps are performed to:

  1. normalize the incoming data (separate the compliant data from the non-compliant data), and

  2. extract the data of interest and display it.

Before replicating these two steps, we need to analyze the source data in order to figure out what rules need to be composed. For this scenario, the source data is stored in a .csv file called partsmaster.

There are 59 rows of raw data in total, but only some of them are shown in this capture.

Through observation, you can expect that the third row will not be recognized, as it contains Asian characters. Furthermore, you can figure out that:

  • the SKU data contains values such as 34-9923, XC-3211 and pb710125. So the rule used to parse the SKU data could be:

    Name     Type        Value
    "SKU"    "Format"    "(DIGIT DIGIT|LETTER LETTER) '-'? DIGIT DIGIT DIGIT DIGIT (DIGIT DIGIT?)? "

  • for the Size data, the correct format is the multiplication of two or three lengths, each followed by its length unit. Therefore, the rules used to parse the Size data could be:

    Name            Type             Value
    "LengthUnit"    "Enumeration"    " 'm' | '\'' | 'inch' | 'inches' | '\"'"
    "BY"            "Enumeration"    "'X' | 'x' | 'by' "
    "Length"        "Format"         "(INT | FRACTION | DECIMAL) WHITESPACE* LengthUnit "
    "Size"          "Combination"    "Length BY Length BY Length"
    "Size"          "Combination"    "Length BY Length"

The two Combination rules use the same name, in which case they are executed in the top-down order presented in this table.

  • for the Weight data, the correct format is the weight plus the weight unit. Therefore, the rules used to parse the Weight data are:

    Name            Type             Value
    "WeightUnit"    "Enumeration"    " 'lb' | 'lbs' | 'pounds' | 'Kg' | 'pinds'"
    "Weight"        "Format"         "(INT | FRACTION | DECIMAL) WHITESPACE* WeightUnit "

Now, you can begin to replicate the two steps of this scenario.

Separating the compliant data and the non-compliant data

In this first step, four components are used. They are:

  • tFileInputDelimited (partsmaster): this component loads and outputs the source data to the component that follows.

  • tStandardizeRow: this component normalizes the incoming data and rejects the exception data that it cannot recognize.

  • tLogRow: this component is used twice, to display respectively the exception data and the output XML data.

To replicate this step, proceed as the following sections illustrate.

Dropping and linking the components

To replicate this step, proceed as follows:

  1. Set up the file delimited schema for the partsmaster.csv file in the Repository of your Studio. For further information about how to set up a file delimited schema, see Talend Studio User Guide.

  2. Under the Metadata node of the Repository of your Studio, select the file delimited schema for partsmaster.csv and drop it on the design workspace.

  3. From the dialog box that pops up, select tFileInputDelimited.

    Then the tFileInputDelimited (partsmaster) component displays in the design workspace.

  4. From the Palette, drop the tStandardizeRow component and two tLogRow components into the design workspace.

  5. Right-click tFileInputDelimited (partsmaster) to open the contextual menu.

  6. From the contextual menu, select Row > Main link to connect this component to tStandardizeRow.

  7. Accept the schema propagation prompted by the pop-up dialog box.

  8. From tStandardizeRow, do the same to connect this component to one of the two tLogRow components using Row > Main link and to the other using Row > Reject link and accept the schema propagation.

    You can change the name displayed for each of these components, as has been done for one of the tLogRow components, named Exception in this scenario. For further information, see Talend Studio User Guide.

Then you can continue to configure each component to be used.

Configuring the process of normalizing rows

To do this, proceed as follows:

  1. Double-click tStandardizeRow to open its Component view.

  2. In the Column to parse field, select SKU_Description_Size_Weight. This is the only column that the incoming schema has.

  3. Under the Conversion rules table, click the plus button eight times to add eight rows in this table.

  4. To complete these rows, type in the rules you have figured out when analyzing the raw data at the beginning of this scenario.

    The two Size rules are executed in top-down order. In this example, this order allows the component to first match the sizes composed of three lengths and then those composed of two lengths. If you reverse this order, the component will match the first two lengths of every size and then treat the last length of the three-length sizes as unmatched.

  5. Click the Generate parser code in Routines button.

  6. In the Advanced settings view, leave the options selected by default in the Output format area as they are.

    The Max edits for fuzzy match is set to 1 by default.

Executing the normalization Job

Press F6 to run this Job.

In the console of the Run view, you can read the output data structure. Each instance of the XML data is written on a separate row because the Pretty print check box is selected in the Advanced settings view of the tStandardizeRow component.

For raw data such as 34-9923, Monolithic Membrance, 4' x 8', 26 lbs, an XML segment is generated in which the compliant data are separated from the non-compliant (unmatched) data. Thus the raw data are normalized.
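A sketch of what this segment may look like (illustrative; the exact token details depend on your rules, and separators are omitted):

<record>
	<SKU>34-9923</SKU>
	<UNMATCHED>
		<CAPWORD>Monolithic</CAPWORD>
		<CAPWORD>Membrance</CAPWORD>
	</UNMATCHED>
	<Size>4' x 8'</Size>
	<Weight>26 lbs</Weight>
</record>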

Again from this console, you can read that a row is rejected as it contains unrecognizable data.

Extracting data of interest from the normalized XML data

This section presents the second step of this scenario. Four components are added to receive the normalized XML data in place of the tLogRow component that was used for this purpose in the first step. The four components are:

  • tExtractXMLField: this component reads an input XML field and extracts desired data;

  • tFilterRow: this component filters input rows by setting conditions on the selected columns;

  • tLogRow (used twice): these two components receive and display, respectively, the data that is valid or invalid against the conditions set in tFilterRow.

To replicate the second step, proceed as the following sections illustrate.

Dropping and linking the additional components

To do this, proceed as follows:

  1. On the design workspace, delete the tLogRow component that is connected to tStandardizeRow with Main row link.

  2. From the Palette, drop tExtractXMLField, tFilterRow and two tLogRow components into the design workspace. In this scenario, the two tLogRow components are named respectively ValidSize and InvalidSize.

    For further information about how to rename a component, see Talend Studio User Guide.

  3. Connect tExtractXMLField to tFilterRow with Main link as you did in the first step to connect partsmaster to tStandardizeRow.

  4. Do the same but with Filter link to connect tFilterRow to ValidSize.

  5. Connect tFilterRow to InvalidSize with Reject link as you did in the first step to connect tStandardizeRow to Exception.

Configuring the process of extracting the XML fields of interest

To do this, proceed as follows:

  1. Double-click tExtractXMLField to open its Component view.

  2. Next to Edit schema, click the three-dot button to open the schema editor.