Testing a set of parser rules

Talend Platform for Enterprise Integration Studio User Guide

Product: Talend Platform for Enterprise Integration
Version: 5.6
Platform: Talend Studio
Task: Design and Development; Data Quality and Preparation

After you create or import a set of parser rules, and before you use it on real data, you may need to test the rule set to validate and improve it.

The Profiling perspective of the studio provides a test tool that brings the whole test process together in a single view.

From this view, you can perform the following operations:

  • Select the rule you need to test.

  • Create a data sample against which the test is performed.

  • Edit the rules if need be.

  • Analyze and discover the elements used to compose a rule against the sample data.

  • Save and reuse the sample of interest.

  • Save the test result of interest.

  • Improve the tested rules.

  • Create a new set of rules after necessary improvements.

How to access the rule test view

Prerequisite(s): You have opened the rule set you want to test in the Parser Rule Settings editor.

  • Click the rule test button under the Parser Rules table in the Parser Rule view.

    The test view is displayed with the set of parser rules entered automatically.

    Note

    If any of the parser rules to be tested does not comply with the ANTLR parser grammar, an error message is displayed and the test view is not displayed.

    The figure below is one example of the test view:

    Note

    The test process is dedicated to testing rules of the Enumeration, Format or Combination type, which use ANTLR grammar. For further information about the rule types and about how to read the rules used in this example, see the Talend Components Reference Guide.

The Interpreter tab and the Grammar tab at the bottom of this test view give access to their corresponding views. The Grammar view is read-only and lets you check the ANTLR grammar that your parser rules use. Use the Interpreter view to perform any testing operation.

In the Interpreter view, you can see the following:

An element list.

This list presents all of the rule elements available to a rule set, although not every element is necessarily used by the set of rules. The elements comprise the pre-defined ANTLR elements and the user-defined elements, the latter being generated automatically from the names of the rules to be tested. In this example, the length, weight and SKU elements, to name a few, are all user-defined and have corresponding rules in the rule set to be tested.

You may notice that a lower-case element often has an equivalent upper-case element listed in this area. This is mainly because the ANTLR parser requires lower-case names and the ANTLR lexer requires upper-case names. For further information about ANTLR lexer and parser rules, go to:

https://theantlrguy.atlassian.net/wiki/display/ANTLR3/Quick+Starter+on+Parser+Grammars+-+No+Past+Experience+Required

However, the upper-case Format rule requires an exact match and the lower-case Format rule does not. Therefore, when you name a Format rule using upper-case letters, the equivalent lower-case element is generated, while the reverse is not true.

For further information about how to use the ANTLR elements pre-defined within Talend Studio, see the tStandardizeRow component in the Talend Components Reference Guide; for further information about ANTLR itself, check the ANTLR website.
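
The ANTLR naming convention mentioned above can be illustrated with a short Python sketch. It is only an illustration, not Talend Studio code: the function name antlr_rule_kind is hypothetical, and the sketch relies solely on the ANTLR convention that lexer rule names begin with an upper-case letter and parser rule names begin with a lower-case letter.

```python
def antlr_rule_kind(name: str) -> str:
    """Classify a rule name by the case of its first character.

    Hypothetical helper: it mirrors the ANTLR naming convention only and
    does not reproduce how Talend Studio generates its elements.
    """
    if not name:
        raise ValueError("rule name must not be empty")
    first = name[0]
    if first.isupper():
        return "lexer"
    if first.islower():
        return "parser"
    raise ValueError(f"rule name must start with a letter: {name!r}")


# Element names taken from the example rule set in this guide.
for rule_name in ("SKU", "sku", "LengthUnit", "length"):
    print(rule_name, "->", antlr_rule_kind(rule_name))
```

Under this convention, SKU and LengthUnit read as lexer-style names, while sku and length read as parser-style names.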

Each element is treated as a unit you can test. The rule element at the beginning of the list represents the whole parser rule set, so to test all of the rules contained in a set, use this element.

The top Rule field is a filter tool where you can type in the name of the element you need to test.

Note

The pre-defined elements are not all displayed in this example; when you have created your own parser rules, this list may be different from the screenshot.

A data sample.

In this box, enter the data sample against which you test the set of rules of interest. Each sample should be representative of one of the data variants you need to standardize using the rules to be tested.

To run a test, click the play button.

To clear the existing sample data before entering new data, click the corresponding button in this area.

To save the current sample data from this area, click the save button and type in a name for this sample in the [Save test case] dialog box.

A rule set.

This table is filled automatically with the rules to be tested. You can edit these rules in this area according to the analysis of the test result.

The toolbar provided is similar to the one provided with the Parser Rules table for modifying rules in the Parser rule settings editor. For further information, see Modifying an established parser rule set.

Once the rule set is improved, you may need to save it or create another rule set from the improved one.

To save it:

  • Click the save button in this area.

    Note

    Each time you click this button to save a set of rules, the parser code and the test view are automatically refreshed, so the data sample area and the test result area are emptied. If need be, save the current data sample and the test result before clicking this button.

To create another rule set:

  • Click the Create Rule button and enter the related information in the [New Parser Rule] dialog box. Once done, the Parser rule settings editor is displayed automatically.

A test-case list.

This area lists all the saved data samples (test cases). Click a test case to reuse it.

The Test field is a filter tool for finding the sample data of interest in the list.

The graphic view of the test result.

Note

In the diagram presented in this figure, the basic node represents the rule type of the pre-defined elements. The word and integer nodes are among the pre-defined ANTLR elements. For further information, see the tStandardizeRow component in the Talend Components Reference Guide.

This area presents a diagram of the data-element relations. For example, the sku element corresponds to 34-9923 and the integer element corresponds to 6125.

The diagram is generated once a test is done. From this diagram, you can read the mapping between each unit of the sample data and the corresponding element.

When an element has no corresponding unit in the sample data, the related error is listed in the Problems view of this area.

Note

The test uses all of the available elements from the element list area to match the units in the data sample. However, the name of the Combination rule and the name of the upper-case Format rule are not displayed in this diagram.

How to test a rule set

This section uses an example to show in detail how to test a set of parser rules.

In this example, the rules to be tested are as follows:

Name            Type            Value
"SKU"           "Format"        "(DIGIT DIGIT|LETTER LETTER) '-'? DIGIT DIGIT DIGIT DIGIT (DIGIT DIGIT?)? "
"LengthUnit"    "Enumeration"   " 'm' | '\'' | 'inch' | 'inches' | '\"'"
"by"            "Enumeration"   "'X' | 'x' | 'by' "
"length"        "Format"        "(INT | FRACTION | DECIMAL) LengthUnit "
"Size"          "Combination"   "length by length"
"WeightUnit"    "Enumeration"   " 'lb' | 'lbs' | 'pounds' | 'Kg' | 'pinds'"
"weight"        "Format"        "(INT | FRACTION | DECIMAL) WeightUnit "

Prerequisite(s): You must know how to create a set of parser rules and how to access the corresponding test view in the main window of your studio. For further information, see Creating a set of parser rules and How to access the rule test view.
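
As a rough aid to reading the rule table above, the following Python sketch approximates the rules with regular expressions and reports which rule matches a given piece of data. It is not how Talend Studio or ANTLR evaluates the rules. The regex stand-ins for DIGIT, LETTER, INT, FRACTION and DECIMAL, the optional whitespace between rule parts, and the helper names alt and matching_rules are all assumptions made for illustration.

```python
import re


def alt(*literals):
    """Build a regex alternation from literal strings, escaping each one."""
    return "(?:" + "|".join(re.escape(s) for s in literals) + ")"


# Assumed stand-in for INT | FRACTION | DECIMAL (not Talend's own definitions).
NUMBER = r"(?:\d+\.\d+|\d+/\d+|\d+)"

RULES = {
    # (DIGIT DIGIT|LETTER LETTER) '-'? DIGIT DIGIT DIGIT DIGIT (DIGIT DIGIT?)?
    # Assumption: DIGIT is one decimal digit, LETTER is one ASCII letter.
    "SKU": r"(?:\d\d|[A-Za-z]{2})-?\d{4}(?:\d\d?)?",
    "LengthUnit": alt("m", "'", "inch", "inches", '"'),
    "by": alt("X", "x", "by"),
    "WeightUnit": alt("lb", "lbs", "pounds", "Kg", "pinds"),
}
# Assumption: optional whitespace is allowed between the parts of a rule.
RULES["length"] = NUMBER + r"\s*" + RULES["LengthUnit"]
RULES["weight"] = NUMBER + r"\s*" + RULES["WeightUnit"]
RULES["Size"] = r"\s*".join([RULES["length"], RULES["by"], RULES["length"]])


def matching_rules(text):
    """Return the names of the rules whose pattern matches the whole text."""
    return [name for name, pattern in RULES.items() if re.fullmatch(pattern, text)]


# Units taken from the data used in this guide.
for unit in ("34-9923", "26 lbs", "four by eight sheet"):
    print(repr(unit), "->", matching_rules(unit))
```

In this approximation, 34-9923 is matched by SKU and 26 lbs by weight, while four by eight sheet is matched by no rule, because the length rule only accepts a numeric value followed by a length unit.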

Note

If you need to import sample rules, you can do this using the tStandardizeRow component in an existing Job, like the products_parsing Job in the standardization_examples > product directory provided by the Data Quality Demos project in your studio. For further information, see the tStandardizeRow component in the Talend Components Reference Guide.

To replicate this example, proceed as follows:

  1. In the rule list in the upper-left corner of the Interpreter test view, click the rule element. This selects the whole set of rules for testing.

  2. In the data sample box docked in the upper part of the test view, type in a data sample.

    In this example, it is 34-9923, Monolithic Membrane 6125; four by eight sheet, 26 lbs 26 lbs. This data describes a piece of merchandise.

  3. Click the save button in the upper-right corner of this data sample area and type in a name for this test case, for example SKU, in the [Save test case] dialog box.

  4. Click OK.

    This test case is displayed in the test-case list in the lower-left corner. The Interpreter test view should look like the following:

  5. Click the play button in the upper-right corner to run this test. Once done, the test result is displayed in the lower part of this view.

    From this result, you can easily see where the given rules can be improved. The data four by eight sheet represents a size, but it is not matched to the corresponding rule. You can therefore either add new rules or modify the existing ones; which approach is better depends on the context. In this example, we add an Enumeration rule and modify the length and LengthUnit rules to improve the matching accuracy.

    Name            Type            Value
    "length"        "Format"        "(INT | FRACTION | DECIMAL ) LengthUnit | Number LengthUnit?"
    "Number"        "Enumeration"   "'four' | 'eight' "
    "LengthUnit"    "Enumeration"   " 'm' | '\'' | 'inch' | 'sheet' | 'inches' | '\"' "

    The new length rule means that four or eight, with or without a length unit, can be matched.

    Note

    To update these rules, you need to understand the ANTLR grammar and the ANTLR symbols used to write a rule. For further information, see the Talend Components Reference Guide and the ANTLR website.

  6. Click the save button beneath the rule table to refresh the test view and re-generate the parser code. The data sample area and the test result area become empty.

  7. In the test-case list in the lower-left corner, select the SKU data sample you saved earlier.

  8. Click the play button again in the upper-right corner to run this test. Once done, the new test result is displayed in the corresponding area:

    From this result, you can see that the four by eight sheet data has been matched to the Size rule of the Combination type.

    Note

    The test view does not present the name of any Combination rule, because this type allows rule names to be repeated. In the ANTLR Grammar tab view, the names of Combination rules, which are not always unique, are not generated as code, in order to avoid duplicate errors. The following figure shows the code corresponding to this example: the name Size always appears as a literal value between quotation marks, without an equivalent code element, while the Format rules SKU and length have their equivalent code elements sku and length. For further information about ANTLR grammar, see the ANTLR website.

    If required, you can continue improving these rules by using more data samples. The results are always open-ended and this test view allows you to compose the rules that best fulfill your needs.
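
The effect of the rule changes made in this procedure can be approximated outside the studio with a small Python sketch. It is an illustration under stated assumptions, not the parser code generated by Talend Studio: the rules are translated by hand into regular expressions, INT, FRACTION and DECIMAL are replaced by an assumed numeric pattern, optional whitespace is assumed between rule parts, and the helper names alt and size_pattern are hypothetical.

```python
import re


def alt(*literals):
    """Build a regex alternation from literal strings, escaping each one."""
    return "(?:" + "|".join(re.escape(s) for s in literals) + ")"


# Assumed stand-in for INT | FRACTION | DECIMAL (not Talend's own definitions).
NUMBER = r"(?:\d+\.\d+|\d+/\d+|\d+)"
BY = alt("X", "x", "by")


def size_pattern(length_units, number_words=()):
    """Compile an approximation of the Size rule: length by length."""
    unit = alt(*length_units)
    # Original length rule: (INT | FRACTION | DECIMAL) LengthUnit
    length = NUMBER + r"\s*" + unit
    if number_words:
        # Improved length rule adds the alternative: Number LengthUnit?
        length = "(?:" + length + "|" + alt(*number_words) + r"(?:\s*" + unit + r")?)"
    return re.compile(length + r"\s*" + BY + r"\s*" + length)


before = size_pattern(["m", "'", "inch", "inches", '"'])
after = size_pattern(
    ["m", "'", "inch", "sheet", "inches", '"'],
    number_words=["four", "eight"],
)

SAMPLE = "34-9923, Monolithic Membrane 6125; four by eight sheet, 26 lbs 26 lbs"

print("original rules:", before.search(SAMPLE))
found = after.search(SAMPLE)
print("improved rules:", found.group(0) if found else None)
```

With the original rules, the sketch finds no size in the sample; with the improved rules, it finds four by eight sheet, which is consistent with the test result described in step 8.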