In this scenario, you will standardize some long descriptions of customer products by matching the input flow with the data contained in an index. This scenario explains how to use Index rules to tokenize product data and then check each token against an index to extract exact match.
For this scenario, you must first create an index by using a Job with the tSynonymOutput component. You need to create indexes for the brand, range, color and unit of the customer products. Use the tSynonymOutput component to generate the indexes and feed them with entries and synonyms. The below capture shows an example Job:
Below is a sample of the generated indexes for this scenario:
Each of the generated indexes has strings (sequences of words) in one column and their corresponding synonyms in the second column. These strings are used as a reference data against which the product data, generated by tFixedFlowInput, will be matched. For further information about index creation, see tSynonymOutput.
In this scenario, the generated indexes are defined as context variable. For further information about context variables, see Talend Studio User Guide.
Drop the following components from the Palette to the design workspace: tFixedFlowInput, tStandardizeRow, tExtractXMLField and tFileOutputExcel.
Connect the components together using Main links.
Drop tLogRow to the Job and link tStandardizeRow to it using a Reject link.
Double-click tFixedFlowInput to display the component Basic settings view.
This component will generate the unstructured product data from which the brand, range, color and unit are extracted.
Click the [...] button to open the [Schema] dialog box.
Click the [+] button and add two columns, name them as Long_Description and Supplier_Product_Code. Click OK to close the dialog box.
In the Mode area, select Use Inline Content (delimited file).
In the Content field, enter the unstructured product data you want to standardize, for example as the following:
HONEYWELL 7DAY SNGL CHNL TIMESWITCH C/W EX LARGE BACKLIT DISPLAY ST9100C1106|ST9100C1006||Data to be standardised|| HONEYWELL AQUASTAT SINGLE IMMERSION T/STAT WITH CAPILLARY 70-140C|L6188B2018U|||| HONEYWELL S PLAN ZONE VALVE CONTROL PACK TW6Z910C|TW6Z910C|||| H/WELL ST6100S1001 24HR 1CH SERVIC TIMER||||| H/WELL Y603A1133 BOXED Y PLAN ||||| HWEL V8600N2056 LOW VOLT COMBI VALVE ||||| HWEL VR4700C4022 GAS VALVE ||||| DULUX PAINTPOD 2M EXTENSION TUBE |6030950|||| DULUX PAINTPOD REPLACEMENT ROLLER SLEEVE|6030948|||| DULUX PAINTPOD EXTRA REACH HANDLE |6030951|||| DULUX PAINTPOD REPLACEMENT EDGING BRUSH |6030952|||| DT TRADE WEATHERSHIELD ULTIMATE OPAQUE BLACK 1L|5180583|||| DT TRADE WEATHERSHIELD ULTIMATE OPAQUE BLACK 2.5L|5180652|||| DT TRADE WEATHERSHIELD ULTIMATE OPAQUE BLACK 5L|5180744|||| DT TRADE WEATHERSHIELD ULTIMATE OPAQUE EXTRA DEEP BASE 1L|5180584||||
Double-click tStandardizeRow to display the component Basic settings view.
This component helps you to define the rules necessary to standardize the unstructured input flow and generates the brand, range, color and unit in XML format.
From the Column to parse list, select Long_Description.
Select the Standardize this field check box.
Define your rules as the following:
In the Conversion rules table, click on the [+] button to add the columns necessary to define the rules.
This scenario focuses on the rules of the type Index. For detail examples about the other rule types defined in the capture above, please refer to the other tStandardizeRow scenarios.
Define three rules as Brand, Range and Color.
From the Type list, select Index and fill in the Value field with the context variable of the indexes you generated.
For further information about how to create and use context variables, see Talend Studio User Guide.
From the Search mode list, select Match exact. Search modes are only applicable to the Index rules.
Using the Match exact mode, you will extract from the input flow only the strings that exactly match the brand, range and color index strings you generated with the tSynonymOutput component.
For further information about available search modes, see Search modes for Index rules
Click the Generate parser code in Routines button in order to generate the code under the Routines folder in the DQ Repository tree view of the Profiling perspective.
This step is mandatory, otherwise the Job will not be executed.
In the Advanced settings view, leave the options selected by default in the Output format area as they are.
The Max edits for fuzzy match is set to 1 by default.
Double-click tLogRow and define the component settings in the Basic settings view.
In the Mode area, select the Table (print values in cells of a table) option.
This component displays the tokens from the input flow that could not be analysed and matched to any of the index strings.
Double-click tExtractXMLField to display the component Basic settings view.
Click the [...] button next to the Edit schema field to open the schema editor, add four columns in the output flow and name them as Brand, Range, Color and Amount. Click OK to validate and close the editor.
In the XML field field, check that the STANDARDIZED_FIELD column is selected.
The content of the STANDARDIZED_FIELD column is an XML field with a
<record>tag as root. Under this tag, the structure defined by the rules configured in the tStandardizeRow component allows to extract items from the input flow.
In the Loop XPath query field, enter "/" to define the root as the loop node.
In XPath query field in the Mapping table, enter respectively "record/Brand", "record/Range""record/Color" and "record/Measure".
Double-click tFileOutputExcel to display the component Basic settings view.
Set the destination file name as well as the Sheet name and select the Include header check box.
Save your Job and press F6 to execute it.
The tLogRow component displays on the studio console the data the Job could not analyze:
The normalized and standardized data is written in the output excel file.
Right-click the tFileOutputExcel component and select Data viewer
A preview of the standardized data is opened in the studio.
The Brand, Range and Color rules you defined have checked each token in the input flow against the index strings and only input strings that exactly match an index string are extracted. Take for example line 11, no color has been extracted from this record as the record does not have an exact match to any of the color index strings.
Try and change in the tStandardizeRow basic settings the search mode for the Color rule to Match any, and then execute the Job.
You can see that a color has now been extracted for the record on line 11. This record has one word, "EXTRA", that matches to the color index "Extra Deep Base" and that is enough according to the Match any search mode to consider that the record has a color.
The different search modes available in the tStandardizeRow component allow you to standardize and output the data you expect.