Defining parsing rules to standardize data - 7.1

Standardization

author
Talend Documentation Team
EnrichVersion
Cloud
7.1
EnrichProdName
Talend Big Data
Talend Big Data Platform
Talend Data Fabric
Talend Data Integration
Talend Data Management Platform
Talend Data Services Platform
Talend ESB
Talend MDM Platform
Talend Open Studio for Big Data
Talend Open Studio for Data Integration
Talend Open Studio for ESB
Talend Open Studio for MDM
Talend Real-Time Big Data Platform
EnrichPlatform
Talend Studio

Procedure

  1. Double-click tStandardizeRow to display its Basic settings view.
    This component lets you define the rules needed to standardize the unstructured input flow, and it outputs the brand, range, color and unit in XML format.
  2. From the Column to parse list, select Long_Description.
  3. Select the Standardize this field check box.
  4. Define your rules as follows:
    1. In the Conversion rules table, click the [+] button to add the rows needed to define the rules.

      This scenario focuses on rules of the Index type. For detailed examples of the other rule types defined in the capture above, refer to the other tStandardizeRow scenarios.

    2. Define three rules named Brand, Range and Color.
    3. From the Type list, select Index and fill in the Value field with the context variable of the indexes you generated.
      For further information about how to create and use context variables, see the Talend Studio User Guide.
    4. From the Search mode list, select Match exact. Search modes apply only to Index rules.

      With the Match exact mode, only the strings that exactly match the brand, range and color index strings you generated with the tSynonymOutput component are extracted from the input flow. For further information about the available search modes, see Search modes for Index rules.

  5. Click the Generate parser code in Routines button to generate the code under the Routines folder in the DQ Repository tree view of the Profiling perspective.
    This step is mandatory; otherwise, the Job cannot be executed.
  6. In the Advanced settings view, keep the options selected by default in the Output format area.
    The Max edits for fuzzy match option is set to 1 by default.
  7. Double-click tLogRow and define the component settings in the Basic settings view.
  8. In the Mode area, select the Table (print values in cells of a table) option.
    This component displays the tokens from the input flow that could not be analysed and matched to any of the index strings.
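
To make the configuration above more concrete, the following Java sketch imitates what the Match exact search mode and the Max edits for fuzzy match value mean in principle. It is not the parser code that tStandardizeRow generates, and it does not query the real indexes. The class IndexRuleSketch, its methods, the in-memory sets standing in for the Brand, Range and Color indexes, and the sample description are all hypothetical. It also assumes that each index entry is a single whitespace-separated token, that exact matching is case-sensitive, and that "edits" are counted as Levenshtein-style insertions, deletions and substitutions.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical illustration only: not Talend's generated parser code.
final class IndexRuleSketch {

    /** Values captured per rule, plus the tokens that matched no index. */
    static final class Result {
        final Map<String, String> matched = new LinkedHashMap<>();
        final List<String> unmatched = new ArrayList<>();
    }

    /** Assigns each token to the first rule whose index contains exactly that token. */
    static Result standardize(String description, Map<String, Set<String>> indexes) {
        Result result = new Result();
        for (String token : description.trim().split("\\s+")) {
            if (token.isEmpty()) {
                continue; // a blank description yields one empty token
            }
            String rule = findRule(token, indexes);
            if (rule != null && !result.matched.containsKey(rule)) {
                result.matched.put(rule, token);
            } else {
                result.unmatched.add(token); // no exact match, or rule already filled
            }
        }
        return result;
    }

    /** Returns the name of the first rule whose index holds the token, or null. */
    private static String findRule(String token, Map<String, Set<String>> indexes) {
        for (Map.Entry<String, Set<String>> entry : indexes.entrySet()) {
            if (entry.getValue().contains(token)) {
                return entry.getKey();
            }
        }
        return null;
    }

    /** Levenshtein distance: minimum number of single-character edits from a to b. */
    static int editDistance(String a, String b) {
        int[] prev = new int[b.length() + 1];
        for (int j = 0; j <= b.length(); j++) {
            prev[j] = j;
        }
        for (int i = 1; i <= a.length(); i++) {
            int[] curr = new int[b.length() + 1];
            curr[0] = i;
            for (int j = 1; j <= b.length(); j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                curr[j] = Math.min(Math.min(curr[j - 1] + 1, prev[j] + 1), prev[j - 1] + cost);
            }
            prev = curr;
        }
        return prev[b.length()];
    }

    public static void main(String[] args) {
        // Made-up index contents standing in for the generated indexes.
        Map<String, Set<String>> indexes = new LinkedHashMap<>();
        indexes.put("Brand", Set.of("Acme"));
        indexes.put("Range", Set.of("Turbo"));
        indexes.put("Color", Set.of("Black"));

        Result result = standardize("Acme Turbo Black 500g", indexes);
        System.out.println("matched   = " + result.matched);
        System.out.println("unmatched = " + result.unmatched);
        System.out.println("edits(Blak, Black) = " + editDistance("Blak", "Black"));
    }
}
```

In this sketch, a misspelled token such as "Blak" is not captured by the exact comparison and ends up in the unmatched list, which is the kind of token the tLogRow component displays in this scenario. The same token is one edit away from "Black", so under the stated assumption a fuzzy search mode with a maximum of 1 edit would accept it.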