Normalizing data using rules of basic types - 6.5

Standardization

author
Talend Documentation Team
EnrichVersion
6.5
EnrichProdName
Talend Big Data
Talend Big Data Platform
Talend Data Fabric
Talend Data Integration
Talend Data Management Platform
Talend Data Services Platform
Talend ESB
Talend MDM Platform
Talend Open Studio for Big Data
Talend Open Studio for Data Integration
Talend Open Studio for ESB
Talend Open Studio for MDM
Talend Real-Time Big Data Platform
task
Data Governance > Third-party systems > Data Quality components > Standardization components
Data Quality and Preparation > Third-party systems > Data Quality components > Standardization components
Design and Development > Third-party systems > Data Quality components > Standardization components
EnrichPlatform
Talend Studio

This scenario applies only to Talend Data Management Platform, Talend Big Data Platform, Talend Real Time Big Data Platform, Talend Data Services Platform, Talend MDM Platform and Talend Data Fabric.

For more technologies supported by Talend, see Talend components.

In this scenario, two steps are performed to:

  1. normalize the incoming data (separate the compliant data from the non-compliant data) and,

  2. extract the data of interests and display it.

Before replicating these two steps, we need to analyze the source data in order to figure out what rules need to be composed. For this scenario, the source data is stored in a .csv file called partsmaster.

There are totally 59 rows of raw data, but some of them are not shown in our capture.

Through observation, you can expect that the third row will not be recognized as it contains Oriental characters. Furthermore, you can figure out that:

  • the SKU data contains 34-9923, XC-3211 and pb710125 and so on. So the rule used to parse the SKU data could be:

    Name

    Type

    Value

    "SKU"

    "Format"

    "(DIGIT DIGIT|LETTER LETTER) '-'? DIGIT DIGIT DIGIT DIGIT (DIGIT DIGIT?)? "

  • for the Size data, the correct format is the multiplication of two or three lengths plus the length units. Therefore, the rules used to parse the Size data could be:

    Name

    Type

    Value

    "LengthUnit"

    "Enumeration"

    " 'm' | '\'' | 'inch' | 'inches' | '\"'"

    "BY"

    "Enumeration"

    "'X' | 'x' | 'by' "

    "Length"

    "Format"

    "(INT | FRACTION | DECIMAL) WHITESPACE* LengthUnit "

    "Size"

    "Combination"

    "Length BY Length BY Length"

    "Size"

    "Combination"

    "Length BY Length"

Two Combination rules use the same name, in which case, they will be executed in top-down order as is presented in this table.

  • for the Weight data, the correct format is the weight plus the weight unit. Therefore, the rules used to parse the Weight data are:

    Name

    Type

    Value

    "WeightUnit"

    "Enumeration"

    " 'lb' | 'lbs' | 'pounds' | 'Kg' | 'pinds'"

    "Weight"

    "Format"

    "(INT | FRACTION | DECIMAL) WHITESPACE* WeightUnit "

Now, you can begin to replicate the two steps of this scenario.