Scenario 1: Normalizing data using rules of basic types - 6.1

Talend Components Reference Guide


In this scenario, two steps are performed to:

  1. normalize the incoming data (separate the compliant data from the non-compliant data), and

  2. extract the data of interest and display it.

Before replicating these two steps, we need to analyze the source data in order to figure out what rules need to be composed. For this scenario, the source data is stored in a .csv file called partsmaster.

There are 59 rows of raw data in total, but some of them are not shown in the capture.

Through observation, you can expect that the third row will not be recognized, as it contains Asian characters. Furthermore, you can figure out that:

  • the SKU data contains values such as 34-9923, XC-3211 and pb710125. So the rule used to parse the SKU data could be:

    Name     Type       Value
    "SKU"    "Format"   "(DIGIT DIGIT|LETTER LETTER) '-'? DIGIT DIGIT DIGIT DIGIT (DIGIT DIGIT?)? "

  • for the Size data, the correct format is the multiplication of two or three lengths plus the length units. Therefore, the rules used to parse the Size data could be:

    Name          Type           Value
    "LengthUnit"  "Enumeration"  " 'm' | '\'' | 'inch' | 'inches' | '\"'"
    "BY"          "Enumeration"  "'X' | 'x' | 'by' "
    "Length"      "Format"       "(INT | FRACTION | DECIMAL) WHITESPACE* LengthUnit "
    "Size"        "Combination"  "Length BY Length BY Length"
    "Size"        "Combination"  "Length BY Length"

The two Combination rules use the same name; in this case, they are executed in the top-down order presented in this table.

  • for the Weight data, the correct format is the weight plus the weight unit. Therefore, the rules used to parse the Weight data are:

    Name          Type           Value
    "WeightUnit"  "Enumeration"  " 'lb' | 'lbs' | 'pounds' | 'Kg' | 'pinds'"
    "Weight"      "Format"       "(INT | FRACTION | DECIMAL) WHITESPACE* WeightUnit "
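These rules can be roughly approximated with ordinary regular expressions. The sketch below only illustrates what each rule matches; the names (SKU_RE, SIZE_RE, WEIGHT_RE) are our own, and the parser that tStandardizeRow actually generates is ANTLR-based, not regex-based:

```python
import re

# Rough regex approximations of the conversion rules (illustrative only).
# SKU: (DIGIT DIGIT|LETTER LETTER) '-'? DIGIT DIGIT DIGIT DIGIT (DIGIT DIGIT?)?
SKU_RE = re.compile(r"(\d{2}|[A-Za-z]{2})-?\d{4}(\d\d?)?")

LENGTH_UNIT = r"(?:m|'|inch(?:es)?|\")"           # "LengthUnit" enumeration
BY = r"(?:X|x|by)"                                # "BY" enumeration
LENGTH = r"\d+(?:[./]\d+)?\s*" + LENGTH_UNIT      # "Length": number plus unit
# "Size": two or three lengths joined by BY.
SIZE_RE = re.compile(LENGTH + r"\s*" + BY + r"\s*" + LENGTH
                     + "(?:" + r"\s*" + BY + r"\s*" + LENGTH + ")?")

WEIGHT_UNIT = r"(?:lbs?|pounds|Kg|pinds)"         # "WeightUnit" enumeration
WEIGHT_RE = re.compile(r"\d+(?:[./]\d+)?\s*" + WEIGHT_UNIT)

print(bool(SKU_RE.fullmatch("34-9923")))    # True
print(bool(SIZE_RE.fullmatch("4' x 8'")))   # True
print(bool(WEIGHT_RE.fullmatch("26 lbs")))  # True
```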

Now, you can begin to replicate the two steps of this scenario.

Separating the compliant data and the non-compliant data

In this first step, four components are used:

  • tFileInputDelimited (partsmaster): this component loads and outputs the source data to the component that follows.

  • tStandardizeRow: this component normalizes the incoming data and rejects the exception data that it cannot recognize.

  • tLogRow: this component is used twice, to display respectively the exception data and the output XML data.

To replicate this step, proceed as the following sections illustrate.

Dropping and linking the components

To replicate this step, proceed as follows:

  1. Set up the file delimited schema for the partsmaster.csv file in the Repository of your Studio. For further information about how to set up a file delimited schema, see Talend Studio User Guide.

  2. Under the Metadata node of the Repository of your Studio, select the file delimited schema for partsmaster.csv and drop it on the design workspace.

  3. From the dialog box that pops up, select tFileInputDelimited.

    Then the tFileInputDelimited (partsmaster) component displays in the design workspace.

  4. From the Palette, drop the tStandardizeRow component and two tLogRow components into the design workspace.

  5. Right-click tFileInputDelimited (partsmaster) to open the contextual menu.

  6. From the contextual menu, select Row > Main link to connect this component to tStandardizeRow.

  7. Accept the schema propagation prompted by the pop-up dialog box.

  8. From tStandardizeRow, do the same to connect this component to one of the two tLogRow components using Row > Main link and to the other using Row > Reject link and accept the schema propagation.

    You can change the name displayed for each of these components, as has been done for one of the tLogRow components, named Exception in this scenario. For further information, see Talend Studio User Guide.

Then you can continue to configure each component to be used.

Configuring the process of normalizing rows

To do this, proceed as follows:

  1. Double-click tStandardizeRow to open its Component view.

  2. In the Column to parse field, select SKU_Description_Size_Weight. This is the only column that the incoming schema has.

  3. Under the Conversion rules table, click the plus button eight times to add eight rows in this table.

  4. To complete these rows, type in the rules you have figured out when analyzing the raw data at the beginning of this scenario.

    The two Size rules are executed in top-down order. In this example, this order allows the component to match first the sizes with three numbers and then those with two numbers. If you reverse this order, the component will match the first two numbers of every size and then treat the last number of the three-number sizes as unmatched.

  5. Click the Generate parser code in routines button.

  6. In the Advanced settings view, leave the options selected by default in the Output format area as they are.

    The Max edits for fuzzy match is set to 1 by default.
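The order sensitivity described in step 4 can be illustrated outside the Studio with a small sketch. The two hypothetical patterns below stand in for the two Size Combination rules; patterns are tried top-down and the first match wins:

```python
import re

LENGTH = r"\d+\s*'"   # simplified "Length": a number followed by a foot mark
BY = r"\s*x\s*"       # simplified "BY" separator
three_lengths = re.compile(LENGTH + BY + LENGTH + BY + LENGTH)  # Length BY Length BY Length
two_lengths = re.compile(LENGTH + BY + LENGTH)                  # Length BY Length

def first_match(text, rules):
    # Rules are tried in top-down order; the first one that matches wins.
    for rule in rules:
        m = rule.match(text)
        if m:
            return m.group(0)
    return None

# Three-length rule first: the whole size is matched.
print(first_match("4' x 8' x 2'", [three_lengths, two_lengths]))  # 4' x 8' x 2'
# Reversed order: only the first two lengths match; " x 2'" is left over.
print(first_match("4' x 8' x 2'", [two_lengths, three_lengths]))  # 4' x 8'
```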

Executing the normalization Job

Press F6 to run this Job.

In the console of the Run view, you can read the output data structure. Each instance of the XML data is written on a separate row because the Pretty print check box is selected in the Advanced settings view of the tStandardizeRow component.

Corresponding to a row of raw data such as 34-9923, Monolithic Membrane, 4' x 8', 26 lbs, an XML segment is generated in which the compliant data are separated from the non-compliant data (unmatched). Thus the raw data are normalized.
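Given the tags used in this scenario (a record root, the rule names as element names, and unmatched for the non-compliant parts), the XML segment generated for such a row might look like the following; the exact element layout shown here is illustrative:

```xml
<record>
  <SKU>34-9923</SKU>
  <unmatched>Monolithic Membrane</unmatched>
  <Size>4' x 8'</Size>
  <Weight>26 lbs</Weight>
</record>
```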

Again from this console, you can read that a row is rejected as it contains unrecognizable data.

Extracting data of interest from the normalized XML data

This section presents the second step of this scenario. Four components are added to receive the normalized XML data in place of the tLogRow component which was set for this action in the first step. The four components are:

  • tExtractXMLField: this component reads an input XML field and extracts desired data;

  • tFilterRow: this component filters input rows by setting conditions on the selected columns;

  • tLogRow (two times): these two components receive and display respectively the data that is valid or invalid against the conditions set in tFilterRow.

To replicate the second step, proceed as the following sections illustrate.

Dropping and linking the additional components

To do this, proceed as follows:

  1. On the design workspace, delete the tLogRow component that is connected to tStandardizeRow with Main row link.

  2. From the Palette, drop tExtractXMLField, tFilterRow and two tLogRow components into the design workspace. In this scenario, the two tLogRow components are named respectively ValidSize and InvalidSize.

    For further information about how to rename a component, see Talend Studio User Guide.

  3. Connect tExtractXMLField to tFilterRow with Main link as you did in the first step to connect partsmaster to tStandardizeRow.

  4. Do the same but with Filter link to connect tFilterRow to ValidSize.

  5. Connect tFilterRow to InvalidSize with Reject link as you did in the first step to connect tStandardizeRow to Exception.

Configuring the process of extracting the XML fields of interest

To do this, proceed as follows:

  1. Double click tExtractXMLField to open its Component view.

  2. Next to Edit schema, click the three-dot button to open the schema editor.

  3. On the left side of this editor, select the SKU_Description_Size_Weight row.

  4. Click the button to add this row to the right side of the editor.

  5. On the right side of this editor, click the plus button on the bottom toolbar three times to add three rows.

  6. Rename the three rows as SKU, Size, Weight, all of which are tags of the normalized XML data.

  7. Click OK to validate this editing and accept the schema propagation prompted by the pop-up dialog box. Four rows are automatically filled in the Mapping table on the Component view.

  8. In the XML field field of the Component view, select the desired XML field of the incoming data. In this example, it is Normalized_field.

  9. In the Loop XPath Query field, type in the name of the XML Tree root tag to replace the default value. In this example, type in "/record".

  10. In the XPath query column of the Mapping table, type in the XML Tree tags corresponding to desired normalized data between quotation marks. In this example, they are "SKU", "Size", "Weight".
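The extraction configured above can be mirrored with Python's standard ElementTree module. The sample XML string below is hypothetical; it only assumes the record root and the SKU, Size and Weight tags set up in this procedure:

```python
import xml.etree.ElementTree as ET

# Hypothetical normalized record, shaped like the output of tStandardizeRow.
normalized = "<record><SKU>34-9923</SKU><Size>4' x 8'</Size><Weight>26 lbs</Weight></record>"

root = ET.fromstring(normalized)
# Loop XPath query "/record", then relative queries "SKU", "Size", "Weight".
row = {tag: root.findtext(tag, default="") for tag in ("SKU", "Size", "Weight")}
print(row["SKU"], "|", row["Size"], "|", row["Weight"])  # 34-9923 | 4' x 8' | 26 lbs
```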

Configuring the process of filtering the rows of interest

To do this, proceed as follows:

  1. Double click tFilterRow to open its Component view.

  2. Under the Conditions table, click the plus button to add one row.

  3. In the InputColumn column of this row, select Size from the drop-down list.

  4. In the Function column, select Empty.

  5. In the Operator column, select Not equal to.

  6. In the Value column, type in a pair of quotation marks: "".

The condition is set up. The normalized data which contain no null in the Size tag will be output while those which contain null in this Size tag will be rejected.
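The effect of this condition can be sketched in plain Python. The two sample rows below are hypothetical; the point is only that rows whose Size field is an empty string are rejected:

```python
# Hypothetical rows as they would arrive from tExtractXMLField.
rows = [
    {"SKU": "34-9923", "Size": "4' x 8'", "Weight": "26 lbs"},
    {"SKU": "pb710125", "Size": "", "Weight": "9 lbs"},  # Size was unmatched upstream
]

# tFilterRow condition: Size, Empty, Not equal to, "".
valid = [r for r in rows if r["Size"] != ""]     # sent to ValidSize
rejected = [r for r in rows if r["Size"] == ""]  # sent to InvalidSize
print(len(valid), len(rejected))  # 1 1
```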

Executing the Job
  • Press F6 to run this Job.

    In the console of the Run view, you can read the desired data selected from the normalized XML data flow.

    This Job lists the normalized data of interest in parallel with the raw data.

    You can read the data rejected by tFilterRow as well.

    Note

    Due to space limitations, not all of the results are displayed.