Scenario 2: Standardizing addresses from unstructured data - 6.1

Talend Components Reference Guide

EnrichVersion
6.1
EnrichProdName
Talend Big Data
Talend Big Data Platform
Talend Data Fabric
Talend Data Integration
Talend Data Management Platform
Talend Data Services Platform
Talend ESB
Talend MDM Platform
Talend Open Studio for Big Data
Talend Open Studio for Data Integration
Talend Open Studio for Data Quality
Talend Open Studio for ESB
Talend Open Studio for MDM
Talend Real-Time Big Data Platform
task
Data Governance
Data Quality and Preparation
Design and Development
EnrichPlatform
Talend Studio

In this scenario, six components are used to standardize addresses from unstructured input data copied from a Website, by matching the data with the data contained in an index previously generated by a Job.

For more information about index creation, see tSynonymOutput.

Drop the following components from the Palette to the design workspace.

  • tFixedFlowInput: this component holds the unstructured data from which the addresses are extracted.

  • tStandardizeRow: this component defines the address rules and generates the addresses in XML format with the defined tags. This is the process of normalizing and standardizing the initially unstructured data.

  • tFilterColumns: this component filters the standardized addresses.

  • tExtractXMLField: this component extracts the attributes from the Address node of the XML tree in order to output every address item in formatted columns.

  • two tLogRow: these components are used to display the output data. The first tLogRow returns the errors, if any. The second tLogRow displays the result in the console.

Before starting up to replicate this scenario, you have to retrieve the content of an index in order to match the unstructured data with the index data. The content of the index reads as follows:

On the left side are held Paris street names and on the right side are held the synonyms used in the data. The data will be used as references to standardize the address data collected from the website of interest, for example, from http://paris.conciergerie.com/travel_paris/address.php.

To replicate this scenario, proceed as the following sections illustrate.

Preparing the unstructured data

To do this, proceed as follows:

  1. Double-click the tFixedFlowInput component to display its Basic settings view.

  2. Click the [...] button to open the [Schema] dialog box, click the [+] button to add a column, and name it input. Then, click OK to validate and close the editor.

  3. In the Mode area, select Use Inline Content (delimited file).

  4. In the Content field, paste the data copied from the Website. This data is unstructured.

Then you can continue to configure the standardization process.

Configuring the process of standardizing the unstructured data

To do this, proceed as follows:

  1. Connect the tFixedFlowInput component to the tStandardizeRow component with a Row > Main link.

  2. Double-click the tStandardizeRow component to display its Basic settings view.

    The advanced rule types are always executed after the ANTLR specific rules regardless of rule order.

  3. In the Column to parse list, select input.

  4. Select the Standardize this field check box.

  5. In the Conversion rules table, click six times on the [+] button to add six columns. Name them and configure them as shown in the capture.

    For "StreetName" rule: select the Index type in the Type field and enter the path to your index in the Value field.

    • Select the Index type from the Type list.

    • Enter the path to your index in the Value field.

    • Select Match exact as the search mode from the Search mode list.

      By using the Match exact mode, you will extract from the input flow only the strings that exactly match the street name indexes.

      For further information about available search modes, see Search modes for Index rules

    Then the other rules are:

    Name

    Type

    Value

    "Zip"

    Format

    " DIGIT DIGIT DIGIT DIGIT DIGIT "

    "City" Enumeration

    " 'Paris' | 'Paris' | 'PARIS' | 'Bagnolet' "

    "SpecialStreetNumber" Format

    " (INT ('bis'|'ter')) | ( INT '-' INT) "

    "INT" Shape

    "<SpecialStreetNumber>"

    "Address" Shape

    " <INT> <StreetName> <Zip> <City> "

    As advanced rule types are always executed after the ANTLR specific rules, the basic rules "Zip", "City" and "SpecialStreetNumber" are executed first followed by "INT" and "Address", the advanced rules.

  6. Click the Generate parser code in Routines button in order to generate the code in the Routines.

    Warning

    If you do not click on this button, the Job will not be executed.

  7. In the Advanced settings view, leave the options selected by default in the Output format area as they are.

    The Max edits for fuzzy match is set to 1 by default.

  8. In the design workspace, right-click the tStandardizeRow component to connect it to the tLogRow below with a Row > Reject link.

  9. Double-click the tLogRow component linked to the tStandardizeRow component to display its Basic settings view.

  10. In the Mode area, select the Table (print values in cells of a table) option.

Then continue to configure the process of filtering and extracting the data of interest.

Configuring the process of filtering and extracting the data of interest

To do this, proceed as follows:

  1. In the design workspace, connect tStandardizeRow to tFilterColumns with a Row > Main link.

  2. In the Basic settings view of tFilterColumns, click the [...] button next to the Edit schema field to open the schema editor.

  3. Click on the STANDARDIZED_FIELD column, click the first arrow to filter this column, and then click OK to close the editor and validate the changes.

  4. In the design workspace, connect tFilterColumns to tExtractXMLField with a Row > Main link.

  5. Double-click the tExtractXMLField component to display its Basic settings view.

  6. Click the [...] button next to the Edit schema field to open the schema editor, add four columns and name them respectively number, street, zip and city, and then click OK to validate and close the editor.

  7. In the XML field field, check that the STANDARDIZED_FIELD column is selected.

    The content of the STANDARDIZED_FIELD column is an XML field with a <record> tag as root. Under this tag, the structure defined by the rules configured in the tStandardizeRow component allows to extract each item of the address, as you can see in the following capture.

  8. In the Loop XPath query field, type in "/record/Address".

  9. In the Mapping table, in the XPath query field, enter "INT" for the number column, "StreetName" for the street column, "Zip" for the zip column and "City" for the city column.

Then you can prepare to execute the Job.

Executing the Job

Before executing the Job, you need to add the tLogRow component to display the execution result.

To do this, perform the following operations:

  1. In the design workspace, connect tExtractXMLField to the second tLogRow with a Row > Main link.

  2. Double-click the tLogRow component to open its Basic settings view.

  3. In the Schema list, select Built-In.

  4. Click the [...] button next to the Edit schema field and check that the input and the output schemas in the tLogRow are synchronized.

    If not, click on the Sync columns button.

  5. In the Mode area, select the Table (print values in cells of a table) option.

Press F6 to execute the Job.

The second tLogRow displays the normalized and standardized addresses in the console according to the index.

For example, you can see that the Champs Elysées address has been selected and formatted, and only input street names that exactly match an index string are extracted.

The first tLogRow does not return anything in this scenario because no error has been generated.