Scenario: Parsing addresses against Loqate data - 6.1

Talend Components Reference Guide


This scenario describes a three-component Job that:

  • uses the tFixedFlowInput component to generate the address data to be analyzed,

  • uses the tLoqateAddressRow component to parse, standardize and format the US addresses generated by the tFixedFlowInput component,

  • uses a tFileOutputExcel component to output the correctly formatted addresses to an .xls file.

Prerequisites: Before you can use the tLoqateAddressRow component, you must order and download the Loqate Local API and the Global Knowledge Repository from http://www.loqate.com/.

tLoqateAddressRow uses the Q4 2012 Loqate release.

Setting up the Job

  1. Drop the following components from the Palette onto the design workspace: tFixedFlowInput, tLoqateAddressRow and tFileOutputExcel.

  2. Connect the three components together using the Main links.

Configuring the input component

  1. Double-click tFixedFlowInput to open its Basic settings view in the Component tab.

  2. Create the schema through the Edit Schema button.

    In the dialog box that opens, click the plus button and add the columns that will hold the input address information. In this example, the columns are address_input, COUNTRY and data_description.

  3. Click OK.

  4. In the Number of rows field, set the number of rows to 1.

  5. In the Mode area, select the Use Inline Content (delimited file) option, and set the row and field separators in the corresponding fields.

  6. In the Content table, enter the address data you want to analyze, for example:

    Boise Town Square  421 N Cole Rd   83704,,wrong data
    Boise Capitol 280 S Capitol Blvd  83702,us,both address coutry correct
    Federal Way  3563 South Federal Way    83705,US, both correct
    Salmon Creek In-Store (ALB) 14300 NE 20th Ave Ste.B-101  Vancouver WA 98686,US,both correct
    Battle Ground   2500 West Main Street,,no country;address miss(Battle Ground WA 98604 )
    Battle Ground   2500 West abcd Street,,no country address changed
    south southjkjkjkjkjkj,,wrong data
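As a rough illustration of how inline content like the rows above breaks down into the three schema columns, here is a minimal Python sketch. It assumes a comma as the field separator and a newline as the row separator (the scenario leaves the actual separator values to you). The helper name split_inline_content is hypothetical and is not part of Talend.

```python
# Illustration only: mimics how delimited inline content maps onto the schema.
# Assumption: "," is the field separator and "\n" is the row separator.
COLUMNS = ["address_input", "COUNTRY", "data_description"]

# Three of the sample rows from the Content table.
CONTENT = """Boise Town Square  421 N Cole Rd   83704,,wrong data
Boise Capitol 280 S Capitol Blvd  83702,us,both address coutry correct
south southjkjkjkjkjkj,,wrong data"""


def split_inline_content(content, field_sep=",", row_sep="\n"):
    """Return one dict per row, keyed by the schema column names."""
    rows = []
    for line in content.split(row_sep):
        values = line.split(field_sep)
        rows.append(dict(zip(COLUMNS, values)))
    return rows


rows = split_inline_content(CONTENT)
for row in rows:
    print(row)
```

Rows with an empty second field, such as the first one, carry an empty COUNTRY value into the next component.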

Configuring the tLoqateAddressRow component

  1. Double-click tLoqateAddressRow to display the Basic settings view and define the component properties.

  2. Click the Edit schema button and define in the output schema all the columns necessary to hold the formatted address you want to get from tLoqateAddressRow.

    Two output columns are read-only: STATUS and ACCURACYCODE. The first column returns the status of processing input addresses. For further information about process status, see Process status in tLoqateAddressRow. The second column returns the verification code for the processed address. For further information about what values this code is made up of and the implications of each segment, see Address verification codes in tLoqateAddressRow.

    In this example, adding the same address_input column to the output schema passes the input address through to the output. This is helpful for comparing how the address elements were parsed and standardized.

  3. Click OK and, when prompted, accept the propagation of the changes.

  4. In the Input Address table:

    • add lines in the table,

    • in the Address Field column, click a line and select, from the list predefined in the component, the fields that hold the input address: Address and Country in this example,

    • in the Input Column column, click a line and select, from the list of input schema columns, the columns that hold the input address: address_input and COUNTRY in this example.

  5. In the Output Address table:

    • add lines in the table,

    • in the Address Field column, click a line and select from the list, predefined in the component, the fields that will hold the output address.

      The component will map the values of these fields to the output columns you set in this table.

      tLoqateAddressRow provides a long list of individual fields because some countries have more complex addressing structures than others. For further information about the output fields, see Address fields in tLoqateAddressRow.

    • in the Output Column column, click a line and select from the list the columns that will hold the standardized output address.

  6. In the Loqate Data Path field, set the path to the Loqate data folder provided by Loqate and installed locally.
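Conceptually, the Input Address table is a lookup from an address field predefined in the component to the input column that feeds it. The sketch below, written in Python for illustration only, models that lookup with the values used in this scenario. The function name build_address_request is hypothetical and does not correspond to any Talend or Loqate API.

```python
# Illustration only: the Input Address table modeled as a dict.
# Keys are the address fields predefined in the component,
# values are the input schema columns selected in this scenario.
INPUT_ADDRESS = {
    "Address": "address_input",
    "Country": "COUNTRY",
}


def build_address_request(row, mapping=INPUT_ADDRESS):
    """Pick the mapped input columns of one row, keyed by address field."""
    return {field: row[column] for field, column in mapping.items()}


row = {
    "address_input": "Boise Capitol 280 S Capitol Blvd  83702",
    "COUNTRY": "us",
    "data_description": "both address coutry correct",
}
print(build_address_request(row))
```

Columns that are not listed in the table, such as data_description here, are not part of the address lookup.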

Setting a JVM argument and finalizing the Job

  1. Double-click the tFileOutputExcel component to display the Basic settings view and define the component properties.

  2. Set the destination file name as well as the sheet name and then select the Include header and Define all columns auto size check boxes.

  3. Click the Run tab and then, in the view that opens, click Advanced settings.

  4. Select the Use specific JVM arguments check box and then click New....

  5. In the pop-up window, set the following JVM argument: -Djava.library.path=<path/to/libloqatejava.dll/folder/>.

    In this argument, you must indicate the folder where the Loqate library is installed. The library is called libloqatejava.so on Linux and loqatejava.dll on Windows.

    If the JVM argument is missing or incorrect, expect the following error: java.lang.Error: java.lang.UnsatisfiedLinkError.

  6. Save your Job and press F6 to execute it.

    tLoqateAddressRow reads the input address data. It parses, verifies, cleanses and standardizes the addresses, and returns the result in the output columns you defined in the output schema.

    tLoqateAddressRow matches input address data against the Loqate data file you downloaded locally.

    The STATUS standard output column returns the psOK status for all address rows. This means that the component completed the verification process successfully for every address row. For further information about process status, see Process status in tLoqateAddressRow.

    The ACCURACYCODE standard output column returns a verification code for each of the processed address rows. For example, the first verification code V44-I45-P7-100 means:

    • Verification status = V (verified): a complete match was made between the input address and a single record from the available reference data.

    • Post-processed verification match level = 4 (premises): the level to which the input data matches the available reference data once all changes and additions performed during the verification process have been taken into account.

    • Pre-processed verification match level = 4 (premises): the level to which the input data matches the available reference data prior to any changes or additions performed during the verification process.

    • Parsing status = I (identified and parsed): all components of the input data have been able to be identified and placed into output fields.

    • Lexicon identification match level = 4 (premises): using pattern matching, a numeric value or word has been identified as a premises number or name.

    • Context identification match level = 5 (delivery point, PostBox or SubBuilding): a numeric value or word has been identified as a post box number or sub building name.

    • Postcode Status = P7 (added): the primary postal code for the country has been verified and a secondary postal code has been added.

    • Match score = 100 (complete similarity): the input data and closest reference data match completely.
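The breakdown above can be sketched as a small parser. This is a minimal Python illustration, assuming that every code follows the same layout as the single example V44-I45-P7-100; the pattern, the field names and the function parse_accuracy_code are hypothetical and are not part of the component.

```python
import re

# Assumption: the layout below is inferred from the single example
# V44-I45-P7-100 and may not cover every code the component can return.
CODE_PATTERN = re.compile(
    r"^(?P<verification_status>[A-Z])"
    r"(?P<post_match_level>\d)"
    r"(?P<pre_match_level>\d)"
    r"-(?P<parsing_status>[A-Z])"
    r"(?P<lexicon_match_level>\d)"
    r"(?P<context_match_level>\d)"
    r"-(?P<postcode_status>[A-Z]\d)"
    r"-(?P<match_score>\d{1,3})$"
)

# Segments that are numeric levels or scores rather than letter codes.
INT_FIELDS = (
    "post_match_level",
    "pre_match_level",
    "lexicon_match_level",
    "context_match_level",
    "match_score",
)


def parse_accuracy_code(code):
    """Split an ACCURACYCODE value into its named segments."""
    match = CODE_PATTERN.match(code)
    if match is None:
        raise ValueError(f"Unrecognized accuracy code: {code!r}")
    parts = match.groupdict()
    for name in INT_FIELDS:
        parts[name] = int(parts[name])
    return parts


parsed = parse_accuracy_code("V44-I45-P7-100")
print(parsed)
```

Splitting the code this way makes it easy to filter output rows, for example to keep only rows whose verification status is V.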

    For further information about what values this code is made up of and the implications of each segment, see Address verification codes in tLoqateAddressRow.
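If the Job fails with java.lang.UnsatisfiedLinkError, it can help to check the folder used in the JVM argument before running the Job again. The following Python sketch is one way to do that outside the Studio. The helper names and the folder /opt/loqate/lib are hypothetical; only the two library file names come from this scenario.

```python
import os
import platform


def loqate_library_name(system=None):
    """Return the library file name this scenario gives for the platform."""
    system = system or platform.system()
    if system == "Windows":
        return "loqatejava.dll"
    if system == "Linux":
        return "libloqatejava.so"
    # The scenario only names the Linux and Windows libraries.
    raise ValueError(f"No library name documented for {system}")


def jvm_argument(folder):
    """Build the JVM argument that points at the library folder."""
    return "-Djava.library.path=" + folder


def check_library_folder(folder, system=None):
    """Return True if the folder contains the expected library file."""
    return os.path.isfile(os.path.join(folder, loqate_library_name(system)))


# Hypothetical folder, used for illustration only.
folder = "/opt/loqate/lib"
print(jvm_argument(folder))
print(check_library_folder(folder, system="Linux"))
```

If the check returns False, the folder in the JVM argument does not contain the library file named in this scenario.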