Scenario 2: Matching customer data through multiple passes - 6.1

Talend Components Reference Guide

EnrichVersion
6.1
EnrichProdName
Talend Big Data
Talend Big Data Platform
Talend Data Fabric
Talend Data Integration
Talend Data Management Platform
Talend Data Services Platform
Talend ESB
Talend MDM Platform
Talend Open Studio for Big Data
Talend Open Studio for Data Integration
Talend Open Studio for Data Quality
Talend Open Studio for ESB
Talend Open Studio for MDM
Talend Real-Time Big Data Platform
EnrichPlatform
Talend Studio
task
Data Governance
Data Quality and Preparation
Design and Development

The Job in this scenario, groups similar customer records by running through two subsequent matching passes (tMatchGroup components) and outputs the calculated matches in groups. Each pass provides its matches to the pass that follows in order for the latter to add more matches identified with new rules and blocking keys.

In this Job:

  • The tMysqlInput component connects to the customer records to be processed.

  • Each of the tGenKey components defines a way to partition data records. The first key partitions data to many groups and the second key creates fewer groups that overlaps the previous blocks depending on the blocking key definition.

  • The tMap component renames the key generated by the second tGenKey component.

  • The first tMatchGroup processes the partitions defined by the first tGenKey, and the second tMatchGroup processes those defined by the second tGenKey.

    Warning

    The two tMatchGroup components must have the same schema.

  • The tLogRow component presents the matching results after the two passes.

Setting up the Job

In this scenario, the main input schema is already stored in the Repository. For more information about storing schema metadata in the repository, see the Talend Studio User Guide.

  1. In the Repository tree view, expand Metadata - DB Connections where you have stored the main input schema and drop the database table onto the design workspace. The input table used in this scenario is called customer.

    A dialog box is displayed with a list of components.

  2. Select the relevant database component, tMysqlInput in this example, and then click OK.

  3. Drop two tGenKey components, two tMatchGroup components, a tMap and a tLogRow components from Palette onto the design workspace.

  4. Link the input component to the tGenKey and tMap components using Main links.

  5. In the two tMatchGroup components, select the Output distance details check boxes in the Advanced settings view of both components before linking them together.

    This will provide the MATCHING_DISTANCES column in the output schema of each tMatchGroup.

    Note

    If the two tMatchGroup components are already linked to each other, you must select the Output distance details check box in the second component in the Job flow first otherwise you may have an issue.

  6. Link the two tMatchGroup components and the tLogRow component using Main links.

  7. If needed, give the components specific labels to reflect their usage in the Job.

    For further information about how to label a component, see Talend Studio User Guide.

Configuring input data and key generation

Connecting to the input data

  1. Double-click tMysqlInput to open its Component view.

    The property fields for tMysqlInput are automatically filled in. If you do not define your input schema locally in the repository, fill in the details manually after selecting Built-in in the Schema and Property Type lists.

    The input table used in this scenario is called customer.

  2. Modify the query in the Query box to select only the columns you want to match: account_name, lname, fname, mi, address1, city, state_province and postal_code.

Configuring the key generation for the first pass

  1. Double-click the first tGenKey to open the Component view.

  2. Click and import blocking keys from match rules created and tested in the Profiling perspective of Talend Studio and use them in your Job. Otherwise, define the blocking key parameters as described in the below steps.

  3. Under the Algorithm table, click the [+] button to add two rows in the table.

  4. On the column column, click the newly added row and select from the list the column you want to process using an algorithm. In this example, select lname.

  5. Do the same on the second row to select postal_code.

  6. On the pre-algorithm column, click the newly added row and select from the list the pre-algorithm you want to apply to the corresponding column.

    In this example, select remove diacritical marks and convert to upper case to remove any diacritical mark and converts the fields of the lname column to upper case.

    Note

    This conversion does not change your raw data.

  7. On the algorithm column, click the newly added row and select from the list the algorithm you want to apply to the corresponding column. In this example, select N first characters of each word.

  8. Do the same for the second row on the algorithm column to select first N characters of the string.

  9. Click in the Value column next to the algorithm column and enter the value for the selected algorithm, when needed.

    In this scenario, enter 1 for both rows. The first letter of each field in the corresponding columns will be used to generate the key.

Configuring the key generation for the second pass

  1. Double-click the second tGenKey to open the Component view.

  2. In the Algorithm table, define the column you want to use to partition data, account_num in this component. Select the first N characters of the string algorithm and set the value to 1 in the Value column.

    Each of the two tGenKey components will generate a read_only T_GEN_KEY column in the output schema. You must rename one of theT_GEN_KEY columns to stop them from overwriting each other.

  3. Double-click the tMap component to open its editor.

  4. In the Schema editor, copy the columns from the first table onto the second table and rename T_GEN_KEY to T_GEN_KEY1, for example.

  5. In the top part of the editor, drop all columns from the input table to the output table.

  6. Click Ok to save data transformation and close the editor.

  7. In the tGenKey basic settings, click the button to verify that the two generated keys are named differently in the output schema.

Configuring the two matching passes

Configuring the first pass

  1. Double-click the first tMatchGroup labelled pass1 to display the Configuration Wizard.