Scenario 6: Advanced mapping with lookup reload at each row - 6.1

Talend Components Reference Guide

EnrichVersion
6.1
EnrichProdName
Talend Big Data
Talend Big Data Platform
Talend Data Fabric
Talend Data Integration
Talend Data Management Platform
Talend Data Services Platform
Talend ESB
Talend MDM Platform
Talend Open Studio for Big Data
Talend Open Studio for Data Integration
Talend Open Studio for Data Quality
Talend Open Studio for ESB
Talend Open Studio for MDM
Talend Real-Time Big Data Platform
task
Data Governance
Data Quality and Preparation
Design and Development
EnrichPlatform
Talend Studio

The following scenario describes a Job that retrieves people details from a lookup database, based on a join on the age. The main flow source data is read from a MySQL database table called people_age that contains people details such as numeric id, alphanumeric first name and last name and numeric age. The people age is either 40 or 60. The number of records in this table is intentionally restricted.

The reference or lookup information is also stored in a MySQL database table called large_data_volume. This lookup table contains a number of records including the city where people from the main flow have been to. For the sake of clarity, the number of records is restricted but, in a normal use, the usefulness of the feature described in the example below is more obvious for very large reference data volume.

To optimize performance, a database connection component is used in the beginning of the Job to open the connection to the lookup database table in order not to do that every time we want to load a row from the lookup table.

An Expression Filter is applied to this lookup source flow, in order to select only data from people whose age is equal to 60 or 40. This way only the relevant rows from the lookup database table are loaded for each row from the main flow.

Therefore this Job shows how, from a limited number of main flow rows, the lookup join can be optimized to load only results matching the expression key.

Note

Generally speaking, as the lookup loading is performed for each main flow row, this option is mainly interesting when a limited number of rows is processed in the main flow while a large number of reference rows are to be looked up to.

The join is solved on the age field. Then, using the relevant loading option in the tMap component editor, the lookup database information is loaded for each main flow incoming row.

For this Job, the metadata has been prepared for the source and connection components. For more information on how to set up the DB connection schema metadata, see the relevant section in the Talend Studio User Guide.

This Job is formed with five components, four database components and a mapping component.

Linking the components

  1. Drop the DB Connection under the Metadata node of the Repository to the design workspace. In this example, the source table is called people_age.

  2. Select tMysqlInput from the list that pops up when dropping the component.

  3. Drop the lookup DB connection table from the Metadata node to the design workspace selecting tMysqlInput from the list that pops up. In this Job, the lookup is called large_data_volume.

  4. The same way, drop the DB connection from the Metadata node to the design workspace selecting tMysqlConnection from the list that pops up. This component creates a permanent connection to the lookup database table in order not to do that every time we want to load a row from the lookup table.

  5. Then pick the tMap component from the Processing family, and the tMysqlOutput and tMysqlCommit components from the Database family in the Palette to the right hand side of the editor.

  6. Now connect all the components together. To do so, right-click the tMysqlInput component corresponding to the people table and drag the link towards tMap.

  7. Release the link over the tMap component, the main row flow is automatically set up.

  8. Rename the Main row link to people, to identify more easily the main flow data.

  9. Perform the same operation to connect the lookup table (large_data_volume) to the tMap component and the tMap to the tMysqlOutput component.

  10. A dialog box prompts for a name to the output link. In this example, the output flow is named: people_mixandmatch.

  11. Rename also the lookup row connection link to large_volume, to help identify the reference data flow.

  12. Connect tMysqlConnection to tMysqlInput using the trigger link OnSubjobOk.

  13. Connect the tMysqlInput component to the tMysqlCommit component using the trigger link OnSubjobOk.

Configuring the components

  1. Double-click the tMap component to open the graphical mapping editor.

  2. The Output table (that was created automatically when you linked the tMap to the tMySQLOutput will be formed by the matching rows from the lookup flow (large_data_volume) and the main flow (people_age).

    Select the main flow rows that are to be passed on to the output and drag them over to paste them in the Output table (to the right hand side of the mapping editor).

    In this example, the selection from the main flow include the following fields: id, first_name, last_Name and age.

    From the lookup table, the following column is selected: city.

    Drop the selected columns from the input tables (people and large_volume) to the output table.

  3. Now set up the join between the main and lookup flows.

    Select the age column of the main flow table (on top) and drag it towards the age column of the lookup flow table (large_volume in this example).

    A key icon appears next to the linked expression on the lookup table. The join is now established.

  4. Click the tMap settings button, click the three-dot button corresponding to Lookup Model, and select the Reload at each row option from the [Options] dialog box in order to reload the lookup for each row being processed.

  5. In the same way, set Match Model to All matches in the Lookup table, in order to gather all instances of age matches in the output flow.

  6. Now implement the filtering, based on the age column, in the Lookup table. The GlobalMapKey field is automatically created when you selected the Reload at each row option. Indeed you can use this expression to dynamically filter the reference data in order to load only the relevant information when joining with the main flow.

    As mentioned in the introduction of the scenario, the main flow data contains only people whose age is either 40 or 60. To avoid the pain of loading all lookup rows, including ages that are different from 40 and 60, you can use the main flow age as global variable to feed the lookup filtering.

  7. Drop the Age column from the main flow table to the Expr. field of the lookup table.

  8. Then in the globalMap Key field, put in the variable name, using the expression. In this example, it reads: "people.Age"

    Click OK to save the mapping setting and go back to the design workspace.

  9. To finalize the implementation of the dynamic filtering of the lookup flow, you need now to add a WHERE clause in the query of the database input.

  10. At the end of the Query field, following the Select statement, type in the following WHERE clause: WHERE AGE ='"+((Integer)globalMap.get("people.Age"))+"'"

  11. Make sure that the type corresponds to the column used as variable. In this use case, Age is of Integer type. And use the variable the way you set in the globalMap key field of the map editor.

  12. Double-click the tMysqloutput component to define its properties.

  13. Select the Use an existing connection check box to leverage the created DB connection.

    Define the target table name and relevant DB actions.

Executing the Job

  1. Press Ctrl + S to save the Job.

  2. Click the Run tab at the bottom of the design workspace, to display the Job execution tab.

  3. From the Debug Run view, click the Traces Debug button to view the data processing progress.

    For more comfort, you can maximize the Job design view while executing by simply double-clicking on the Job name tab.

    The lookup data is reloaded for each of the main flow's rows, corresponding to the age constraint. All age matches are retrieved in the lookup rows and grouped together in the output flow.

    Therefore if you check out the data contained in the newly created people_mixandmatch table, you will find all the age duplicates corresponding to different individuals whose age equals to 60 or 40 and the city where they have been to.

    For examples of how to use dynamic schemas with tMap, see: