Scenario 2: Matching input data against a reference file based on a dynamic column - 6.3

Talend Components Reference Guide

This scenario describes a five-component Job that matches the family information in a main input file against the entries in a reference input file, then displays the exact matches and the rejected rows in separate tables on the console. A dynamic column is used so that the individual columns do not have to be configured in the schema of each component.

Dropping and linking the components

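  1. Drop two tFileInputDelimited components, a tJoin component, and two tLogRow components from the Palette onto the design workspace, and label them to identify their roles in the Job: Main_Input and Ref_Input for the two tFileInputDelimited components, Check for the tJoin component, and Matches and Rejects for the two tLogRow components.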

  2. Connect the tFileInputDelimited component labelled Main_Input to the tJoin component, which is labelled Check, using a Row > Main connection.

  3. Repeat the step above to connect the tFileInputDelimited component labelled Ref_Input to the tJoin component. This Row connection automatically appears as a lookup link.

  4. Connect the tJoin component to the tLogRow component labelled Matches using a Row > Main connection. This link will gather the data of the exact matches.

  5. Connect the tJoin component to the tLogRow component labelled Rejects using a Row > Inner join reject connection. This link will gather the rejected data.
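The wiring above amounts to an inner join with a reject output: each row from the main flow is looked up in the reference flow and sent to one of two outputs. The following is a minimal Python sketch of that routing idea only, not of the code the Studio generates; the function name `route_rows` and the sample `id` keys are hypothetical.

```python
def route_rows(main_rows, lookup_rows, key):
    """Split main_rows into (matches, rejects).

    A main row is a match when key(row) also occurs among the lookup rows;
    otherwise it goes to the reject output.
    """
    lookup_keys = {key(row) for row in lookup_rows}
    matches, rejects = [], []
    for row in main_rows:
        target = matches if key(row) in lookup_keys else rejects
        target.append(row)
    return matches, rejects


# Hypothetical stand-ins for the Main_Input and Ref_Input flows.
main_flow = [{"id": 1}, {"id": 2}, {"id": 3}]
lookup_flow = [{"id": 2}]

matches, rejects = route_rows(main_flow, lookup_flow, key=lambda r: r["id"])
print("Matches:", matches)
print("Rejects:", rejects)
```

Building a set of lookup keys first means each main row is checked with a single membership test rather than a scan of the whole reference flow.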

Configuring the components

  1. Double-click the tFileInputDelimited component labelled Main_Input to display its Basic settings view.

    Warning

    The dynamic schema feature is only supported in Built-In mode and requires the input file to have a header row.

  2. Click the [...] button next to the File Name/Stream field to browse to your main input file, and enter 1 in the Header field to define the first row as the header row.

    In this use case, the main input file contains the following information:

    FirstName;LastName;HouseNo;Street;City
    Gerald;Roosevelt;48;Fairview Avenue;Oklahoma City
    Benjamin;Harrison;27;Katella Avenue;Little Rock
    Bob;Clinton;11;Bowles Avenue;Raleigh
    James;Quincy;45;Cerrillos Road;Saint Paul
    Gerald;Harrison;27;Katella Avenue;Little Rock
    Harry;Madison;85;Santa Monica Road;Raleigh
    Helen;Roosevelt;48;Fairview Avenue;Oklahoma City
    Mary;Clinton;11;Bowles Avenue;Raleigh
    Cathey;Quincy;45;Cerrillos Road;Saint Paul
    John;Smith;64;Market Street;Helena

  3. Click Edit schema to define the schema for this component.

    In this use case, the main input file has five columns: FirstName, LastName, HouseNo, Street, and City. Thanks to the dynamic schema feature, however, only two columns need to be defined: one String column for the first names, and one Dynamic column for the family information. To do so:

    1. Click the [+] button to add two columns, and name them FirstName and FamilyInfo respectively.

    2. Select String from the Type list for the FirstName column to retrieve the first name of each person on the name list.

    3. Select Dynamic from the Type list for the FamilyInfo column to retrieve the remaining information about each person on the name list: the last name, house number, street, and city, which together identify a family.

    4. Click OK to propagate the schema and close the [Schema] dialog box.

  4. In the same way, define the properties of the tFileInputDelimited component labelled Ref_Input: the path to the reference input file, the header row, and the schema. This time, define only one dynamic column, FamilyInfo, to retrieve all four columns of the reference input file, which contains the following information:

    LastName;HouseNo;Street;City
    Clinton;11;Bowles Avenue;Raleigh
    Quincy;45;Cerrillos Road;Saint Paul
    Smith;64;Market Street;Helena

  5. Double-click the tJoin component to open its Basic settings view.

  6. Click Edit schema to open the [Schema] dialog box to check the data structures of the input files and define the data you want to pass to the output components.

    In this scenario, we want to pass both columns of the main input file, FirstName and FamilyInfo, to the output components, so simply copy the schema columns of the main input file by clicking the [->>] button. Then click OK to validate the schema and close the dialog box.

  7. In the Key definition area, click the [+] button to add one column to the list. Then select the input column you want to match from the Input key attribute list, and the reference column you want to match it against from the Lookup key attribute list: FamilyInfo and row2.FamilyInfo respectively in this example.

  8. Make sure that the Inner join (with reject output) check box is selected, so that one of the outputs is defined as the inner join reject table.

  9. In the Basic settings view of each tLogRow component, select the Table option to display the output information in table cells.
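To illustrate what the schema configuration above achieves, the sketch below reads a delimited file whose first row is the header, assigns the leading columns to named fields, and bundles all remaining columns into a single field. It is a conceptual Python analogy, not Talend's implementation: the helper `read_with_dynamic` is hypothetical, and it assumes the fixed columns come before the dynamic one, as in this scenario. Only two rows of the main file and one row of the reference file are reproduced.

```python
import csv
import io

MAIN_FILE = """FirstName;LastName;HouseNo;Street;City
Gerald;Roosevelt;48;Fairview Avenue;Oklahoma City
Bob;Clinton;11;Bowles Avenue;Raleigh
"""

REF_FILE = """LastName;HouseNo;Street;City
Clinton;11;Bowles Avenue;Raleigh
"""


def read_with_dynamic(text, fixed_names, dynamic_name):
    """Read ';'-delimited text whose first row is the header.

    The first len(fixed_names) columns become ordinary named fields.
    All remaining columns are bundled, in file order, into one field
    (dynamic_name) holding (column name, value) pairs from the header.
    """
    reader = csv.reader(io.StringIO(text), delimiter=";")
    header = next(reader)
    n_fixed = len(fixed_names)
    rows = []
    for values in reader:
        row = dict(zip(fixed_names, values[:n_fixed]))
        row[dynamic_name] = tuple(zip(header[n_fixed:], values[n_fixed:]))
        rows.append(row)
    return rows


main_rows = read_with_dynamic(MAIN_FILE, ["FirstName"], "FamilyInfo")
ref_rows = read_with_dynamic(REF_FILE, [], "FamilyInfo")

# Both FamilyInfo fields carry the same four column names, so they can
# be compared directly as join keys.
print(main_rows[1]["FamilyInfo"] == ref_rows[0]["FamilyInfo"])
print(main_rows[0]["FamilyInfo"] == ref_rows[0]["FamilyInfo"])
```

Because the column names are taken from the header row rather than from the schema, the same reader handles both the five-column main file and the four-column reference file, which is also why the header row is required.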

Saving and executing the Job

  1. Press Ctrl+S to save your Job.

  2. Press F6, or click Run on the Run tab to execute the Job.

    The console displays the exact matches and rejected data in two different tables.
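To anticipate which rows should land in each table, the sample data can be checked outside the Studio. The Python sketch below assumes that a row matches only when all four family fields (last name, house number, street, and city) are equal to those of a reference entry. It derives the split from the sample files shown earlier and does not reproduce the tLogRow table layout; the helper `data_rows` is hypothetical.

```python
import csv
import io

MAIN_FILE = """FirstName;LastName;HouseNo;Street;City
Gerald;Roosevelt;48;Fairview Avenue;Oklahoma City
Benjamin;Harrison;27;Katella Avenue;Little Rock
Bob;Clinton;11;Bowles Avenue;Raleigh
James;Quincy;45;Cerrillos Road;Saint Paul
Gerald;Harrison;27;Katella Avenue;Little Rock
Harry;Madison;85;Santa Monica Road;Raleigh
Helen;Roosevelt;48;Fairview Avenue;Oklahoma City
Mary;Clinton;11;Bowles Avenue;Raleigh
Cathey;Quincy;45;Cerrillos Road;Saint Paul
John;Smith;64;Market Street;Helena
"""

REF_FILE = """LastName;HouseNo;Street;City
Clinton;11;Bowles Avenue;Raleigh
Quincy;45;Cerrillos Road;Saint Paul
Smith;64;Market Street;Helena
"""


def data_rows(text):
    """Return the rows after the header as tuples of strings."""
    reader = csv.reader(io.StringIO(text), delimiter=";")
    next(reader)  # skip the header row
    return [tuple(row) for row in reader]


# Assumption: the join key is the full set of family fields.
reference = set(data_rows(REF_FILE))

matched_names = []
rejected_names = []
for first_name, *family in data_rows(MAIN_FILE):
    if tuple(family) in reference:
        matched_names.append(first_name)
    else:
        rejected_names.append(first_name)

print("Matches:", matched_names)
print("Rejects:", rejected_names)
```

Under that assumption, the Clinton, Quincy, and Smith families go to the matches table, and the Roosevelt, Harrison, and Madison rows go to the rejects table.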