Scenario 2: Deduplicating entries based on dynamic schema - 6.1

Talend Components Reference Guide

In this use case, we use a Job similar to the one in the previous scenario to deduplicate input entries about several families, so that only one person per family remains on the name list. Because every component in this Job supports the dynamic schema feature, we use it to avoid configuring the individual columns of each schema.

Setting up the Job

  1. Drop these components from the Palette to the design workspace: tFileInputDelimited, tExtractDynamicFields, tUniqRow, tFileOutputDelimited, and tLogRow. Label them People, Split_Column, Deduplicate, Unique_Families, and Duplicated_Families respectively to better identify their roles in the Job.

  2. Connect the component labelled People, the component labelled Split_Column, and the component labelled Deduplicate using Row > Main connections.

  3. Connect the component labelled Deduplicate and the component labelled Unique_Families using a Main > Uniques connection.

  4. Connect the component labelled Deduplicate and the component labelled Duplicated_Families using a Main > Duplicates connection.
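
The wiring described in the steps above can be pictured with a small sketch. This is not Talend code: the list of triples and the outputs_of helper are hypothetical names used only for illustration, and only the component labels and connection types come from the steps themselves.

```python
# Illustrative only: the Job's wiring as (source, connection, target) triples.
# The labels and connection types come from the setup steps; the data
# structure and the helper below are assumptions made for this sketch.
CONNECTIONS = [
    ("People", "Main", "Split_Column"),
    ("Split_Column", "Main", "Deduplicate"),
    ("Deduplicate", "Uniques", "Unique_Families"),
    ("Deduplicate", "Duplicates", "Duplicated_Families"),
]


def outputs_of(component):
    """Return {connection type: target label} for one component's outgoing links."""
    return {kind: target for source, kind, target in CONNECTIONS if source == component}


print(outputs_of("Deduplicate"))
```

Deduplicate is the only component with two outgoing flows: one for the rows it keeps and one for the rows it rejects as duplicates.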

Configuring the components

  1. Double-click the component labelled People to display its Basic settings view.

    Warning

    The dynamic schema feature is only supported in Built-In mode and requires the input file to have a header row.

  2. Click the [...] button next to the File Name/Stream field to browse to your input file.

  3. Define the header and footer rows. In this use case, the first row of the input file is the header row.

  4. Click Edit schema to define the schema for this component.

    In this use case, the input file has five columns: FirstName, LastName, HouseNo, Street, and City. However, thanks to the dynamic schema feature, we only need to define a single dynamic column in the schema, named Dyna in this example.

    To do so:

    1. Add a new line by clicking the [+] button.

    2. Type Dyna in the Column field.

    3. Select Dynamic from the Type list.

    4. Then, click OK to propagate the schema and close the [Schema] dialog box.

  5. Double-click the component labelled Split_Column to display its Basic settings view.

    We will use this component to split the dynamic column of the input schema into two columns: one for the first name and the other for the family-related information. To do so:

    1. Click Edit schema to open the [Schema] dialog box.

    2. In the output panel, click the [+] button to add two columns for the output schema, and name them FirstName and FamilyInfo respectively.

    3. Select String from the Type list for the FirstName column. This extracts the column from the input schema so that it carries the first name of each person on the name list.

    4. Select Dynamic from the Type list for the FamilyInfo column, so that this column carries the remaining information about each person on the name list: the last name, house number, street, and city, which together identify a family.

    5. Then, click OK to propagate the schema and close the [Schema] dialog box.

  6. Double-click the component labelled Deduplicate to display its Basic settings view.

  7. In the Unique key area, select the Key attribute check box for the FamilyInfo column to carry out deduplication on the family information.

  8. In the Basic settings view of the tFileOutputDelimited component, which is labelled Unique_Families, define the output file path, select the Include header check box, and leave the other settings as they are.

  9. In the Basic settings view of the tLogRow component, which is labelled Duplicated_Families, select the Table option to view the Job execution result in table mode.
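
As a rough analogy for the schema configuration above, the sketch below reads a delimited input whose columns are discovered from the header row instead of being declared one by one, then splits each row into FirstName and a catch-all FamilyInfo. It is plain Python rather than anything Talend generates. The sample rows, the semicolon delimiter, and the split_dynamic function are assumptions made for illustration; only the five column names come from this scenario.

```python
import csv
import io

# Hypothetical sample input; only the five column names come from the scenario.
SAMPLE = """FirstName;LastName;HouseNo;Street;City
Anna;Smith;12;Oak Street;Springfield
Ben;Smith;12;Oak Street;Springfield
Cara;Jones;7;Elm Road;Riverton
"""


def split_dynamic(lines, delimiter=";"):
    """Yield (first_name, family_info) pairs.

    Column names are taken from the header row rather than declared up front,
    which is the role the single Dyna column plays in the Job. family_info
    keeps every column except FirstName, whatever those columns happen to be.
    """
    reader = csv.DictReader(lines, delimiter=delimiter)
    for row in reader:
        first_name = row["FirstName"]
        family_info = {name: value for name, value in row.items() if name != "FirstName"}
        yield first_name, family_info


rows = list(split_dynamic(io.StringIO(SAMPLE)))
print(rows[0])
```

Because the remaining columns are never named in the code, adding or removing a family column in the input would not require any change to the split, which is the time saving the dynamic schema is meant to provide.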

Saving and executing the Job

  1. Press Ctrl+S to save your Job.

  2. Run the Job by pressing F6 or clicking the Run button on the Run tab.

    Information about the duplicated families is displayed on the Run console, and only one person per family remains on the name list in the output file.
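
To make the expected outcome concrete, here is a minimal Python sketch of the same deduplication logic. It assumes that the first row seen for each family is the one kept and that later rows for the same family are routed to the duplicates flow. The sample rows, the deduplicate function, and the semicolon delimiter are hypothetical and stand in for the real input file and Job components.

```python
import csv
import io

# Hypothetical rows standing in for the input file.
PEOPLE = [
    {"FirstName": "Anna", "LastName": "Smith", "HouseNo": "12", "Street": "Oak Street", "City": "Springfield"},
    {"FirstName": "Ben", "LastName": "Smith", "HouseNo": "12", "Street": "Oak Street", "City": "Springfield"},
    {"FirstName": "Cara", "LastName": "Jones", "HouseNo": "7", "Street": "Elm Road", "City": "Riverton"},
]


def deduplicate(rows, key_columns):
    """Split rows into (uniques, duplicates) using the values of key_columns.

    Assumption for this sketch: the first row seen for a key is kept as
    unique, and every later row with the same key counts as a duplicate.
    """
    seen = set()
    uniques, duplicates = [], []
    for row in rows:
        key = tuple(row[name] for name in key_columns)
        if key in seen:
            duplicates.append(row)
        else:
            seen.add(key)
            uniques.append(row)
    return uniques, duplicates


columns = list(PEOPLE[0])
family_columns = [name for name in columns if name != "FirstName"]
uniques, duplicates = deduplicate(PEOPLE, family_columns)

# Unique rows go to an in-memory "file" with a header row,
# comparable to selecting the Include header check box.
out = io.StringIO()
writer = csv.DictWriter(out, fieldnames=columns, delimiter=";", lineterminator="\n")
writer.writeheader()
writer.writerows(uniques)
print(out.getvalue())

# Duplicate rows go to the console, one line per row.
for row in duplicates:
    print(" | ".join(row[name] for name in columns))
```

With this sample, Anna and Cara end up in the output, while Ben, who shares a last name and address with Anna, is reported as a duplicate.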