Scenario 2: Finding duplicate files between two folders - 6.3

Talend Components Reference Guide

This scenario describes a Job that iterates over the files in two folders, converts the iteration results into data flows to build a list of filenames, and then picks out the duplicates from that list and displays them on the Run console. Such a Job can serve, for example, as a preparation step before merging the two folders.

Dropping and linking the components

  1. From the Palette, drop two tFileList components, two tIterateToFlow components, two tFileOutputDelimited components, a tFileInputDelimited component, a tUniqRow component, and a tLogRow component onto the design workspace.

  2. Link the first tFileList component to the first tIterateToFlow component using a Row > Iterate connection, and then connect the first tIterateToFlow component to the first tFileOutputDelimited component using a Row > Main connection to form the first subjob.

  3. Link the second tFileList component to the second tIterateToFlow component using a Row > Iterate connection, and then connect the second tIterateToFlow component to the second tFileOutputDelimited component using a Row > Main connection to form the second subjob.

  4. Link the tFileInputDelimited component to the tUniqRow component using a Row > Main connection, and the tUniqRow component to the tLogRow component using a Row > Duplicates connection to form the third subjob.

  5. Link the three subjobs using Trigger > On Subjob Ok connections so that they will be triggered one after another, and label the components to better identify their roles in the Job.
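As a rough illustration of what the first two subjobs do, the following plain-Java sketch lists the filenames in a folder and writes them to a text file, appending on the second pass. It is not the code that the Studio generates. The class and method names (FileNameLister, listFileNames, writeFileNames) are hypothetical, and the sketch assumes that only regular files directly inside the folder are listed and that the names are sorted for repeatable output.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.nio.file.StandardOpenOption;
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.Stream;

class FileNameLister {

    // Hypothetical helper: names of the regular files directly inside dir
    // (assumption: no recursion into subfolders, sorted for repeatability).
    static List<String> listFileNames(Path dir) {
        try (Stream<Path> entries = Files.list(dir)) {
            return entries
                    .filter(Files::isRegularFile)
                    .map(p -> p.getFileName().toString())
                    .sorted()
                    .collect(Collectors.toList());
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Writes one filename per line to out; append=true keeps earlier lines,
    // append=false replaces the file content.
    static void writeFileNames(Path dir, Path out, boolean append) {
        StandardOpenOption mode = append
                ? StandardOpenOption.APPEND
                : StandardOpenOption.TRUNCATE_EXISTING;
        try {
            Files.write(out, listFileNames(dir), StandardCharsets.UTF_8,
                    StandardOpenOption.CREATE, StandardOpenOption.WRITE, mode);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        // Paths taken from this scenario; adjust them to your environment.
        Path out = Paths.get("D:/temp/tempdata.csv");
        writeFileNames(Paths.get("E:/DataFiles/DI/images"), out, false); // first subjob
        writeFileNames(Paths.get("E:/DataFiles/DQ/images"), out, true);  // second subjob, appends
    }
}
```

Running the second call in append mode is what lets the third subjob read the filenames of both folders from a single file.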

Configuring the components

  1. In the Basic settings view of the first tFileList component, fill the Directory field with the path to the first folder you want to read filenames from, E:/DataFiles/DI/images in this example, and leave the other settings as they are.

  2. Double-click the first tIterateToFlow component to show its Basic settings view.

  3. Click the [...] button next to Edit schema to open the [Schema] dialog box and define the schema of the text file the next component will write filenames to. When done, click OK to close the dialog box and propagate the schema to the next component.

    In this example, the schema contains only one column: Filename.

  4. In the Value field of the Mapping table, press Ctrl+Space to access the autocomplete list of variables, and select the global variable ((String)globalMap.get("tFileList_1_CURRENT_FILE")). This variable holds the name of the file currently being iterated on in the input directory, and each name is put into a data flow that is passed to the next component.

  5. In the Basic settings view of the first tFileOutputDelimited component, fill the File Name field with the path of the text file that will store the filenames from the incoming flow, D:/temp/tempdata.csv in this example. This completes the configuration of the first subjob.

  6. Repeat the steps above to complete the configuration of the second subjob, but:

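    • fill the Directory field in the Basic settings view of the second tFileList component with the path to the other folder you want to read filenames from, E:/DataFiles/DQ/images in this example.

    • in the Value field of the Mapping table of the second tIterateToFlow component, select the global variable of the second tFileList component, ((String)globalMap.get("tFileList_2_CURRENT_FILE")) assuming that component keeps its default name tFileList_2.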

    • select the Append check box in the Basic settings view of the second tFileOutputDelimited component so that the filenames previously written to the text file will not be overwritten.

  7. In the Basic settings view of the tFileInputDelimited component, fill the File name/Stream field with the path of the text file that stores the list of filenames, D:/temp/tempdata.csv in this example, and define the file schema, which contains only one column in this example, Filename.

  8. In the Basic settings view of the tUniqRow component, select the Key attribute check box for the only column, Filename in this example.

  9. In the Basic settings view of the tLogRow component, select the Table (print values in cells of a table) option to make the output easier to read.
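To make the expression used in the Mapping table more concrete, here is a minimal sketch of how a value stored under the key tFileList_1_CURRENT_FILE can be read back on each iteration and turned into a row. It assumes that globalMap behaves like an ordinary map from String keys to Object values. The class GlobalMapSketch and the method iterateToFlow are hypothetical names used for illustration only, not part of any Talend API.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

class GlobalMapSketch {

    // Stand-in for the Job-wide globalMap (assumed to be a String-to-Object map).
    static final Map<String, Object> globalMap = new HashMap<>();

    // Hypothetical illustration: for each file name, store it under the key
    // that the file-listing step would use, then read it back with the same
    // expression as in the Mapping table and emit it as a one-column row.
    static List<String> iterateToFlow(List<String> fileNames) {
        List<String> rows = new ArrayList<>();
        for (String name : fileNames) {
            globalMap.put("tFileList_1_CURRENT_FILE", name);
            // The cast is needed because the map stores plain Object values.
            String filename = ((String) globalMap.get("tFileList_1_CURRENT_FILE"));
            rows.add(filename);
        }
        return rows;
    }

    public static void main(String[] args) {
        // Illustrative file names only.
        List<String> rows = iterateToFlow(Arrays.asList("banner.png", "logo.png"));
        rows.forEach(System.out::println);
    }
}
```

Because the key is built from the name of the component that sets it, an expression copied from one subjob to another needs to reference the component of the subjob it is used in.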

Executing the Job

  1. Press Ctrl+S to save your Job.

  2. Click Run or press F6 to run the Job.

    All the duplicate files between the selected folders are displayed on the console.
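The duplicate detection behind this result can be sketched in plain Java as follows. The sketch assumes that the first occurrence of a key value counts as unique and that every later occurrence counts as a duplicate, with Filename as the only key. The class DuplicateRows and its duplicates method are hypothetical names, and the filenames in main are illustrative.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

class DuplicateRows {

    // Returns every filename that has already been seen earlier in the list.
    // The first occurrence is treated as unique and is not returned.
    static List<String> duplicates(List<String> filenames) {
        Set<String> seen = new HashSet<>();
        List<String> dups = new ArrayList<>();
        for (String name : filenames) {
            // Set.add returns false when the value is already present.
            if (!seen.add(name)) {
                dups.add(name);
            }
        }
        return dups;
    }

    public static void main(String[] args) {
        // Names from the first folder followed by names from the second folder.
        List<String> combined = Arrays.asList(
                "banner.png", "logo.png", "icon.png",   // first folder
                "logo.png", "photo.png", "icon.png");   // second folder
        duplicates(combined).forEach(System.out::println);
    }
}
```

Since a single folder cannot contain two files with the same name, any name reported here when the folders are read without their subfolders is present in both folders.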

For other scenarios using tFileList, see tFileCopy.