tFileList - 6.3

Talend Components Reference Guide

Function

tFileList iterates on the files or folders of a given directory.

Purpose

tFileList retrieves a set of files or folders based on a filemask pattern and iterates on each item.

tFileList properties

Component family

File/Management

 

Basic settings

Directory

Path to the directory where the files are stored.

 

FileList Type

Select the type of input you want to iterate on from the list:

Files if the input is a set of files,

Directories if the input is a set of directories,

Both if the input is a set of the above two types.

 

Include subdirectories

Select this check box to also include the contents of sub-directories of the selected directory.
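
As an illustration only, the effect of FileList Type and Include subdirectories can be sketched in plain Java. This is not the code that tFileList generates; the class name FileListTypeSketch, the ListType enum and the list method are hypothetical names chosen for this sketch.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.Stream;

class FileListTypeSketch {
    // Hypothetical stand-in for the three entries of the FileList Type list.
    enum ListType { FILES, DIRECTORIES, BOTH }

    // Lists the entries under 'directory'. 'includeSubdirs' plays the role of
    // the Include subdirectories check box: false looks at direct children only.
    static List<Path> list(Path directory, ListType type, boolean includeSubdirs)
            throws IOException {
        int depth = includeSubdirs ? Integer.MAX_VALUE : 1;
        try (Stream<Path> entries = Files.walk(directory, depth)) {
            return entries
                    .filter(p -> !p.equals(directory)) // skip the root itself
                    .filter(p -> type == ListType.BOTH
                            || (type == ListType.FILES && Files.isRegularFile(p))
                            || (type == ListType.DIRECTORIES && Files.isDirectory(p)))
                    .sorted()
                    .collect(Collectors.toList());
        }
    }
}
```

Calling list(dir, ListType.FILES, true) on a folder would return its files together with the files of every nested folder, while passing false would return only the files directly in dir.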

 

Case Sensitive

Select the case mode from the list to make the filter on filenames either case sensitive or case insensitive.

 

Generate Error if no file found

Select this check box to generate an error message if no files or directories are found.

 

Use Glob Expressions as Filemask

This check box is selected by default. When it is selected, the results are filtered using glob expressions.

 

Files

Click the plus button to add as many filter lines as needed:

Filemask: in the added filter lines, type in a filename or a filemask using special characters or regular expressions.
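
To make the difference between a glob filemask and a regular expression concrete, here is a minimal sketch in plain Java. It assumes the glob semantics of java.nio.file, which may differ in details from the matching that tFileList itself performs, and glob case sensitivity in java.nio depends on the platform. The class and method names are hypothetical.

```java
import java.nio.file.FileSystems;
import java.nio.file.PathMatcher;
import java.nio.file.Paths;
import java.util.regex.Pattern;

class FilemaskSketch {
    // Glob style, for example "*.csv" or "data?.txt".
    static boolean matchesGlob(String fileName, String globMask) {
        PathMatcher matcher =
                FileSystems.getDefault().getPathMatcher("glob:" + globMask);
        return matcher.matches(Paths.get(fileName));
    }

    // Regular-expression style, for example ".*\\.csv".
    // 'caseSensitive' mimics the Case Sensitive setting.
    static boolean matchesRegex(String fileName, String regex, boolean caseSensitive) {
        int flags = caseSensitive ? 0 : Pattern.CASE_INSENSITIVE;
        return Pattern.compile(regex, flags).matcher(fileName).matches();
    }
}
```

With these helpers, the glob mask "*.csv" and the regular expression ".*\\.csv" select the same file names; the glob form is shorter, the regular expression more expressive.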

 

Order by

Folders are listed first, then files. You can order the folders and files in one of the following ways:

By default: alphabetical order, by folder then file;

By file name: alphabetical order or reverse alphabetical order;

By file size: smallest to largest or largest to smallest;

By modified date: most recent to least recent or least recent to most recent.

Note

When ordering by file name, identical file names are further ordered by modified date. When ordering by file size, identical file sizes are further ordered by file name. When ordering by modified date, identical dates are further ordered by file name.

Order action

Select a sort order by clicking one of the following radio buttons:

ASC: ascending order;

DESC: descending order.
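
The tie-breaking rule for ordering by file size can be sketched as a Java comparator. This is an illustration under assumptions, not the component's implementation: the Entry class and method names are hypothetical, and the source does not say whether DESC also reverses the tie-breaker, so this sketch keeps file names ascending in both cases.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

class OrderSketch {
    // Hypothetical minimal view of a listed file: just what the ordering needs.
    static final class Entry {
        final String name;
        final long size;

        Entry(String name, long size) {
            this.name = name;
            this.size = size;
        }
    }

    // Order by file size; identical sizes fall back to file name.
    static Comparator<Entry> bySize(boolean ascending) {
        Comparator<Entry> primary = Comparator.comparingLong((Entry e) -> e.size);
        if (!ascending) {
            primary = primary.reversed(); // DESC: largest first
        }
        // Assumption: the tie-breaker stays alphabetical even for DESC.
        return primary.thenComparing((Entry e) -> e.name);
    }

    // Returns the file names in the requested order, leaving the input untouched.
    static List<String> sortedNames(List<Entry> entries, boolean ascending) {
        List<Entry> copy = new ArrayList<>(entries);
        copy.sort(bySize(ascending));
        List<String> names = new ArrayList<>();
        for (Entry e : copy) {
            names.add(e.name);
        }
        return names;
    }
}
```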

Advanced settings

Use Exclude Filemask

Select this check box to enable the Exclude Filemask field, which excludes files from the results based on file type:

Exclude Filemask: fill in this field with the file types to exclude from the files selected by the Filemasks defined in the Basic settings view.

Note

File types in this field should be enclosed in double quotation marks and separated by commas.
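
A rough way to picture how an exclude filemask combines with the filemasks of the Basic settings view: a file is kept if it matches at least one include mask and no exclude mask. The Java sketch below assumes java.nio.file glob semantics, which may not match the component exactly; the class and method names are hypothetical.

```java
import java.nio.file.FileSystems;
import java.nio.file.PathMatcher;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.List;

class ExcludeMaskSketch {
    // Keeps the names that match any include mask and none of the exclude masks.
    static List<String> filter(List<String> names,
                               List<String> includeMasks,
                               List<String> excludeMasks) {
        List<String> kept = new ArrayList<>();
        for (String name : names) {
            if (matchesAny(name, includeMasks) && !matchesAny(name, excludeMasks)) {
                kept.add(name);
            }
        }
        return kept;
    }

    // True if 'name' matches at least one of the glob masks.
    private static boolean matchesAny(String name, List<String> masks) {
        for (String mask : masks) {
            PathMatcher matcher =
                    FileSystems.getDefault().getPathMatcher("glob:" + mask);
            if (matcher.matches(Paths.get(name))) {
                return true;
            }
        }
        return false;
    }
}
```

For instance, an include mask of "*" combined with the exclude masks "*.txt" and "*.log" would keep a.csv and drop b.txt and c.log.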

 

Format file path to slash(/) style(useful on Windows)

Select this check box to format the file path to slash (/) style, which is useful on Windows.
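
The effect of this option can be pictured as replacing backslashes with forward slashes in the path. A minimal Java sketch, with a hypothetical class and method name, and not necessarily how the component does it internally:

```java
class SlashStyleSketch {
    // Converts a Windows-style path to slash (/) style.
    static String toSlashStyle(String path) {
        return path.replace('\\', '/');
    }
}
```

Slash-style paths are convenient in Java expressions because the backslash is an escape character in Java string literals.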
 

tStatCatcher Statistics

Select this check box to gather the Job processing metadata at a Job level as well as at each component level.

Usage

tFileList provides a list of files or folders from a defined directory, on which it iterates.

Global Variables

CURRENT_FILE: the current file name. This is a Flow variable and it returns a string.

CURRENT_FILEPATH: the current file path. This is a Flow variable and it returns a string.

CURRENT_FILEEXTENSION: the extension of the current file. This is a Flow variable and it returns a string.

CURRENT_FILEDIRECTORY: the current file directory. This is a Flow variable and it returns a string.

NB_FILE: the number of files iterated upon so far. This is a Flow variable and it returns an integer.

ERROR_MESSAGE: the error message generated by the component when an error occurs. This is an After variable and it returns a string. This variable functions only if the Die on error check box is cleared, if the component has this check box.

A Flow variable functions during the execution of a component while an After variable functions after the execution of the component.

To fill up a field or expression with a variable, press Ctrl + Space to access the variable list and choose the variable to use from it.

For further information about variables, see Talend Studio User Guide.
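
As a sketch of how these variables are typically read in a Java expression, the example below uses an ordinary Map named globalMap as a stand-in for the one the Studio provides at run time. The tFileList_1 prefix assumes the component's unique name is tFileList_1, as in the expression used in Scenario 1; the class and method names are hypothetical.

```java
import java.util.Map;

class GlobalMapSketch {
    // Builds a short description of the file currently being iterated on.
    // Each value is cast to the type documented for the variable.
    static String describe(Map<String, Object> globalMap) {
        String name = (String) globalMap.get("tFileList_1_CURRENT_FILE");
        String path = (String) globalMap.get("tFileList_1_CURRENT_FILEPATH");
        Integer count = (Integer) globalMap.get("tFileList_1_NB_FILE");
        return "#" + count + " " + name + " (" + path + ")";
    }
}
```

Because the Flow variables change on every iteration, an expression like this is only meaningful in components that are linked to tFileList through the Iterate connection.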

Connections

Outgoing links (from this component to another):

Row: Iterate.

Trigger: On Subjob Ok; On Subjob Error; Run if; On Component Ok; On Component Error.

Incoming links (from one component to this one):

Row: Iterate.

Trigger: Run if; On Subjob Ok; On Subjob Error; On Component Ok; On Component Error; Synchronize; Parallelize.

For further information regarding connections, see Talend Studio User Guide.

Log4j

If you are using a subscription-based version of the Studio, the activity of this component can be logged using the log4j feature. For more information on this feature, see Talend Studio User Guide.

For more information on the log4j logging levels, see the Apache documentation at http://logging.apache.org/log4j/1.2/apidocs/org/apache/log4j/Level.html.

Limitation

n/a

Scenario 1: Iterating on a file directory

The following scenario creates a three-component Job, which aims at listing files from a defined directory, reading each file by iteration, selecting delimited data and displaying the output in the Run log console.

Dropping and linking the components

  1. Drop the following components from the Palette to the design workspace: tFileList, tFileInputDelimited, and tLogRow.

  2. Right-click the tFileList component, and pull an Iterate connection to the tFileInputDelimited component. Then pull a Main row from the tFileInputDelimited to the tLogRow component.

Configuring the components

  1. Double-click tFileList to display its Basic settings view and define its properties.

  2. Browse to the Directory that holds the files you want to process. To display the path on the Job itself, use the label (__DIRECTORY__) that shows up when you put the pointer anywhere in the Directory field. Type this label in the Label Format field, which you can find on the View tab of the Component view.

  3. In the Basic settings view and from the FileList Type list, select the source type you want to process, Files in this example.

  4. In the Case Sensitive list, select a case mode, Yes in this example, to create a case-sensitive filter on file names.

  5. Keep the Use Glob Expressions as Filemask check box selected if you want to use glob expressions to filter files, and define a file mask in the Filemask field.

  6. Double-click tFileInputDelimited to display its Basic settings view and set its properties.

  7. In the File Name field, enter a variable that contains the path of the current file, based on the settings you defined for tFileList. Press Ctrl+Space to access the autocomplete list of variables, and select the global variable ((String)globalMap.get("tFileList_1_CURRENT_FILEPATH")). This way, all files in the input directory can be processed.

  8. Fill in all other fields as detailed in the tFileInputDelimited section. Related topic: tFileInputDelimited.

  9. Select the last component, tLogRow, to display its Basic settings view and fill in the separator to be used to distinguish field content displayed on the console. Related topic: tLogRow.

Executing the Job

Press Ctrl + S to save your Job, and press F6 to run it.

The Job iterates on the defined directory, and reads all included files. Then delimited data is passed on to the last component which displays it on the console.
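
Outside the Studio, the behavior of this Job can be approximated in plain Java: list the files of a directory that match a mask, read each one as delimited text, and collect the rows for display. This is a sketch under assumptions, not the code the Studio generates; the class name, the method name and the "|" output separator are hypothetical choices.

```java
import java.io.IOException;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.regex.Pattern;

class IterateDirectorySketch {
    // Reads every regular file in 'directory' matching the glob 'fileMask',
    // splits each non-empty line on 'fieldSeparator', and returns the rows
    // joined with "|", similar to what a log component would print.
    static List<String> run(Path directory, String fileMask, String fieldSeparator)
            throws IOException {
        List<Path> files = new ArrayList<>();
        try (DirectoryStream<Path> stream = Files.newDirectoryStream(directory, fileMask)) {
            for (Path p : stream) {
                if (Files.isRegularFile(p)) {
                    files.add(p);
                }
            }
        }
        Collections.sort(files); // alphabetical order, for predictable output

        List<String> rows = new ArrayList<>();
        for (Path file : files) { // one iteration per file
            for (String line : Files.readAllLines(file)) {
                if (line.isEmpty()) {
                    continue;
                }
                String[] fields = line.split(Pattern.quote(fieldSeparator), -1);
                rows.add(String.join("|", fields));
            }
        }
        return rows;
    }
}
```

The outer loop corresponds to the Iterate connection, and the inner loop to the Main row flow between the file reader and the log component.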

Scenario 2: Finding duplicate files between two folders

This scenario describes a Job that iterates on files in two folders, transforms the iteration results to data flows to obtain a list of filenames, and then picks up all duplicates from the list and shows them on the Run console, as a preparation step before merging the two folders, for example.

Dropping and linking the components

  1. From the Palette, drop two tFileList components, two tIterateToFlow components, two tFileOutputDelimited components, a tFileInputDelimited component, a tUniqRow component, and a tLogRow component onto the design workspace.

  2. Link the first tFileList component to the first tIterateToFlow component using a Row > Iterate connection, and then connect the first tIterateToFlow component to the first tFileOutputDelimited component using a Row > Main connection to form the first subjob.

  3. Link the second tFileList component to the second tIterateToFlow component using a Row > Iterate connection, and then connect the second tIterateToFlow component to the second tFileOutputDelimited component using a Row > Main connection to form the second subjob.

  4. Link the tFileInputDelimited to the tUniqRow component using a Row > Main connection, and the tUniqRow component to the tLogRow component using a Row > Duplicates connection to form the third subjob.

  5. Link the three subjobs using Trigger > On Subjob Ok connections so that they will be triggered one after another, and label the components to better identify their roles in the Job.
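
The logic this Job implements, collecting the file names of two folders and keeping those that occur more than once, can be sketched in plain Java as follows. The sketch skips the intermediate text file that the Job writes, and its class and method names are hypothetical.

```java
import java.io.IOException;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;
import java.util.TreeSet;

class DuplicateNamesSketch {
    // Returns, in alphabetical order, the file names present in both folders.
    static List<String> duplicateNames(Path first, Path second) throws IOException {
        Set<String> seen = new HashSet<>();
        Set<String> duplicates = new TreeSet<>();
        for (Path dir : Arrays.asList(first, second)) {
            try (DirectoryStream<Path> stream = Files.newDirectoryStream(dir)) {
                for (Path p : stream) {
                    if (!Files.isRegularFile(p)) {
                        continue; // only files are compared
                    }
                    String name = p.getFileName().toString();
                    // add() returns false when the name was already seen.
                    // Names are unique within one folder, so a repeat means
                    // the name exists in both folders.
                    if (!seen.add(name)) {
                        duplicates.add(name);
                    }
                }
            }
        }
        return new ArrayList<>(duplicates);
    }
}
```

The set of seen names plays the role of the uniqueness check, and the returned list corresponds to the rows sent through the Duplicates connection.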

Configuring the components

  1. In the Basic settings view of the first tFileList component, fill the Directory field with the path to the first folder you want to read filenames from, E:/DataFiles/DI/images in this example, and leave the other settings as they are.

  2. Double-click the first tIterateToFlow component to show its Basic settings view.

  3. Click the [...] button next to Edit schema to open the [Schema] dialog box and define the schema of the text file the next component will write filenames to. When done, click OK to close the dialog box and propagate the schema to the next component.

    In this example, the schema contains only one column: Filename.