How to read an email in a Talend Batch Job

author
Rekha Sree
EnrichVersion
6.4
6.3
EnrichProdName
Talend Data Fabric
Talend Open Studio for MDM
Talend Open Studio for Data Quality
Talend Big Data Platform
Talend Real-Time Big Data Platform
Talend MDM Platform
Talend Data Services Platform
Talend Data Management Platform
task
Design and Development
EnrichPlatform
Talend Studio

This article explains how to read emails from Talend data integration jobs using the tPOP and tFileInputMail components. The tPOP component fetches one or more email messages from a server using the POP3 or IMAP protocol whereas tFileInputMail reads the header and content parts of a defined MIME or MSG email file.

Difference between POP3 and IMAP

POP3 and IMAP are the two most widely used protocols for reading email messages from mail servers. Both are implemented by common email clients such as Microsoft Outlook.

POP (Post Office Protocol) is used to download messages from the mail server. It is a one-way operation in which the email messages are downloaded into the email client on the client computer or device. This operation is often followed by the deletion of the emails from the mail server. POP is therefore suited to reading messages on a single device. It does not provide synchronisation capabilities with the email server.

IMAP (Internet Message Access Protocol) is a protocol that synchronises the email messages between the server and all the devices connected to the email account. Changes made to email messages on one device are synchronised with the email server and all other devices. For example, if an email is marked as read in one email client, this change is reflected on the server and on all other computers or devices connected to that email account.

With the recent increase in the number of devices people use, IMAP is generally considered the better protocol because it synchronises changes across multiple devices.

As a best practice, it is recommended that developers use the IMAP protocol when reading email messages from a Talend data integration job, since all the operations on the emails are synchronised with the email server.

Environment

The example below is created using:

  • Talend Data Fabric 6.3.1
  • Oracle JDK Build 1.8
  • Windows 7
  • Valid Hotmail account

Designing the Data Integration Job

In this example, we are going to design a data integration job to do the following:

1. Read emails from an email server and store them in a local directory

2. Read the email files from the local directory

3. Extract standard key data from the email files

4. Print the extracted data to the console window

To implement this example, we will need the email server details like:

  • Hostname (example: imap-mail.outlook.com, imap.gmail.com, etc)
  • Port (example: 995, 993, etc)
  • Secure Connection type used by the server (SSL or TLS) to retrieve emails from server.
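
The server details listed above can be captured in a small value object. The following is a minimal sketch in plain Java, not Talend-generated code: the class MailServerSettings and its method defaultPort are hypothetical names introduced here for illustration. The port numbers are the standard ones for each protocol (110 and 995 for POP3, 143 and 993 for IMAP, without and with SSL respectively); your mail provider may use different values.

```java
import java.util.Locale;

// Hypothetical helper, not part of Talend: groups the details that tPOP asks for.
class MailServerSettings {
    final String host;
    final String protocol; // "pop3" or "imap"
    final boolean ssl;
    final int port;

    MailServerSettings(String host, String protocol, boolean ssl) {
        this.host = host;
        this.protocol = protocol.toLowerCase(Locale.ROOT);
        this.ssl = ssl;
        this.port = defaultPort(this.protocol, ssl);
    }

    // Standard ports for each protocol, without and with SSL.
    static int defaultPort(String protocol, boolean ssl) {
        switch (protocol.toLowerCase(Locale.ROOT)) {
            case "pop3":
                return ssl ? 995 : 110;
            case "imap":
                return ssl ? 993 : 143;
            default:
                throw new IllegalArgumentException("Unsupported protocol: " + protocol);
        }
    }

    public static void main(String[] args) {
        // Host name taken from the example above; IMAP over SSL.
        MailServerSettings settings =
                new MailServerSettings("imap-mail.outlook.com", "imap", true);
        System.out.println(settings.host + ":" + settings.port);
    }
}
```

Keeping the protocol, the SSL flag and the port together makes it harder to pair, for example, the IMAP protocol with a POP3 port by mistake.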

The final job, once completed, will look similar to the screenshot below. We need the tPOP, tFileList, tFileInputMail and tLogRow components to complete this job.

Configure the tPOP Component

Configure the tPOP component basic settings as shown in the screenshot below:

  • In the Host field, enter the host name or IP address of the email server you want to connect to.
  • Enter the Port number of the email server.
  • Provide the Username and Password of the account used to download the emails.
  • Specify the path to the Output directory where email messages will be downloaded and stored as files.
  • The Filename pattern option defines the file name for each email message. Each message is downloaded and stored as a separate file on disk. You can press Ctrl+Space to display the list of predefined patterns. In this example, each file name combines the current date and time with a message counter, and the files have a .mail extension. To do this, enter the expression TalendDate.getDate("yyyyMMdd-hhmmss") + "_" + (counter_tPOP_1 + 1) + ".mail" as the filename pattern.
  • Select the Retrieve all emails? check box if you want to retrieve all email messages present on the specified server. Otherwise, specify the number of emails you want to retrieve. In this example, we want to retrieve 10 emails.
  • If you want to retrieve the most recent emails first, select the Newer email first check box. This option is disabled when Retrieve all emails? is selected.
  • Select the Delete emails from server check box if you do not want to keep the retrieved email messages on the server.

  • Choose the protocol depending on your email server. If you choose the imap protocol, you can also select the folder from which you want to retrieve your emails.
  • Select the SSL Support check box to enable the component to open an SSL connection when communicating with the email server.

  • Configure the context variables for Username and Password as required. In our example, we are using the two context variables shown below.

  • Configure the tPOP component advanced settings if you want to retrieve only emails that match a specific sender, receiver, subject or date range. In this example, we are filtering emails with the subject “Talend Jobs” as shown below.

  • We can define multiple filters using the ‘And’ or ‘Or’ options.
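
The filename pattern entered in tPOP is an ordinary Java expression, so its result can be previewed outside the Studio. The sketch below is a plain-Java approximation, under two assumptions that are not stated in this article: that TalendDate.getDate formats the current date with a SimpleDateFormat-style pattern, and that counter_tPOP_1 is a zero-based message counter. The class and method names are hypothetical.

```java
import java.text.SimpleDateFormat;
import java.util.Date;

// Hypothetical stand-in for the tPOP filename pattern, for illustration only.
class MailFileNameSketch {
    // Approximates:
    // TalendDate.getDate("yyyyMMdd-hhmmss") + "_" + (counter_tPOP_1 + 1) + ".mail"
    static String fileName(Date now, int counter) {
        // Note: "hh" is the 12-hour clock; "HH" would give the 24-hour value.
        String stamp = new SimpleDateFormat("yyyyMMdd-hhmmss").format(now);
        return stamp + "_" + (counter + 1) + ".mail";
    }

    public static void main(String[] args) {
        Date now = new Date();
        // Three messages retrieved in the same second still get distinct names
        // because the counter part differs.
        for (int counter = 0; counter < 3; counter++) {
            System.out.println(fileName(now, counter));
        }
    }
}
```

Because hh uses the 12-hour clock, a message downloaded at 15:04:09 and one downloaded at 03:04:09 on the same day receive the same timestamp part. If that matters for your job, consider HH in the pattern.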

Configure the tFileList Component

The tFileList component is going to iterate over all the files created by the tPOP component so that we can process each file, one by one, in the next step.

Configure the tFileList component as follows:

  • Specify the directory from which the files are to be read. This should be the same directory where the tPOP component writes the email files.
  • If you want to read files from subfolders within the source directory, select the Include subdirectories check box.
  • If the file names are case sensitive, select Yes in the Case Sensitive option.
  • If you want the job to generate an error when the directory contains no files, select the Generate Error if no file found check box.
  • In the Files option, enter the filemask. In the example, we want all the files with the .mail extension to be read.
  • You can specify the order in which files are read. In the example, we have selected the order by modified date option.
  • Order action lets you select the sorting order of the files. In the example, we want to process the files by modified date in descending order.
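
The tFileList settings above amount to "list the .mail files in one directory, most recently modified first". The following plain-Java sketch shows an equivalent listing with java.nio.file. It is an illustration, not the code Talend generates; the class name, the method name and the errorIfNone parameter (standing in for Generate Error if no file found) are hypothetical, and subdirectories are not traversed.

```java
import java.io.IOException;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

// Hypothetical helper that mimics the tFileList configuration described above.
class MailFileLister {
    // Returns the *.mail files directly inside dir, newest modification date first.
    static List<Path> listMailFiles(Path dir, boolean errorIfNone) throws IOException {
        List<Path> files = new ArrayList<>();
        // The glob plays the role of the filemask entered in the Files option.
        try (DirectoryStream<Path> stream = Files.newDirectoryStream(dir, "*.mail")) {
            for (Path p : stream) {
                if (Files.isRegularFile(p)) {
                    files.add(p);
                }
            }
        }
        if (files.isEmpty() && errorIfNone) {
            throw new IOException("No .mail file found in " + dir);
        }
        // Order by modified date, descending.
        files.sort(Comparator.comparingLong((Path p) -> p.toFile().lastModified()).reversed());
        return files;
    }

    public static void main(String[] args) throws IOException {
        // Pass the tPOP output directory as the first argument; defaults to ".".
        Path dir = Paths.get(args.length > 0 ? args[0] : ".");
        for (Path p : listMailFiles(dir, false)) {
            System.out.println(p.toAbsolutePath());
        }
    }
}
```

Each path printed by main corresponds to the value that tFileList exposes to the next component on each iteration.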

Extract the data from the email file

Configure the tFileInputMail component basic settings as follows:

  • Specify the File Name to read and extract data from. In the example, we want to process all the files in the folder, so we use the global variable ((String)globalMap.get("tFileList_1_CURRENT_FILEPATH")), which tFileList sets to the path of the current file on each iteration.
  • Edit the schema and specify the columns for the data parts we want to extract. In this example, we are going to extract Date, Author, Objet and Status. We will map these columns to the actual email parts as shown in the screenshot below.
  • In the Attachment export directory field, enter the directory where you want to save the attachments of the emails read.
  • In the Mail parts table, specify the standard key data you want to extract from the email and how you want to map it to the schema columns you have defined.
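
To make the idea of mapping mail parts to schema columns concrete, the sketch below reads the header block of a MIME message and copies selected headers into named columns. It is a simplified plain-Java illustration, not how tFileInputMail is implemented: it handles only the headers (no body, attachments or MSG files), keeps the first occurrence of a repeated header, and all names in it are hypothetical. The column-to-header mapping (Date to Date, Author to From, Objet to Subject) is an assumption, since this article shows the actual mapping only in a screenshot; Status is left out for the same reason.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.StringReader;
import java.util.LinkedHashMap;
import java.util.Locale;
import java.util.Map;

// Hypothetical, simplified header reader for illustration only.
class MailHeaderSketch {
    // Reads everything before the first empty line into a map keyed by
    // lower-cased header name. Folded (continuation) lines are joined with a space.
    static Map<String, String> readHeaders(BufferedReader in) throws IOException {
        Map<String, String> headers = new LinkedHashMap<>();
        String current = null; // header that a continuation line belongs to
        String line;
        while ((line = in.readLine()) != null && !line.isEmpty()) {
            if (line.charAt(0) == ' ' || line.charAt(0) == '\t') {
                if (current != null) {
                    headers.put(current, headers.get(current) + " " + line.trim());
                }
                continue;
            }
            int colon = line.indexOf(':');
            String name = colon > 0
                    ? line.substring(0, colon).trim().toLowerCase(Locale.ROOT) : null;
            if (name == null || headers.containsKey(name)) {
                current = null; // malformed line or repeated header: skip it
            } else {
                headers.put(name, line.substring(colon + 1).trim());
                current = name;
            }
        }
        return headers;
    }

    // Builds one output row: schema column name -> value of the mapped header.
    static Map<String, String> mapColumns(Map<String, String> headers,
                                          Map<String, String> columnToPart) {
        Map<String, String> row = new LinkedHashMap<>();
        for (Map.Entry<String, String> e : columnToPart.entrySet()) {
            row.put(e.getKey(), headers.get(e.getValue().toLowerCase(Locale.ROOT)));
        }
        return row;
    }

    public static void main(String[] args) throws IOException {
        String sample = "Date: Mon, 6 Mar 2017 10:15:00 +0000\r\n"
                + "From: sender@example.com\r\n"
                + "Subject: Talend Jobs\r\n"
                + "\r\n"
                + "Body text.\r\n";
        Map<String, String> mapping = new LinkedHashMap<>();
        mapping.put("Date", "Date");      // assumed mapping
        mapping.put("Author", "From");    // assumed mapping
        mapping.put("Objet", "Subject");  // assumed mapping
        Map<String, String> headers = readHeaders(new BufferedReader(new StringReader(sample)));
        System.out.println(mapColumns(headers, mapping));
    }
}
```

A column whose mapped header is absent from the message ends up with a null value in the row.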

Print the results

We will use the tLogRow component to print the values of the schema from the tFileInputMail component. The tLogRow will print the values in the console output window.

Running the Job

Save and run the job. The contents read from the emails appear in the console window of Talend Studio.

You can check the directory you have specified for the attachments. If an email contains attachments, they are extracted and saved in this directory. In our example, both emails and attachments are saved to the same directory (as shown below).