tVerifyEmail - 6.1

Talend Components Reference Guide

EnrichVersion
6.1
EnrichProdName
Talend Big Data
Talend Big Data Platform
Talend Data Fabric
Talend Data Integration
Talend Data Management Platform
Talend Data Services Platform
Talend ESB
Talend MDM Platform
Talend Open Studio for Big Data
Talend Open Studio for Data Integration
Talend Open Studio for Data Quality
Talend Open Studio for ESB
Talend Open Studio for MDM
Talend Real-Time Big Data Platform
task
Data Governance
Data Quality and Preparation
Design and Development
EnrichPlatform
Talend Studio

Warning

This component will be available in the Palette of Talend Studio on the condition that you have subscribed to one of the Talend Platform products.

Function

tVerifyEmail verifies and formats email addresses against patterns and regular expressions.

Purpose

tVerifyEmail enables you to verify if email addresses comply with specific rules and correct addresses that do not match the rules by using the content from specific columns.

Simplified pattern syntax for tVerifyEmail

tVerifyEmail enables you to check the local part of email addresses against a simplified pattern.

The following table lists the simplified pattern syntax elements.

Syntax       Equivalent regex   Description

9            [0-9]              A digit
a            [a-z]              A lowercase ASCII letter
A            [A-Z]              An uppercase ASCII letter
w            [a-z]+             One or more lowercase ASCII letters
W            [A-Z]+             One or more uppercase ASCII letters
?            .                  Any character
*            .*                 Any string
.            \.                 The period symbol
[-_+]        [-_+]              Any of the symbols found between square brackets
<pattern>    pattern            Any standard regular expression placed between angle brackets
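The mapping in the syntax table can be illustrated with a small translator. This is a minimal sketch, not the component's actual implementation: the function name simplified_to_regex is hypothetical, characters that are not listed in the table are assumed to be matched literally, and a regular expression between angle brackets is assumed not to contain a > character itself.

```python
import re

# Single-character tokens, taken from the syntax table.
_TOKENS = {
    "9": "[0-9]",
    "a": "[a-z]",
    "A": "[A-Z]",
    "w": "[a-z]+",
    "W": "[A-Z]+",
    "?": ".",
    "*": ".*",
    ".": r"\.",
}


def simplified_to_regex(pattern: str) -> str:
    """Translate a simplified pattern into an equivalent regular expression."""
    out = []
    i = 0
    while i < len(pattern):
        ch = pattern[i]
        if ch == "[":
            # Square brackets are copied as they are, brackets included.
            # str.index raises ValueError if the bracket is never closed.
            end = pattern.index("]", i)
            out.append(pattern[i:end + 1])
            i = end + 1
        elif ch == "<":
            # Angle brackets wrap a standard regex; only the brackets are dropped.
            end = pattern.index(">", i)
            out.append(pattern[i + 1:end])
            i = end + 1
        elif ch in _TOKENS:
            out.append(_TOKENS[ch])
            i += 1
        else:
            # Assumption: any other character stands for itself.
            out.append(re.escape(ch))
            i += 1
    return "".join(out)
```

For example, simplified_to_regex("a.w") returns [a-z]\.[a-z]+, which matches a local part such as j.smith.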

tVerifyEmail properties

Component family

Data Quality

Basic settings

Schema

A schema is a row description. It defines the number of fields (columns) to be processed and passed on to the next component. The schema is either Built-In or stored remotely in the Repository.

Since version 5.6, both the Built-In mode and the Repository mode are available in any of the Talend solutions.

Built-In: You create and store the schema locally for this component only. Related topic: see Talend Studio User Guide.

Repository: You have already created the schema and stored it in the Repository. You can reuse it in various projects and Job designs. Related topic: see Talend Studio User Guide.

Edit Schema

Click the [...] button and define the input and output schema of the email addresses.

The output schema of tVerifyEmail has different read-only columns depending on the options you select in the component Basic settings view. Read-only output columns include:

VerificationLevel: provides you with the verification status of the processed email addresses, as follows:

-VALID: means that the email address complies with the defined rule.

-INVALID: means that the email address does not comply with the defined rule.

-CORRECTED: means that the input email does not comply with the defined rule and has been corrected by using the content of the selected columns. This status is available only when you select the Use column content option in the LOCAL Part Options section.

-VERIFIED: means that the email address exists at the domain. This status is available only when you select the Check with mail server callback option.

-REJECTED: means that the email address does not exist at the domain. This status is available only when you select the Check with mail server callback option.

Suggested_Email: provides you with suggested content for the part of the email address before the @ sign. The suggested string is built from the columns you select in the Use column content view.

Column to validate

Select from the list the column you want to validate with tVerifyEmail.

Check the entire email with regular expression

Select this check box if you want to match the complete email address against a specific regular expression.

Complete regular expression: enter the regular expression against which you want to match email addresses.

This match is done as a first step to optimize the matching process and exclude addresses that have problems before going any further to match the local and domain parts of email addresses.
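The idea of this first step can be sketched as a cheap filter that runs before any finer check. The expression and the function names below are hypothetical and only illustrate the principle. Whether the component requires the expression to match the whole address or only part of it is not stated here, so the sketch assumes a full match.

```python
import re

# Hypothetical, deliberately loose expression: something, an @ sign,
# then something containing a dot. It is not the component's default.
COMPLETE_REGEX = r"[^@\s]+@[^@\s]+\.[^@\s]+"


def passes_precheck(email: str, complete_regex: str = COMPLETE_REGEX) -> bool:
    """Return True if the whole address matches the expression."""
    return re.fullmatch(complete_regex, email) is not None


def filter_candidates(emails):
    """Keep only the addresses worth passing on to the local and domain checks."""
    return [email for email in emails if passes_precheck(email)]
```

Addresses that fail this filter never reach the more detailed local part and domain part rules.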

 

LOCAL Part Options

Fields in this section will vary according to what option you select. "LOCAL part" in an email address refers to the string before the @ sign.

-Use regular expression: enter in the Pattern field the expression against which you want to check the local part of the email address.

-Use simplified pattern: enter in the Pattern field the simplified pattern against which you want to check the local part of the email address. Select the Show syntax of simplified pattern option to display the syntax to use for simplified patterns. For more information about the syntax, see Simplified pattern syntax for tVerifyEmail.

-Use column content: use the fields in this view to decide the content against which you want to check the local part of the email. If the local part does not match what you have defined, it will be rewritten by using the content of the fields.

-Enable case-sensitive pattern matching: select this check box to enable a case sensitive pattern matching of the local part of email addresses. You can use case sensitive pattern matching with each of the above options.
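As an illustration of the regular expression option combined with case sensitivity, the sketch below extracts the string before the @ sign and matches it against a pattern. The function names are hypothetical, and the sketch assumes that matching ignores case when the case-sensitive option is not enabled.

```python
import re


def local_part(email: str) -> str:
    """Return the string before the @ sign (the last one, if there are several)."""
    return email.rpartition("@")[0]


def local_part_matches(email: str, pattern: str, case_sensitive: bool = False) -> bool:
    """Check the local part of an address against a regular expression."""
    # Assumption: without the case-sensitive option, letter case is ignored.
    flags = 0 if case_sensitive else re.IGNORECASE
    return re.fullmatch(pattern, local_part(email), flags) is not None
```

With the pattern [a-z]\.[a-z]+, the address J.Smith@example.com matches when case is ignored and fails when case_sensitive is set to True.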

 

DOMAIN Part Options

Fields in this view will vary according to what option you select.

-Check the Top-level Domains and the following ones: select this check box to verify the part of the email address which follows the last dot. You can use the Additional Top-level Domains table to add additional top-level domains against which you want to validate email addresses.

-Check domains with a black list: select this option to check email domains against the domains you have listed as black listed in the Domain list table.

-Check domains with a white list: select this option to check email domains against the domains you have listed as white listed in the Domain list table.
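The three domain options can be pictured with the following sketch. The default top-level domain set, the function names and the way the lists are combined are all assumptions made for illustration. The component ships its own list of top-level domains, which is not reproduced here.

```python
# Hypothetical sample of default top-level domains, for illustration only.
DEFAULT_TLDS = {"com", "org", "net"}


def domain_part(email: str) -> str:
    """Return the string after the @ sign, in lower case."""
    return email.rpartition("@")[2].lower()


def tld_is_known(email: str, additional_tlds=()) -> bool:
    """Check the part of the address that follows the last dot."""
    tld = domain_part(email).rpartition(".")[2]
    return tld in DEFAULT_TLDS | {t.lower() for t in additional_tlds}


def domain_is_allowed(email: str, black_list=(), white_list=None) -> bool:
    """Apply a black list and, if one is given, a white list to the domain."""
    domain = domain_part(email)
    if domain in {d.lower() for d in black_list}:
        return False
    if white_list is not None:
        return domain in {d.lower() for d in white_list}
    return True
```

For instance, tld_is_known("a@example.talend", ["talend"]) accepts a top-level domain that is only present in the additional table.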

 

Check with mail server callback

Select this check box to enable the verification of email addresses by the SMTP server.

With this technique, the mail server verifies the complete address (the parts before and after the @ sign). The check establishes an SMTP connection to the mail exchanger (MX) of the email address, then queries the exchanger to make sure that it accepts the address as valid. This is done in the same way as sending an email to the address, except that the process stops once the mail exchanger accepts or rejects the address.

It is not advisable to enable the SMTP verification when you have a lot of email addresses with different domains to check as some mail servers may not reply correctly and even black list your IP address.

The following is a list of cases when the SMTP verification will not work properly:

- When the mail server requires authentication,

- When the mail server has a security policy that may put your IP address on a black list and reject your queries,

- When the mail server is taking too long to reply (time out),

- Any other unexpected exception generated by the mail server.

In all these cases, the component results will only take into account the results from the other rules you set in the component settings.
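The callback technique can be sketched with Python's standard smtplib module. This is an illustration under stated assumptions, not the component's implementation: the function names and the sender address are hypothetical, the mail exchanger host is passed in because an MX lookup is not part of the standard library, and reading every permanent 5xx reply to RCPT as a rejection is a simplification. An inconclusive outcome is returned as None, meaning that only the other rules should decide.

```python
import smtplib


def classify_reply(code: int):
    """Map the SMTP reply code of the RCPT command to a status, or None."""
    if code in (250, 251):
        return "VERIFIED"
    if 500 <= code < 600:
        # Simplification: a permanent failure is read as "address rejected".
        return "REJECTED"
    return None  # temporary failure or unexpected reply: inconclusive


def smtp_callback(address: str, mx_host: str,
                  sender: str = "verify@example.com", timeout: float = 10.0):
    """Ask the mail exchanger whether it accepts the address; send no message."""
    try:
        with smtplib.SMTP(mx_host, 25, timeout=timeout) as server:
            server.helo()
            mail_code, _ = server.mail(sender)
            if mail_code != 250:
                # For example, the server requires authentication.
                return None
            code, _ = server.rcpt(address)
        # Leaving the with block ends the session before any message data is sent.
        return classify_reply(code)
    except (smtplib.SMTPException, OSError):
        # Time out, refused connection or any other unexpected error.
        return None
```

Returning None in the failure cases mirrors the behavior described above, where the result then depends only on the other rules.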

Advanced settings

tStatCatcher Statistics

Select this check box to collect log data at the component level.

Global Variables

ERROR_MESSAGE: the error message generated by the component when an error occurs. This is an After variable and it returns a string. This variable functions only if the Die on error check box is cleared, if the component has this check box.

A Flow variable functions during the execution of a component while an After variable functions after the execution of the component.

To fill up a field or expression with a variable, press Ctrl + Space to access the variable list and choose the variable to use from it.

For further information about variables, see Talend Studio User Guide.

Usage

This component is an intermediary step. It requires an input flow and an output flow.

Limitation

n/a

Scenario: Verify email addresses against column content and domain names

This scenario describes a Job which uses:

  • the tFixedFlowInput component to generate the email addresses to be analyzed,

  • the tVerifyEmail component to format the email addresses through Talend email API,

  • the tFileOutputExcel component to output the formatted addresses in an .xls file.

Setting up the Job

  1. Drop the following components from the Palette onto the design workspace: tFixedFlowInput, tVerifyEmail and tFileOutputExcel.

  2. Connect the three components together using the Main links.

Configuring the input component

  1. Double-click tFixedFlowInput to open its Basic settings view in the Component tab.

  2. Create the schema through the Edit Schema button.

    In the dialog box that opens, click the [+] button and add the columns that will hold input address data. For this example, add firstname, lastname and email.

  3. Click OK.

  4. In the Number of rows field, enter 1.

  5. In the Mode area, select the Use Inline Table option.

  6. In the Inline table, use the [+] button to add lines to the table and then enter the address data you want to analyze.

Verifying and formatting email addresses

  1. Double-click tVerifyEmail to display the Basic settings view and define the component properties.

  2. If required, click Sync columns to retrieve the schema defined in the input component.

  3. Click the Edit schema button to open the schema dialog box.

    tVerifyEmail proposes predefined read-only address columns.

    The VerificationLevel column returns the verification status of input email addresses. The SuggestedEmail column returns suggested content for the part of the email address before the @ sign. This column is shown in the output schema only if you select the Use column content option in the LOCAL Part Options section. For further information about output columns, see tVerifyEmail properties.

  4. Move any of the input columns to the output schema if you want to show them in the verification results, then click OK and accept the prompt to propagate the changes.

  5. From the Column to validate list, select the email column.

  6. In the LOCAL Part Options section, select the Use column content option.

    In this example, you want to check the email part before the @ sign to see if it starts with the first letter of the first name followed by the family name, all in lower case. If the local part does not match what you have defined, tVerifyEmail will rewrite it by using the parameters you define.

  7. In the DOMAIN Part Options, select:

    • the Check the default Top-level Domains and the following ones check box and define in the table the additional top-level domain against which you want to validate email addresses.

    • the Check domains with a black list check box and define in the Domain list table the domain to consider as black listed.

  8. Select the Check with mail server callback check box to enable the mail server to verify the complete address and accept or reject the email.
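The local part rule chosen in step 6 (first letter of the first name followed by the family name, all in lower case) can be sketched as follows. The function names are hypothetical, and the comparison is assumed to be case-sensitive. The sketch covers only the local part rule and leaves out the domain and mail server checks.

```python
def suggested_local_part(firstname: str, lastname: str) -> str:
    """Build the expected local part: first initial plus family name, lower case."""
    return (firstname[:1] + lastname).lower()


def verify_local_part(email: str, firstname: str, lastname: str):
    """Return a (status, suggestion) pair for the local part rule only."""
    local = email.rpartition("@")[0]
    expected = suggested_local_part(firstname, lastname)
    if local == expected:
        return "VALID", None
    # The local part does not follow the rule: propose the rebuilt one.
    return "CORRECTED", expected
```

With the hypothetical row John, Smith, john.smith@example.com, the sketch returns CORRECTED with the suggestion jsmith.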

Configuring the output component and executing the Job

  1. Double-click the tFileOutputExcel component to display the Basic settings view and define the component properties.

  2. Set the destination file name as well as the sheet name and then select the Define all columns auto size check box.

  3. Save your Job and press F6 to execute it.

    The tVerifyEmail component analyzes email addresses and corrects those that do not match what you have defined in the local and domain part options.

  4. Right-click the output component and select Data Viewer to display the formatted email addresses.

    tVerifyEmail matches input addresses against the rule you set in the LOCAL part options section and the parameters you set for the domain names.

    The VerificationLevel output column returns the status as VALID, INVALID, CORRECTED or REJECTED, according to the options you selected in the tVerifyEmail basic settings.

    All email addresses labeled as CORRECTED have a suggested address in the SuggestedEmail output column.