Scenario: Reading email addresses from a DB table and retrieving specific data including invalid rows - 6.1

Talend Components Reference Guide

EnrichVersion
6.1
EnrichProdName
Talend Big Data
Talend Big Data Platform
Talend Data Fabric
Talend Data Integration
Talend Data Management Platform
Talend Data Services Platform
Talend ESB
Talend MDM Platform
Talend Open Studio for Big Data
Talend Open Studio for Data Integration
Talend Open Studio for Data Quality
Talend Open Studio for ESB
Talend Open Studio for MDM
Talend Real-Time Big Data Platform
task
Data Governance
Data Quality and Preparation
Design and Development
EnrichPlatform
Talend Studio

This scenario is a two-component Job created in Talend Studio. In this Job, tMySQLInvalidRows reads the email addresses for people from a specific country from a MySQL database table, filters data using a WHERE clause to narrow down the validation process, checks the email values against the given Talend Studio email pattern and finally extracts filtered data including the invalid rows and displays them on the console.

Below is the database table used in this example, some customers are from the USA and others are from Canada. The Email column contains some invalid addresses. The tMySQLInvalidRows component filters data in the Email column to read only the emails for the customers from the USA, and then validates these email addresses against the EmailAddress pattern.

In this scenario, we have already stored the schemas of the input table in the Repository. For more information about storing schema metadata in the Repository tree view, see Talend Studio User Guide.

Setting up your Job

  1. In the Repository tree view, expand in succession the Metadata and the DB Connections nodes where you have stored the input schemas and drop the relevant connection onto the design workspace.

    The [Components] dialog box displays.

  2. Select tMySQLInvalidRows from the list and click OK to close the dialog box.

    The tMySQLInvalidRows component displays on the workspace. The MySQL table used in this scenario is called customers. It holds the customers' customer_id, account_num, Iname, fname, country and Email columns.

  3. Drop tLogRow from the Palette onto the design workspace.

  4. Connect the two components together using the Main link.

Setting up the schema

  1. Double-click tMySQLInvalidRows to display its Basic settings view and define its properties.

    All database connection fields are automatically filled in. If you do not define your input files in the Repository, fill in the details manually after selecting Built-in in the Property Type field

  2. From the Validation type list, select Regex pattern validation.

    This will validate data in the selected rows against a regex pattern. For an example scenario about validating data against DQ rule, see Scenario 2: Checking customer table against a given DQ rule to select customer records.

  3. In the Table Name field, type in the name of the database table on which you want to run the Job, customer1 in this example.

  4. In the Where clause field, type in the WHERE clause that will restrict the number of the analyzed rows.

    In this example you want to examine only the emails of the customers from the USA.

  5. In the Analyzed column list, select the column you want to analyze, Email in this example.

  6. In the Patterns list, select the Talend Studio database pattern against which you want to check addresses from the Email column, Email Address in this example.

    If you select the Custom pattern check box, you can customize the regular expression against which the data is to be checked.

  7. Double click tLogRow to open its Basic settings view and define its properties as needed.

Executing the Job

  1. Save your Job.

  2. Press F6 to execute it.

    The tMySQLInvalidRows component analyzed the email addresses only for customers from the USA in the database table against the selected SQL pattern (Email Address), and then it extracted all filtered data including the invalid email addresses.