Scenario: Checking the data in multiple columns against patterns - 6.1

Talend Components Reference Guide

EnrichVersion
6.1
EnrichProdName
Talend Big Data
Talend Big Data Platform
Talend Data Fabric
Talend Data Integration
Talend Data Management Platform
Talend Data Services Platform
Talend ESB
Talend MDM Platform
Talend Open Studio for Big Data
Talend Open Studio for Data Integration
Talend Open Studio for Data Quality
Talend Open Studio for ESB
Talend Open Studio for MDM
Talend Real-Time Big Data Platform
task
Data Governance
Data Quality and Preparation
Design and Development
EnrichPlatform
Talend Studio

This scenario describes a four-component Job that checks customers' last and first names and email against the relevant patterns. It lists data that matches the selected patterns and data that does not.

The check results are written in two output files: the first for the values that match the selected patterns and the second for the values that do not match the selected patterns. Rejected data has a message to tell what pattern was not validated.

Setting up the Job

In this scenario, we have already stored the main input schema in the Repository. For more information about storing schema metadata in the Repository, see Talend Studio User Guide.

The main input table contains three columns: lname, fname and email. We want to check the entries in these columns against patterns.

  1. In the Repository tree view, expand Metadata - DB Connections where you have stored the main input schema and drop the relevant file onto the design workspace.

    The [Components] dialog box is displayed.

  2. Select the tMysqlInput component, and click OK to drop it onto the workspace.

    The input table used in this scenario is called customer. It holds several columns including the three columns against which we want to do a pattern check.

  3. Drop the following components from the Palette onto the design workspace: tMultiPatternCheck and two tLogRow.

  4. Connect the main input component to tMultiPatternCheck using a Main > Row link.

  5. Connect tMultiPatternCheck to the two tLogrow components using the Matches, and Non Matches links.

Configuring the components

  1. Double-click tMultiPatternCheck to display its Basic settings view and define its properties.

  2. Click Edit schema to open a dialog box. Here you can define the data you want to pass to the output components, and then click OK to close the dialog box.

    In this example we want to pass to the tMultiPatternCheck component all the columns in the main input columns.

  3. Click in the Check Pattern column and select from the list the patterns against which you want to check the data in the columns.

    In this example, you want to check if customer first and last names start with upper case and if emails are valid addresses.

  4. Select from the pattern list the Starts with uppercase pattern for the first and last names and the Email Address for the customer email.

    The patterns in this list are retrieved from the DQ Repository of your studio. The list includes the system and user-defined patterns.

  5. In the Is Case Sensitive column, select the check boxes next to the column name where you want to consider, when doing the pattern check, the lower and upper cases.

  6. In the Check column, select the check boxes next to the column names you want to check against the defined patterns, all columns in this example.

  7. Leave the Message column empty if you want to have the automatic message about what pattern is not validated. Otherwise, set your own message.

  8. Double-click the first tLogRow component to display its Basic settings view and define its properties.

  9. In the Mode area, select the Table option to print results in a table.

    Do the same for the second tLogRow component.

Executing the Job

  • Save your Job and press F6 to execute it.

    Two output tables are written on the console. The first table lists the data entries in the three defined columns that match the selected patterns. The second table lists non match entries in the three columns according to the used patterns.

    The REGEX_INVALIDITY_MESSAGE column in the second table provides the name of the patterns that were not validated and because of which the rows were rejected.

    The figure below illustrates extractions of the two output tables.