tFindRegexlibExpressions - 6.3

Talend Components Reference Guide

Warning

This component is available in the Palette of the Studio only if you have subscribed to one of the Talend Platform products.

Function

tFindRegexlibExpressions connects to a web service at http://regexlib.com to get a list of regular expressions for all languages, even those that are not supported by Talend.

Purpose

tFindRegexlibExpressions returns a data set holding information about all of the regular expressions that match the request sent to the web server. You can then keep this information, for example in a csv file.

tFindRegexlibExpressions

Component family

Data Quality

 

Basic settings

Schema and Edit Schema

These fields are read-only. The schema of this component contains the following fields: Title, Expression, Description, Matches, Non-Matches, Author, Rating.

 

Regexp Substring

Define a regular expression substring you want to use as a filter on the regular expression list.

 

Key Words

Enter the key word(s) you want to use as a filter on the regular expression list. Key words are separated by commas.

 

Min Rate

Define the minimum rating a regular expression must have to be kept in the regular expression list.

 

Relative path

Type in the relative path pointing to the pattern folder you need to create under the Patterns > Regex node in the DQ Repository tree view for keeping the retrieved patterns. For example, if you need to create a folder called phone with a sub-folder uk for the phone patterns used in the U.K., type "phone/uk" in this Relative path field.

To actually create the pattern folder in the DQ Repository, you must import into it the retrieved regular expressions that have been stored in a .csv file. For further information about how to import regular expressions from a .csv file, see the Talend Studio User Guide.

Advanced settings

tStatCatcher Statistics

Select this check box to collect log data at the component level.

Global Variables

ERROR_MESSAGE: the error message generated by the component when an error occurs. This is an After variable and it returns a string. This variable functions only if the Die on error check box is cleared, if the component has this check box.

A Flow variable functions during the execution of a component while an After variable functions after the execution of the component.

To fill in a field or expression with a variable, press Ctrl + Space to access the variable list and choose the variable to use from it.

For further information about variables, see Talend Studio User Guide.
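As an illustration of how an After variable such as ERROR_MESSAGE is typically read in Java code inside a Job, the sketch below looks the value up in a map under a key built from the component's unique name. The instance name tFindRegexlibExpressions_1, the helper errorMessageOf and the local HashMap standing in for the Job's globalMap are assumptions made for this example, not part of the component's documentation.

```java
import java.util.HashMap;
import java.util.Map;

// Sketch only: reads an After variable from a Talend-style globalMap.
final class ErrorMessageSketch {

    // Hypothetical helper: returns the recorded error message,
    // or an empty string when the component reported no error.
    static String errorMessageOf(Map<String, Object> globalMap, String componentName) {
        Object value = globalMap.get(componentName + "_ERROR_MESSAGE");
        return value == null ? "" : (String) value;
    }

    public static void main(String[] args) {
        // Local stand-in for the globalMap that a generated Job provides.
        Map<String, Object> globalMap = new HashMap<>();
        // Made-up message, stored under the assumed instance name.
        globalMap.put("tFindRegexlibExpressions_1_ERROR_MESSAGE", "Connection timed out");

        System.out.println(errorMessageOf(globalMap, "tFindRegexlibExpressions_1"));
    }
}
```

Because ERROR_MESSAGE is an After variable, a lookup like this is only meaningful in a component or subjob that runs after tFindRegexlibExpressions has finished.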

Usage

This component is a start component. It requires an output flow, usually a csv file. You can later import all collected expressions from a well-formatted csv file into Talend Studio.

For more information about importing patterns, see Talend Studio User Guide.

Limitation

n/a

Scenario: Connecting to a web service and returning a list of regular expressions

This scenario is a three-component Java Job created in Talend Studio.

This scenario:

  • uses the tFindRegexlibExpressions component to connect to a web server and collect all regular expressions that have the word "email" in their description field,

  • uses the tMap component to reorganize the incoming data in the output flow and also to concatenate two fields from the incoming data flow into one output column,

  • and finally writes all collected expressions to a csv file.

This Job can also be generated automatically from the Patterns > Regex node in the DQ Repository tree view. For further information about how to generate a Job to retrieve regular expressions, see the Talend Studio User Guide.

Configuring the tFindRegexlibExpressions component

  1. Drop the following components from the Palette onto the design workspace: tFindRegexlibExpressions, tMap, and tFileOutputDelimited.

  2. Double-click the tFindRegexlibExpressions component to open its Basic settings view and define its properties.

    The schema of this component is read-only and it contains the following fields: Title, Expression, Description, Matches, Non-Matches, Author, Rating and Relative_path.

  3. In the Regexp Substring field, define a regular expression substring you want to use as a filter on the regular expression list.

  4. In the Key Words field, define the key word(s) you want to use as a filter on the regular expression list.

  5. In the Min Rate field, define the minimum rating a regular expression must have to be kept in the regular expression list.

  6. In the Relative path field, type in the relative path pointing to the folder to be created in the Patterns > Regex node of the DQ Repository tree view for the retrieved patterns. In this example, this folder is email.

    In this scenario, we want tFindRegexlibExpressions to collect all regular expressions on the web server that have the word "email" in their Description field and whose rating is at least 1.

  7. Connect tFindRegexlibExpressions and tMap using a Main row link.
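The effect of the Key Words and Min Rate settings used in this scenario can be approximated locally as follows. This is a minimal sketch and not the component's implementation: the actual filtering is done through the web service, and the case-insensitive substring match, the integer rating, the Entry class and the sample rows are all assumptions made for illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Locale;

// Sketch only: local approximation of the "email" / rating >= 1 filter.
final class RegexFilterSketch {

    // Hypothetical row type holding a subset of the schema columns.
    static final class Entry {
        final String title;
        final String description;
        final int rating;

        Entry(String title, String description, int rating) {
            this.title = title;
            this.description = description;
            this.rating = rating;
        }
    }

    // Keeps entries whose description mentions the key word
    // and whose rating is at least minRate.
    static List<Entry> filter(List<Entry> entries, String keyWord, int minRate) {
        String needle = keyWord.toLowerCase(Locale.ROOT);
        List<Entry> kept = new ArrayList<>();
        for (Entry e : entries) {
            boolean mentionsKeyWord =
                    e.description.toLowerCase(Locale.ROOT).contains(needle);
            if (mentionsKeyWord && e.rating >= minRate) {
                kept.add(e);
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        // Made-up sample rows, not data returned by the web service.
        List<Entry> sample = Arrays.asList(
                new Entry("Simple email", "Validates an Email address", 3),
                new Entry("Unrated email", "Another email pattern", 0),
                new Entry("US phone", "Matches US phone numbers", 4));

        for (Entry e : filter(sample, "email", 1)) {
            System.out.println(e.title);
        }
    }
}
```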

Configuring the tMap component

  1. Double-click the tMap component to open the Map Editor and perform the necessary field reorganization and concatenation.

  2. In the Map Editor, click the plus button in the upper-right corner to open a dialog box where you can give a name to the new output table, regex in this scenario.

    This creates a new link with the same name in the tMap component, which you can use to connect tMap to the next component.

  3. In the lower-right corner of the Map Editor, click the plus button to define the fields in the regex output table.

  4. In the upper half of the Map Editor, drop fields from the input table to fill the fields of the output schema as necessary. For more information regarding data mapping, see Talend Studio User Guide.

    In this scenario, we want to concatenate the Matches and Non-Matches fields from the incoming data flow into one output column: Purpose. We also want a new column in the output schema called Path. Finally, we do not want any rating-related information in the output schema.

  5. Click OK to validate and close the Map Editor.

  6. Right-click tMap and select the regex link to connect tMap to tFileOutputDelimited.
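The mapping described in the steps above can be sketched in plain Java as follows. This is only an illustration of the transformation, not the code that tMap generates: the class and field names, the text used to join Matches and Non-Matches, the integer rating and the choice to fill Path from the relative path value are assumptions.

```java
// Sketch only: merges Matches and Non-Matches into Purpose,
// adds a Path column and drops the rating.
final class RegexMapSketch {

    // Input row: columns of the tFindRegexlibExpressions schema.
    static final class InputRow {
        String title, expression, description, matches, nonMatches, author;
        int rating; // assumed type
    }

    // Output row of the "regex" table; the exact column set is an assumption.
    static final class RegexRow {
        String title, expression, description, purpose, author, path;
    }

    static RegexRow map(InputRow in, String relativePath) {
        RegexRow out = new RegexRow();
        out.title = in.title;
        out.expression = in.expression;
        out.description = in.description;
        out.author = in.author;
        // Concatenate the two input fields into one output column.
        out.purpose = "Matches: " + in.matches + " / Non-matches: " + in.nonMatches;
        // New column; the rating is deliberately not copied.
        out.path = relativePath;
        return out;
    }

    public static void main(String[] args) {
        // Made-up sample row.
        InputRow in = new InputRow();
        in.title = "Simple email";
        in.matches = "joe@example.com";
        in.nonMatches = "joe.example.com";
        in.rating = 3;

        RegexRow out = map(in, "email");
        System.out.println(out.purpose);
        System.out.println(out.path);
    }
}
```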

Configuring the output component

  1. Double-click tFileOutputDelimited to display its Basic settings and define its properties.

  2. Click the three-dot button next to the File Name field to browse to the file where you want to write the output data.

  3. Define the row and field separators in the corresponding fields.

  4. Select the Append check box if you want to add the new rows at the end of the records.

  5. Select the Include Header check box to include column headers in the output data.

  6. If needed, click Edit schema to view the input and output data flows.
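To make the roles of the separators and of the Append and Include Header options concrete, here is a minimal Java sketch that writes rows to a delimited file. It is not the code generated for tFileOutputDelimited: the method signature is hypothetical, fields are not escaped, and skipping the header when appending to a non-empty file is a choice made for this sketch only.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;
import java.util.Arrays;
import java.util.List;

// Sketch only: writes rows to a delimited text file.
final class DelimitedOutputSketch {

    static void write(Path file, List<String> header, List<List<String>> rows,
                      String fieldSeparator, String rowSeparator,
                      boolean append, boolean includeHeader) throws IOException {
        boolean hasContent = Files.exists(file) && Files.size(file) > 0;
        StringBuilder out = new StringBuilder();

        // Sketch choice: do not repeat the header when appending to existing data.
        if (includeHeader && !(append && hasContent)) {
            out.append(String.join(fieldSeparator, header)).append(rowSeparator);
        }
        for (List<String> row : rows) {
            // No quoting or escaping: fields must not contain the separators.
            out.append(String.join(fieldSeparator, row)).append(rowSeparator);
        }

        byte[] bytes = out.toString().getBytes(StandardCharsets.UTF_8);
        if (append) {
            Files.write(file, bytes, StandardOpenOption.CREATE, StandardOpenOption.APPEND);
        } else {
            Files.write(file, bytes, StandardOpenOption.CREATE,
                    StandardOpenOption.TRUNCATE_EXISTING, StandardOpenOption.WRITE);
        }
    }

    public static void main(String[] args) throws IOException {
        // Temporary file and made-up row, for demonstration only.
        Path file = Files.createTempFile("regex", ".csv");
        List<String> header = Arrays.asList("Title", "Purpose", "Path");
        List<List<String>> rows = Arrays.asList(
                Arrays.asList("Simple email", "Matches: joe@example.com", "email"));

        write(file, header, rows, ";", "\n", false, true);
        System.out.print(new String(Files.readAllBytes(file), StandardCharsets.UTF_8));
    }
}
```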

Job execution

Save your Job and press F6 to execute it.

tFindRegexlibExpressions connects to the web server and collects all regular expressions that match the request. tMap performs the defined field reorganization and concatenation and passes the output flow to tFileOutputDelimited. The output file will look something like the following:

You can later import all collected regular expressions from a well-formatted csv file into Talend Studio. For more information about importing patterns, see the Talend Studio User Guide.