Regular Expressions - 8.0

Talend Data Preparation User Guide

Version: 8.0
Language: English
Product:
  Talend Big Data
  Talend Big Data Platform
  Talend Data Fabric
  Talend Data Integration
  Talend Data Management Platform
  Talend Data Services Platform
  Talend ESB
  Talend MDM Platform
  Talend Real-Time Big Data Platform
Module: Talend Data Preparation
Content: Data Quality and Preparation > Cleansing data
Last publication date: 2024-03-26

Regular expressions (or regex) are advanced search strings that allow you to match complex patterns.

In this documentation, the regular expression elements are classified by category.

All the examples in this documentation are applied to the following two lines of text:

Comment from happy_user@company.com (04-Apr-2016):

I love working with Talend Data Preparation! It really helps me with all my daily tasks!

Regular Expressions Examples

Regular Expression  Matches
\bTa                The "Ta" in "Talend"
\bw\w*              "working" and "with"
\w+n\b              "Preparation"
Talend\s\w+\s\w+    "Talend Data Preparation"
task(s?)            "tasks" (it would also match "task")
\w+@\w+\.com        "happy_user@company.com"
\d{2}-.*-\d+        "04-Apr-2016"
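
The examples in this table can be checked outside the product. The sketch below is only an illustration: it uses Python's re module, which may differ in edge cases from the regex engine in Talend Data Preparation, and the names TEXT and find_all are hypothetical helpers introduced here. The dot before "com" is escaped so that it matches only a literal period.

```python
import re

# The two sample lines used throughout this page, joined by a newline.
TEXT = (
    "Comment from happy_user@company.com (04-Apr-2016):\n"
    "I love working with Talend Data Preparation! "
    "It really helps me with all my daily tasks!"
)


def find_all(pattern: str, text: str = TEXT) -> list[str]:
    """Return the full text of every non-overlapping match of pattern."""
    return [m.group(0) for m in re.finditer(pattern, text)]


print(find_all(r"\bTa"))              # ['Ta']
print(find_all(r"\bw\w*"))            # ['working', 'with', 'with']
print(find_all(r"\w+n\b"))            # ['Preparation']
print(find_all(r"Talend\s\w+\s\w+"))  # ['Talend Data Preparation']
print(find_all(r"task(s?)"))          # ['tasks']
print(find_all(r"\w+@\w+\.com"))      # ['happy_user@company.com']
print(find_all(r"\d{2}-.*-\d+"))      # ['04-Apr-2016']
```

"with" is listed twice because the word occurs twice in the second sample line.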

Anchors

Character  Matches                                                    Example
^          Start of string, or start of line in a multi-line pattern  ^Comment matches "Comment" at the beginning of the line.
                                                                      ^C.* matches the first line.
$          End of string, or end of line in a multi-line pattern      !$ matches the last exclamation mark.
\b         Word boundary                                              \bwo matches the "wo" in "working".
                                                                      \bwo\w+ matches "working".
                                                                      ng\b matches the "ng" in "working".
                                                                      \w+ng\b matches "working".
\B         Not word boundary                                          \Bh matches the final "h" in "with" but not the "h" in "helps" or "happy".
                                                                      h\B matches the first "h" in "helps" and "happy" but not the final one in "with".
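
As a rough illustration of the anchors above, the following sketch runs them with Python's re module. This is an assumption made for the example, not necessarily the engine used by Talend Data Preparation. TEXT and words_hit are hypothetical names. In Python, the multi-line behavior of ^ and $ is enabled with the re.MULTILINE flag.

```python
import re

# The two sample lines used on this page, joined by a newline.
TEXT = (
    "Comment from happy_user@company.com (04-Apr-2016):\n"
    "I love working with Talend Data Preparation! "
    "It really helps me with all my daily tasks!"
)


def words_hit(pattern: str) -> list[str]:
    """List the words of TEXT (runs of word characters) that contain a match."""
    return [w for w in re.findall(r"\w+", TEXT) if re.search(pattern, w)]


# ^ anchors at the start of the string; with re.MULTILINE it also anchors
# at the start of every line.
first_line = re.search(r"^C.*", TEXT).group(0)
second_line_plain = re.search(r"^I.*", TEXT)                # None: "I" is not at the string start
second_line_multi = re.search(r"^I.*", TEXT, re.MULTILINE)  # matches the whole second line

# $ anchors at the end of the string, so !$ finds the final "!".
last_bang = re.search(r"!$", TEXT)

print(first_line)                          # Comment from happy_user@company.com (04-Apr-2016):
print(second_line_plain)                   # None
print(last_bang.start() == len(TEXT) - 1)  # True
print(re.findall(r"\bwo\w+", TEXT))        # ['working']
print(re.findall(r"\w+ng\b", TEXT))        # ['working']
print(words_hit(r"\Bh"))                   # ['with', 'with']
print(words_hit(r"h\B"))                   # ['happy_user', 'helps']
```

Because the underscore is a word character, "happy_user" counts as a single word in this sketch.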

Character Classes

Character  Matches                                        Example
.          Any character, except new line (\n)            . matches all the characters in the text, except for the line break.
\s         White space                                    Talend\sData matches "Talend Data".
                                                          Data\s+Preparation matches "Data Preparation".
\S         Not white space                                \S matches all the characters in the sentence, except for the spaces.
\d         Digit                                          \d{4} matches "2016".
\D         Not digit                                      \D matches all the characters in the text, except for the digits.
\w         Word character (letter, digit, or underscore)  T\w+ matches "Talend".
\W         Not word character                             company\Wcom matches "company.com".
\n         New line                                       .*\n.* matches the whole text.
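
The character classes above can be tried out with a short script. This is a minimal sketch that assumes Python's re module as a stand-in engine; TEXT, line1 and line2 are names introduced only for this example.

```python
import re

# The two sample lines used on this page, joined by a newline.
TEXT = (
    "Comment from happy_user@company.com (04-Apr-2016):\n"
    "I love working with Talend Data Preparation! "
    "It really helps me with all my daily tasks!"
)
line1, line2 = TEXT.split("\n")

# "." matches every character except the single newline between the lines.
print(len(re.findall(r".", TEXT)) == len(TEXT) - 1)              # True

print(re.findall(r"Talend\sData", TEXT))                         # ['Talend Data']
print(re.findall(r"Data\s+Preparation", TEXT))                   # ['Data Preparation']
print(re.findall(r"\d{4}", TEXT))                                # ['2016']
print(re.findall(r"T\w+", TEXT))                                 # ['Talend']
print(re.findall(r"company\Wcom", TEXT))                         # ['company.com']

# \S keeps everything but white space; \D keeps everything but digits.
print("".join(re.findall(r"\S", line2)) == line2.replace(" ", ""))  # True
print("".join(re.findall(r"\D", line1)))                         # Comment from happy_user@company.com (-Apr-):

# .*\n.* spans both lines because \n matches the line break explicitly.
print(re.fullmatch(r".*\n.*", TEXT) is not None)                 # True
```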

Escape Characters

Character Matches
\. .
\\ \
\+ +
\* *
\? ?
\$ $
\[ [
\] ]
\{ {
\} }
\( (
\) )
\| |
\/ /
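
To see why escaping matters, the sketch below compares an unescaped and an escaped dot, and checks that each escaped form in the table matches exactly its literal character. It assumes Python's re module for illustration; SPECIALS, sample and literal are hypothetical names. re.escape is a Python convenience that adds the backslashes for you, and other engines may offer a different mechanism.

```python
import re

# The 14 characters from the table above (the Python literal "\\" is one backslash).
SPECIALS = ".\\+*?$[]{}()|/"

# A backslash followed by the character matches that character literally.
all_literal = all(re.fullmatch("\\" + ch, ch) for ch in SPECIALS)
print(all_literal)  # True

sample = "company.com companyXcom"
loose = re.findall(r"company.com", sample)    # unescaped dot matches any character
strict = re.findall(r"company\.com", sample)  # escaped dot matches only "."
print(loose)   # ['company.com', 'companyXcom']
print(strict)  # ['company.com']

# re.escape builds a pattern that matches a piece of text literally.
literal = "(04-Apr-2016)"
found = re.search(re.escape(literal), "Comment from happy_user@company.com (04-Apr-2016):")
print(found.group(0))  # (04-Apr-2016)
```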

Groups and Ranges

Character  Matches              Example
()         Group                m(e|y) matches "me" (in "Comment"), "me" and "my".
(a|b)      a or b               m(e|y) matches "me" (in "Comment"), "me" and "my".
[abc]      Range (a or b or c)  m[ey] matches "me" (in "Comment"), "me" and "my".
[a-q]      Letter from a to q   m[a-m] matches "mm" (in "Comment") and "me" but not "my".
[0-7]      Digit from 0 to 7    201[0-5] does not match "2016" but would match all years between "2010" and "2015".
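
The group and range examples can be reproduced with the sketch below, which assumes Python's re module as an illustrative engine; TEXT and matches are hypothetical names. Note that m[a-m] finds "mm" in "Comment", because the letter "m" itself falls within the range a to m and matching proceeds from left to right.

```python
import re

# The two sample lines used on this page, joined by a newline.
TEXT = (
    "Comment from happy_user@company.com (04-Apr-2016):\n"
    "I love working with Talend Data Preparation! "
    "It really helps me with all my daily tasks!"
)


def matches(pattern: str, text: str = TEXT) -> list[str]:
    """Return the full text of every non-overlapping match of pattern."""
    return [m.group(0) for m in re.finditer(pattern, text)]


print(matches(r"m(e|y)"))    # ['me', 'me', 'my']
print(matches(r"m[ey]"))     # ['me', 'me', 'my']
print(matches(r"m[a-m]"))    # ['mm', 'me']
print(matches(r"201[0-5]"))  # []
print(matches(r"201[0-5]", "2010 2015 2016"))  # ['2010', '2015']
```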

The text captured by a group can be reused with the $ symbol. When more than one group is captured, add a number to the $ symbol that corresponds to the order in which the groups were captured: $1 for the first group, $2 for the second, and so on.

For example, suppose you want to reformat the value Y16Q02, which is matched by the regular expression Y(\d{2})Q(\d{2}). You can build the new value from only the characters you have captured. To obtain Quarter 02 of year 2016, use the replacement pattern Quarter $2 of year 20$1.
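
The same replacement can be sketched in Python as an illustration. Python's re module writes group references as \1 and \2 (or \g<1> and \g<2>) rather than $1 and $2, so the helper dollar_to_backref below, a hypothetical function written for this example, converts the $ notation used on this page before calling re.sub.

```python
import re

PATTERN = r"Y(\d{2})Q(\d{2})"  # group 1 = year digits, group 2 = quarter digits


def dollar_to_backref(template: str) -> str:
    """Rewrite $1, $2, ... as Python's \\g<1>, \\g<2>, ... group references."""
    return re.sub(r"\$(\d+)", r"\\g<\1>", template)


# Direct Python notation.
direct = re.sub(PATTERN, r"Quarter \2 of year 20\1", "Y16Q02")

# Same result starting from the $ notation used on this page.
converted = re.sub(PATTERN, dollar_to_backref("Quarter $2 of year 20$1"), "Y16Q02")

print(direct)     # Quarter 02 of year 2016
print(converted)  # Quarter 02 of year 2016
```

The \g<n> form is used in the helper because it stays unambiguous when a group reference is followed by another digit.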

Quantifiers

Character  Matches            Examples
*          0 or more          work\w* matches "working" but also "work" and "works".
+          1 or more          work\w+ matches "working" and "works" but not "work".
?          0 or 1             work(s?) matches "work" and "works". In "working", it matches only the "work" part.
{3}        Exactly 3          20\d{2} matches "2016" and any other number from "2000" to "2099".
{3,}       3 or more          20\d{2,} matches "2016" and any number that starts with "20" and has four or more digits.
{3,5}      3, 4 or 5          20\d{1,2} matches "2016" and any number made of "20" followed by one or two digits ("200" to "209" and "2000" to "2099").
[0-7]      Digit from 0 to 7  201[0-9] matches "2016" and all numbers from "2010" to "2019".
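
The quantifier examples can be compared side by side with the sketch below. It assumes Python's re module for illustration, and full_matches, words and numbers are hypothetical names. Each candidate is tested as a whole with re.fullmatch, so a pattern counts as matching a word or number only when it covers it from start to end. A quantifier applies only to the element immediately before it, so 20{1,2} means "2" followed by one or two zeros, whereas 20\d{1,2} means "20" followed by one or two digits.

```python
import re


def full_matches(pattern: str, candidates: list[str]) -> list[str]:
    """Keep the candidates that the pattern matches from start to end."""
    return [c for c in candidates if re.fullmatch(pattern, c)]


words = ["work", "works", "working"]
numbers = ["20", "200", "201", "2016", "20160"]

print(full_matches(r"work\w*", words))   # ['work', 'works', 'working']
print(full_matches(r"work\w+", words))   # ['works', 'working']
print(full_matches(r"work(s?)", words))  # ['work', 'works']

# When searching inside a longer word, work(s?) still finds the "work" part.
print(re.search(r"work(s?)", "working").group(0))  # work

print(full_matches(r"20\d{2}", numbers))    # ['2016']
print(full_matches(r"20\d{2,}", numbers))   # ['2016', '20160']
print(full_matches(r"20\d{1,2}", numbers))  # ['200', '201', '2016']
print(full_matches(r"20{1,2}", numbers))    # ['20', '200']
```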