Regular Expressions

Talend Data Preparation User Guide

Author: Talend Documentation Team
Version: 6.3, 2.0
Products: Talend Data Fabric, Talend Real-Time Big Data Platform, Talend Big Data Platform, Talend Big Data, Talend MDM Platform, Talend Data Integration, Talend Data Services Platform, Talend Data Management Platform, Talend ESB
Task: Data Quality and Preparation > Cleansing data
Platform: Talend Data Preparation

Regular expressions (or regex) are advanced search strings that allow you to match complex patterns.

In this documentation, the regular expression elements are classified by category.

All the examples listed apply to the following two lines of sample text:

Comment from happy_user@company.com (04-Apr-2016):

I love working with Talend Data Preparation! It really helps me with all my daily tasks!

Regular Expressions Examples

Regular Expression    Matches
\bTa                  the "Ta" in "Talend"
\bw\w*                working, with
\w+n\b                Preparation
Talend\s\w+\s\w+      Talend Data Preparation
task(s?)              tasks (it would also match "task")
\w+@\w+.com           happy_user@company.com
\d{2}-.*-\d+          04-Apr-2016
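
As a rough check, the examples above can be reproduced with Python's re module. This is only a sketch: Python is an assumption made for illustration, the engine used by Talend Data Preparation may differ in details, and the helper name all_matches is hypothetical.

```python
import re

# The two sample lines used throughout this section.
TEXT = (
    "Comment from happy_user@company.com (04-Apr-2016):\n"
    "I love working with Talend Data Preparation! "
    "It really helps me with all my daily tasks!"
)


def all_matches(pattern, text=TEXT):
    """Return the full text of every non-overlapping match, left to right."""
    return [m.group(0) for m in re.finditer(pattern, text)]


for pattern in [
    r"\bTa",               # only the "Ta" prefix of "Talend"
    r"\bw\w*",             # "working" and both occurrences of "with"
    r"\w+n\b",             # words ending in "n"
    r"Talend\s\w+\s\w+",   # "Talend" followed by two more words
    r"task(s?)",           # "task" with an optional "s"
    r"\w+@\w+.com",        # the e-mail address
    r"\d{2}-.*-\d+",       # the date
]:
    print(pattern, "->", all_matches(pattern))
```

Note that the pattern reports the matched text only, so \bTa returns "Ta" rather than the whole word.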

Anchors

Character  Matches                                                    Example
^          Start of string, or start of line in a multi-line pattern  ^Comment matches "Comment" at the beginning of the line.
                                                                      ^C.* matches the first line.
$          End of string, or end of line in a multi-line pattern      !$ matches the last exclamation mark.
\b         Word boundary                                              \bwo matches the "wo" in "working".
                                                                      \bwo\w+ matches "working".
                                                                      ng\b matches the "ng" in "working".
                                                                      \w+ng\b matches "working".
\B         Not word boundary                                          \Bh matches the final "h" in "with" but not the "h" in "helps" or "happy".
                                                                      h\B matches the first "h" in "helps" and "happy" but not the final one in "with".
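
The anchor behavior described above can be sketched in Python. Python's re module is an assumption here, not necessarily the engine behind Talend Data Preparation; in Python, the multi-line behavior of ^ and $ has to be switched on with the re.MULTILINE flag.

```python
import re

TEXT = (
    "Comment from happy_user@company.com (04-Apr-2016):\n"
    "I love working with Talend Data Preparation! "
    "It really helps me with all my daily tasks!"
)

# ^ anchors the match at the start of the string; .* stops at the line break.
first_line = re.search(r"^C.*", TEXT).group(0)

# Without re.MULTILINE, ^ matches only at the very start of the string.
starts_default = re.findall(r"^\w+", TEXT)
# With re.MULTILINE, ^ also matches right after each line break.
starts_multiline = re.findall(r"^\w+", TEXT, re.MULTILINE)

# $ anchors the match at the end of the string.
last_bang = re.search(r"!$", TEXT)

# \b is a word boundary, \B is any position that is not a word boundary.
# Each match is shown with one neighboring character for context.
inner_h = [TEXT[m.start() - 1:m.end()] for m in re.finditer(r"\Bh", TEXT)]
leading_h = [TEXT[m.start():m.end() + 1] for m in re.finditer(r"h\B", TEXT)]

print(first_line)
print(starts_default, starts_multiline)
print(last_bang.start(), len(TEXT) - 1)
print(inner_h, leading_h)
```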

Character Classes

Character  Matches                                         Example
.          Any character, except new line (\n)             . matches all the characters in the text, except for the line break.
\s         White space                                     Talend\sData matches "Talend Data".
                                                           Data\s+Preparation matches "Data Preparation".
\S         Not white space                                 \S matches all the characters in the sentence, except for the spaces.
\d         Digit                                           \d{4} matches "2016".
\D         Not digit                                       \D matches all the characters in the text except the digits.
\w         Word character (letter, digit, or underscore)   T\w+ matches "Talend".
\W         Not word character                              company\Wcom matches "company.com".
\n         New line                                        .*\n.* matches the whole text.
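
The character classes above can be tried out on the sample text with a short Python sketch. Python's re module is assumed for illustration only; other engines may treat some classes slightly differently.

```python
import re

TEXT = (
    "Comment from happy_user@company.com (04-Apr-2016):\n"
    "I love working with Talend Data Preparation! "
    "It really helps me with all my daily tasks!"
)

# . matches every character except the line break between the two lines.
dot_count = len(re.findall(r".", TEXT))

# \s matches white space, \S matches everything else.
talend_data = re.findall(r"Talend\sData", TEXT)
non_space = "".join(re.findall(r"\S", TEXT))

# \d matches digits, \D matches everything else.
year = re.findall(r"\d{4}", TEXT)
non_digits = "".join(re.findall(r"\D", TEXT))

# \w matches letters, digits and underscores, \W matches everything else.
word = re.findall(r"T\w+", TEXT)
domain = re.findall(r"company\Wcom", TEXT)

# \n matches the line break, so this pattern spans both lines.
whole_text = re.search(r".*\n.*", TEXT).group(0)

print(dot_count, len(TEXT))
print(talend_data, year, word, domain)
```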

Escape Characters

Character  Matches
\.         .
\\         \
\+         +
\*         *
\?         ?
\$         $
\[         [
\]         ]
\{         {
\}         }
\(         (
\)         )
\|         |
\/         /
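
The following Python sketch shows why escaping matters. It assumes Python's re module, and the string companyXcom is made up for illustration. The re.escape function adds the backslashes for you when a whole string must be matched literally.

```python
import re

# Made-up sample: the second word has an "X" where the first has a dot.
sample = "company.com and companyXcom"

# An unescaped dot matches any character, so both words match.
unescaped = re.findall(r"company.com", sample)
# An escaped dot matches only a literal ".".
escaped = re.findall(r"company\.com", sample)

first_line = "Comment from happy_user@company.com (04-Apr-2016):"
literal = "(04-Apr-2016):"

# Used as-is, the parentheses are read as a group, not as characters,
# so the pattern looks for "04-Apr-2016:" and finds nothing.
as_pattern = re.search(literal, first_line)
# re.escape() escapes the special characters so the text matches literally.
as_literal = re.search(re.escape(literal), first_line)

print(unescaped, escaped)
print(as_pattern, as_literal.group(0))
```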

Groups and Ranges

Character  Matches                Example
()         Group                  m(e|y) matches "me" and "my".
(a|b)      a or b                 m(e|y) matches "me" (in "Comment"), "me" and "my".
[abc]      Range (a or b or c)    m[ey] matches "me" (in "Comment"), "me" and "my".
[a-q]      Letter from a to q     m[a-m] matches "mm" (in "Comment") and "me" but not "my".
[0-7]      Digit from 0 to 7      201[0-5] does not match "2016" but would match all years between "2010" and "2015".
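
Alternation and ranges can be checked with a small Python sketch. Python's re module is an assumption here, and the list of years and the string "0789" are made up for illustration.

```python
import re

TEXT = (
    "Comment from happy_user@company.com (04-Apr-2016):\n"
    "I love working with Talend Data Preparation! "
    "It really helps me with all my daily tasks!"
)

# (e|y) and [ey] both accept either "e" or "y" after the "m".
# group(0) is the whole match, not just the content of the group.
with_group = [m.group(0) for m in re.finditer(r"m(e|y)", TEXT)]
with_set = re.findall(r"m[ey]", TEXT)

# A range such as [0-5] accepts any single character between its bounds.
years = [str(y) for y in range(2010, 2020)]
matched_years = [y for y in years if re.fullmatch(r"201[0-5]", y)]

# [0-7] accepts the digits 0 to 7 but rejects 8 and 9.
low_digits = re.findall(r"[0-7]", "0789")

print(with_group, with_set)
print(matched_years)
print(low_digits)
```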

The text captured by a group can be reused in a replacement expression with the $ symbol. When more than one group is captured, add a number after the $ symbol that corresponds to the order in which the groups were captured.

For example, the value Y16Q02 is matched by the regular expression Y(\d{2})Q(\d{2}), which captures 16 as the first group and 02 as the second. You can then rewrite the original value keeping only the characters you have captured. If you want the new value to be Quarter 02 of year 2016, use the replacement expression Quarter $2 of year 20$1.
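
The same rewrite can be sketched in Python for comparison. This assumes Python's re module, where captured groups are referenced in the replacement as \1 and \2 instead of $1 and $2.

```python
import re

pattern = r"Y(\d{2})Q(\d{2})"
# \2 is the quarter (second group), \1 is the two-digit year (first group).
replacement = r"Quarter \2 of year 20\1"

captured = re.fullmatch(pattern, "Y16Q02")
print(captured.group(1), captured.group(2))

result = re.sub(pattern, replacement, "Y16Q02")
print(result)
```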

Quantifiers

Character  Matches                Examples
*          0 or more              work\w* matches "working" but also "work" and "works".
+          1 or more              work\w+ matches "working" but also "works". However, it does not match "work".
?          0 or 1                 work(s?) matches "work" and "works" but not "working" as a whole (only its "work" part).
{3}        Exactly 3              20\d{2} matches "2016" and other numbers between "2000" and "2099".
{3,}       3 or more              20\d{2,} matches "2016" and any number of four or more digits starting with "20".
{3,5}      3, 4 or 5              20\d{1,2} matches "2016" and any number made of "20" followed by one or two digits, such as "205" or "2099".
[0-9]      Digit from 0 to 9      201[0-9] matches "2016" and all numbers from "2010" to "2019".
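
The quantifiers can be compared side by side with a short Python sketch. Python's re module is assumed, the helper name whole is hypothetical, and re.fullmatch is used so that a value counts only when the pattern covers it entirely.

```python
import re


def whole(pattern, value):
    """Return True when the pattern matches the entire value."""
    return re.fullmatch(pattern, value) is not None


words = ["work", "works", "working"]

# * allows zero or more word characters after "work".
star = [w for w in words if whole(r"work\w*", w)]
# + requires at least one word character after "work".
plus = [w for w in words if whole(r"work\w+", w)]
# ? allows at most one "s" after "work".
optional = [w for w in words if whole(r"work(s?)", w)]

# {2} means exactly two digits, {2,} means two or more.
exactly = [v for v in ["201", "2016", "20160"] if whole(r"20\d{2}", v)]
at_least = [v for v in ["201", "2016", "20160"] if whole(r"20\d{2,}", v)]
# {3,5} means three, four or five digits.
between = [v for v in ["12", "123", "12345", "123456"] if whole(r"\d{3,5}", v)]

print(star, plus, optional)
print(exactly, at_least, between)
```

When the pattern is searched for inside a longer value instead, work(s?) still finds the "work" part of "working".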