Regular Expressions

Talend Data Preparation User Guide

Version: 8.0

Language: English

Product: Talend Big Data, Talend Big Data Platform, Talend Data Fabric, Talend Data Integration, Talend Data Management Platform, Talend Data Services Platform, Talend ESB, Talend MDM Platform, Talend Real-Time Big Data Platform

Module: Talend Data Preparation

Content: Data Quality and Preparation > Cleansing data

Last publication date: 2024-03-26

Regular expressions (or regex) are advanced search strings that allow you to match complex patterns.

In this documentation, the regular expression elements are classified by category.

All the examples listed are applied to the following two lines:

Comment from happy_user@company.com (04-Apr-2016):

I love working with Talend Data Preparation! It really helps me with all my daily tasks!

Regular Expressions Examples

Regular Expression	Matches
`\bTa`	The "Ta" at the beginning of "Talend"
`\bw\w*`	working, with
`\w+n\b`	Preparation
`Talend\s\w+\s\w+`	Talend Data Preparation
`task(s?)`	tasks (it would also match "task")
`\w+@\w+\.com`	happy_user@company.com (the dot is escaped so that it only matches a literal ".")
`\d{2}-.*-\d+`	04-Apr-2016
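
The examples above can be checked outside the product. The following is a minimal sketch that assumes Python's `re` module, whose syntax for these elements is the same as the one described on this page; matching behavior inside Talend Data Preparation may differ in details. The name `TEXT` is only an illustrative variable holding the two sample lines, and the dot in the email pattern is escaped so that it only matches a literal ".".

```python
import re

# The two sample lines used throughout this page (TEXT is an illustrative name).
TEXT = (
    "Comment from happy_user@company.com (04-Apr-2016):\n"
    "I love working with Talend Data Preparation! "
    "It really helps me with all my daily tasks!"
)

# Patterns taken from the table above.
patterns = [
    r"\bw\w*",
    r"\w+n\b",
    r"Talend\s\w+\s\w+",
    r"\w+@\w+\.com",
    r"\d{2}-.*-\d+",
]

# Map each pattern to every non-overlapping match found in TEXT.
results = {p: [m.group(0) for m in re.finditer(p, TEXT)] for p in patterns}

for pattern, matches in results.items():
    print(f"{pattern:20} {matches}")
```

Note that `\bw\w*` finds "with" twice, because the word appears twice in the second line.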

Anchors

Character	Matches	Example
`^`	Start of string, or start of line in a multi-line pattern	`^Comment` matches "Comment" at the beginning of the line. `^C.*` matches the first line.
`$`	End of string, or end of line in a multi-line pattern	`!$` matches the last exclamation mark.
`\b`	Word boundary	`\bwo` matches the "wo" in "working". `\bwo\w+` matches "working". `ng\b` matches the "ng" in "working". `\w+ng\b` matches "working".
`\B`	Not word boundary	`\Bh` matches the final "h" in "with" but not the "h" in "helps" or "happy". `h\B` matches the first "h" in "helps" and "happy" but not the final one in "with".
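
As a rough illustration of the anchors above, the sketch below again assumes Python's `re` module and an illustrative `TEXT` variable holding the two sample lines. In Python, the multi-line behavior of `^` and `$` is switched on with the `re.MULTILINE` flag; how this mode is enabled in Talend Data Preparation is not covered here.

```python
import re

TEXT = (
    "Comment from happy_user@company.com (04-Apr-2016):\n"
    "I love working with Talend Data Preparation! "
    "It really helps me with all my daily tasks!"
)

# "^C.*" starts at the beginning of the text; "." stops at the line break,
# so the match is exactly the first line.
first_line = re.search(r"^C.*", TEXT).group(0)

# "!$" only matches the exclamation mark at the very end of the text.
last_bang = re.search(r"!$", TEXT)

# With re.MULTILINE, "^" also matches right after each line break.
line_starts = re.findall(r"^\w+", TEXT, flags=re.MULTILINE)

# Word boundaries: both patterns isolate the word "working".
word_start = re.findall(r"\bwo\w+", TEXT)
word_end = re.findall(r"\w+ng\b", TEXT)

# "\Bh" finds an "h" that is not at the start of a word (the two "with").
inner_h = [TEXT[m.start() - 3 : m.start() + 1] for m in re.finditer(r"\Bh", TEXT)]

# "h\B" finds an "h" that is not at the end of a word ("happy", "helps").
leading_h_count = len(re.findall(r"h\B", TEXT))
```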

Character Classes

Character	Matches	Example
`.`	Any character, except new line (\n)	`.` matches all the characters in the text, except for the line break.
`\s`	White space	`Talend\sData` matches "Talend Data". `Data\s+Preparation` matches "Data Preparation".
`\S`	Not white space	`\S` matches all the characters in the text, except for the white space characters (the spaces and the line break).
`\d`	Digit	`\d{4}` matches "2016".
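`\D`	Not digit	`\D` matches all the characters in the text, except for the digits.
`\w`	Word character (letter, digit, or underscore)	`T\w+` matches "Talend".
`\W`	Not a word character	`company\Wcom` matches "company.com".
`\n`	New line	`.\n.` matches the last character of the first line, the line break, and the first character of the second line (":", line break, "I").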
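
The character classes can be tried out in the same way. This is a small sketch under the same assumptions as before (Python's `re` module, an illustrative `TEXT` variable with the two sample lines), not a description of how the product evaluates them.

```python
import re

TEXT = (
    "Comment from happy_user@company.com (04-Apr-2016):\n"
    "I love working with Talend Data Preparation! "
    "It really helps me with all my daily tasks!"
)

# "." matches every character except the single line break.
dot_count = len(re.findall(r".", TEXT))

# "\s" and "\s+" match the space between two words.
spaced = re.search(r"Data\s+Preparation", TEXT).group(0)

# "\S" keeps everything that is not white space.
no_whitespace = "".join(re.findall(r"\S", TEXT))

# "\d{4}" needs four consecutive digits, so "04" is skipped.
four_digits = re.findall(r"\d{4}", TEXT)

# "\W" matches the "." between "company" and "com".
domain = re.search(r"company\Wcom", TEXT).group(0)

# ".\n." spans the line break: one character on each side of it.
around_break = re.search(r".\n.", TEXT).group(0)
```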

Escape Characters

Character	Matches
`\.`	.
`\\`	\
`\+`	+
`\*`	*
`\?`	?
`\$`	$
`\[`	[
`\]`	]
`\{`	{
`\}`	}
`\(`	(
`\)`	)
`\|`	|
`\/`	/
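
To illustrate why escaping matters, here is a short sketch that assumes Python's `re` module. The `re.escape` helper is specific to Python and is shown only as one convenient way to escape a literal string; `TEXT` and the other variable names are illustrative.

```python
import re

TEXT = (
    "Comment from happy_user@company.com (04-Apr-2016):\n"
    "I love working with Talend Data Preparation! "
    "It really helps me with all my daily tasks!"
)

# Escaped parentheses match literal "(" and ")" instead of opening a group.
dated = re.search(r"\(\d{2}-\w+-\d{4}\)", TEXT).group(0)

# An unescaped "." matches any character, so it also accepts "company_com".
loose = re.fullmatch(r"company.com", "company_com") is not None
# An escaped "\." only accepts a literal ".".
strict = re.fullmatch(r"company\.com", "company_com") is not None

# re.escape builds a pattern that matches its argument literally.
literal = "(04-Apr-2016)"
escaped_pattern = re.escape(literal)
found = re.search(escaped_pattern, TEXT).group(0)
```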

Groups and Ranges

Character	Matches	Example
`()`	Group	`m(e|y)` matches "me" (in "Comment"), "me" and "my".
`(a|b)`	a or b	`m(e|y)` matches "me" (in "Comment"), "me" and "my".
`[abc]`	Character set (a or b or c)	`m[ey]` matches "me" (in "Comment"), "me" and "my".
`[a-q]`	Letter from a to q	`m[a-m]` matches "mm" (in "Comment") and "me" but not "my".
`[0-7]`	Digit from 0 to 7	`201[0-5]` does not match "2016" but would match all years between "2010" and "2015".
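
The difference between alternation, character sets, and ranges can be seen in the following sketch, which assumes Python's `re` module and an illustrative `TEXT` variable with the two sample lines. The string `"2010 2015 2016"` is made-up test data.

```python
import re

TEXT = (
    "Comment from happy_user@company.com (04-Apr-2016):\n"
    "I love working with Talend Data Preparation! "
    "It really helps me with all my daily tasks!"
)

def all_matches(pattern, text):
    """Return the full text of every non-overlapping match."""
    return [m.group(0) for m in re.finditer(pattern, text)]

# Alternation inside a group and a character set give the same matches here.
with_group = all_matches(r"m(e|y)", TEXT)
with_set = all_matches(r"m[ey]", TEXT)

# A range: "m" followed by any letter from "a" to "m".
# In "Comment", the first "m" is followed by another "m", so "mm" is matched.
with_range = all_matches(r"m[a-m]", TEXT)

# A digit range: years from 2010 to 2015 only.
years = all_matches(r"201[0-5]", "2010 2015 2016")
```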

The text captured by a group can be reused with the $ symbol. When more than one group is captured, add a number after the $ symbol that corresponds to the order in which the groups were captured: $1 for the first group, $2 for the second, and so on.

For example, suppose you want to reformat the value Y16Q02, which is matched by the regular expression `Y(\d{2})Q(\d{2})`. The first group captures 16 and the second group captures 02, so you can build the new value from the captured characters only. To obtain Quarter 02 of year 2016, use the replacement expression `Quarter $2 of year 20$1`.
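
The same replacement can be sketched in Python, with one caveat: Python's `re` module writes group references as `\g<1>` and `\g<2>` (or `\1` and `\2`) rather than `$1` and `$2`. The mapping between the two notations is an assumption made for this illustration, and the variable names are hypothetical.

```python
import re

source = "Y16Q02"

# Group 1 captures the two-digit year, group 2 the two-digit quarter.
pattern = r"Y(\d{2})Q(\d{2})"

# Python equivalent of the replacement "Quarter $2 of year 20$1".
# \g<1> is used instead of \1 so that the digits "20" before it stay unambiguous.
replacement = r"Quarter \g<2> of year 20\g<1>"

captured = re.fullmatch(pattern, source).groups()
reformatted = re.sub(pattern, replacement, source)

print(captured)
print(reformatted)
```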

Quantifiers

Character	Matches	Examples
`*`	0 or more	`work\w*` matches "working" but also "work" and "works".
`+`	1 or more	`work\w+` matches "working" but also "works". However, it does not match "work".
`?`	0 or 1	`work(s?)` matches "work" and "works". In "working", it matches only the "work" part.
`{3}`	Exactly 3	`20\d{2}` matches "2016" and any other number from "2000" to "2099".
`{3,}`	3 or more	`20\d{2,}` matches "2016" and any number that starts with "20" followed by at least two more digits, such as "2000" or "20160".
`{3,5}`	3, 4 or 5	`20\d{1,2}` matches "2016" and all numbers from "200" to "209" and from "2000" to "2099".
`[0-7]`	Digit from 0 to 7	`201[0-9]` matches "2016" and all numbers from "2010" to "2019".
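
The quantifiers can be compared side by side with the sketch below. It assumes Python's `re` module; the string `"work works working"` is made-up test data, and `re.fullmatch` is used so that a pattern has to cover the whole number being tested. The last lines show a `{1,2}` quantifier applied to `\d`, and what happens when it is applied to the literal "0" instead.

```python
import re

words = "work works working"

# "*" allows zero extra word characters, so the bare "work" is matched too.
star = re.findall(r"work\w*", words)

# "+" requires at least one extra word character, so "work" alone is skipped.
plus = re.findall(r"work\w+", words)

# "?" makes the "s" optional; inside "working" only "work" is matched.
optional = [m.group(0) for m in re.finditer(r"work(s?)", words)]

def matches_fully(pattern, text):
    """Return True when the pattern covers the whole text."""
    return re.fullmatch(pattern, text) is not None

exactly_two = [n for n in ["20", "201", "2016", "20160"] if matches_fully(r"20\d{2}", n)]
two_or_more = [n for n in ["20", "201", "2016", "20160"] if matches_fully(r"20\d{2,}", n)]
one_or_two = [n for n in ["20", "201", "2016", "20160"] if matches_fully(r"20\d{1,2}", n)]

# Applied to the literal "0", {1,2} only accepts "20" or "200".
zero_repeated = [n for n in ["20", "200", "2016"] if matches_fully(r"20{1,2}", n)]
```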