Filtering values using patterns - 7.3

Talend Data Preparation User Guide

Version: 7.3
Language: English
Product: Talend Big Data, Talend Big Data Platform, Talend Data Fabric, Talend Data Integration, Talend Data Management Platform, Talend Data Services Platform, Talend ESB, Talend MDM Platform, Talend Real-Time Big Data Platform
Module: Talend Data Preparation
Content: Data Quality and Preparation > Cleansing data
Last publication date: 2023-11-28

The Pattern tab of the profiling area shows a graphical representation of the type and number of characters your data is made of.

In other words, you can see how the records are structured, at either word or character granularity. It is also a quick and easy way to apply a filter to your data.

When you select the content of a column, a horizontal bar chart displays the distribution of the patterns used in that column. The patterns displayed by default depend on the type of data that you select:

  • Word-based if the column type is text or boolean
  • Character-based if the column type is date or number

Whatever the type of data, you can switch between character-based and word-based patterns from the Pattern tab.

Analyzing word-based patterns is an efficient way to detect data quality issues in first names or last names, for example. Names that contain punctuation or numbers, rather than words only, immediately stand out. Character-based patterns, on the other hand, are better suited to structured data, such as client IDs or account numbers. The chart shows at a glance whether any values have the wrong number of characters or digits.
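
To make the two granularities concrete, here is a minimal Python sketch of how such patterns could be computed outside the tool. It is an illustration only and not the implementation used by Talend Data Preparation: the function names are hypothetical, the symbols a, A and 9 for the character-based view are an assumption, and only the [word] and [number] tokens that appear in this topic are modeled.

```python
import re
from collections import Counter


def char_pattern(value: str) -> str:
    """Character-based pattern: one symbol per character (assumed symbols)."""
    out = []
    for ch in value:
        if ch.isdigit():
            out.append("9")            # any digit
        elif ch.isalpha():
            out.append("A" if ch.isupper() else "a")  # letter, by case
        else:
            out.append(ch)             # punctuation is kept as-is
    return "".join(out)


def word_pattern(value: str) -> str:
    """Word-based pattern: runs of ASCII letters or digits become one token."""
    def token(match: re.Match) -> str:
        return "[number]" if match.group()[0].isdigit() else "[word]"
    return re.sub(r"[A-Za-z]+|[0-9]+", token, value)


# Hypothetical sample values, not taken from the guide's dataset.
values = ["jane@example.com", "bob@example.org", "john42@example.com"]

# Count how many values share each pattern, most common first,
# which is the information a pattern bar chart would plot.
distribution = Counter(word_pattern(v) for v in values).most_common()

print(distribution)
print(char_pattern("AB-1234"))  # AA-9999
```

With the character-based view, an account number that is one digit short produces a visibly different pattern (for example AA-999 instead of AA-9999), which is why this granularity suits structured identifiers.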

The procedure below uses a dataset with typical customer information, such as names, email addresses, the companies the customers work for, and their subscription dates.

Procedure

  1. Select a column containing data that you want to filter, email for example.
  2. In the profiling area, click the Pattern tab.
    The different patterns used in this column are displayed in the form of a chart. Because this column contains text data, the chart shows the distribution of the data using word-based patterns.
  3. Switch to the character-based view by clicking the A icon.
    This gives you another perspective from which to analyze your data.
  4. Switch back to the word-based view by clicking the Text icon.
  5. Click the top bar to apply a filter on the most common pattern.

    The preparation now only displays the rows with the [word]@[word].[word] format.

    You can also use Ctrl + Click or Shift + Click to select multiple values at the same time and apply a more complex filter.

  6. While pressing the Ctrl key, click the bar corresponding to the [word][number]@[word].[word] pattern to add this filter to the previous one.
    The grid now only displays the data corresponding to those two filters.
  7. In the Functions panel, click a function to apply it to the filtered data, for example Delete these Filtered Rows.
  8. In the filter bar, click the cross in each individual filter or click the garbage bin icon to clear the filters and display the whole dataset again.
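
The logic of steps 5 to 7 can be sketched in Python as follows. This is a rough illustration under stated assumptions, not the behavior of Talend Data Preparation itself: the rows are hypothetical sample data, and word_pattern and matches are hypothetical helpers that only model the [word] and [number] tokens shown above.

```python
import re


def word_pattern(value: str) -> str:
    """Simplified word-based pattern: letter runs -> [word], digit runs -> [number]."""
    def token(match: re.Match) -> str:
        return "[number]" if match.group()[0].isdigit() else "[word]"
    return re.sub(r"[A-Za-z]+|[0-9]+", token, value)


# Hypothetical customer rows, invented for this sketch.
rows = [
    {"name": "Jane", "email": "jane@example.com"},
    {"name": "John", "email": "john42@example.com"},
    {"name": "Ann", "email": "ann.lee@example.com"},
    {"name": "Bob", "email": "bob@example"},
]

# Steps 5 and 6: two patterns selected together, as with Ctrl + Click.
selected = {"[word]@[word].[word]", "[word][number]@[word].[word]"}


def matches(row: dict, column: str, patterns: set) -> bool:
    """True if the value in the given column has one of the selected patterns."""
    return word_pattern(row[column]) in patterns


# What the grid would show while both filters are active.
filtered = [r for r in rows if matches(r, "email", selected)]

# Step 7: deleting the filtered rows leaves only the non-matching ones.
remaining = [r for r in rows if not matches(r, "email", selected)]

print([r["name"] for r in filtered])
print([r["name"] for r in remaining])
```

Selecting several patterns behaves like a logical OR in this sketch: a row is kept if its value matches any of the selected patterns.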