Filtering customer data based on valid and invalid semantic types

A pipeline with a source dataset, a Field selector processor, a Semantic filter processor, and two destinations.

Before you begin

  • You have previously created a connection to the system storing your source data.

    Here, a Test connection.

  • You have previously added the dataset holding your source data.

    Download and extract the file attached to this document: semantic_filter-customers.zip. It contains a list of customers with raw data.

  • You have also created the connection and the related datasets that will hold the processed data.

    Here, the processed data will also be stored in two Test datasets.

Procedure

  1. Click Add pipeline on the Pipelines page. Your new pipeline opens.
  2. Give the pipeline a meaningful name.

    Example

    Filtering customer data based on semantic type
  3. Click ADD SOURCE to open the panel allowing you to select your source data, here a list of customers with raw data (inconsistent field case, empty fields, etc.) and pre-discovered semantic types.

    Example

    Preview of a data sample about customers with pre-discovered semantic types.
  4. Select your dataset and click Select in order to add it to the pipeline.
    Rename it if needed.
  5. Click Plus and add a Field selector processor to the pipeline. The configuration panel opens.
  6. Give a meaningful name to the processor.

    Example

    restructure fields
  7. From the Configuration tab:
    1. Click the Edit icon in the Simple selection mode to open the tree view that will allow you to select and rename the fields you want to keep.
    2. Select the following fields in the tree view: ID, FIRSTNAME, LASTNAME, STATE, company_name and EMAIL.
    3. Click the Rename icon next to them and rename them respectively: ID, Firstname, Lastname, State, CompanyName and Email.
  8. Click Save to save your configuration.

    Look at the preview of the processor to compare your data before and after the selection and renaming operation (an illustrative sketch of equivalent select-and-rename logic follows this procedure).

    Preview of the Field selector processor after reorganizing the customer records.
  9. Click Plus and add a Semantic filter processor to the pipeline. The Configuration panel opens.
  10. Give a meaningful name to the processor.

    Example

    filter on valid US phones and emails
  11. In the Filters area:
    1. Select .PhoneNumber in the Input list, as you want to filter this field according to the semantic type associated with it: Phone numbers.
    2. Select Valid in the Keep only list, as you want to keep the valid values after matching them against phone number semantic types.
    3. Add another filter and select .Email in the Input list, as you want to filter this field according to the semantic type associated with it: Email.
    4. Select Valid in the Keep only list, as you want to keep the valid values after matching them against Email semantic types.
  12. Click Save to save your configuration.

    Look at the preview of the processor to compare your data before and after the filtering operation: when matched against their semantic types, one record has an invalid email value (the @ character is missing from the address) and two records have invalid phone number values (missing digits).

    Preview of the Semantic filter processor after filtering on valid phone and email records.
  13. Click the ADD DESTINATION item after the Semantic filter processor and select the dataset that will hold the data that matches the filter criteria: the data with valid values.
    Rename it if needed.
  14. Click the Doesn't match filter button on the Semantic filter processor and click the ADD DESTINATION item to select the dataset that will hold your rejected data: the data with invalid values.
  15. Give a meaningful name to the Destination.

    Example

    invalid customer data
  16. On the top toolbar of Talend Cloud Pipeline Designer, click the Run button to open the panel allowing you to select your run profile.
  17. Select your run profile in the list (for more information, see Run profiles), then click Run to run your pipeline.
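
To make the select-and-rename operation of step 7 easier to reason about, here is a minimal Python sketch of equivalent logic. It is purely illustrative: the field names come from the sample customer dataset, the mapping mirrors the renaming done in the Field selector processor, and none of this code is part of Talend Cloud Pipeline Designer.

    # Illustrative only: an equivalent of the Field selector step (select + rename).
    # Keys are the raw field names from the sample dataset; values are the renamed
    # fields used in the rest of this pipeline.
    FIELD_MAPPING = {
        "ID": "ID",
        "FIRSTNAME": "Firstname",
        "LASTNAME": "Lastname",
        "STATE": "State",
        "company_name": "CompanyName",
        "EMAIL": "Email",
    }

    def restructure_fields(record: dict) -> dict:
        """Keep only the mapped fields, renamed; drop everything else."""
        return {new: record.get(old) for old, new in FIELD_MAPPING.items()}

    # Made-up example record, before and after restructuring.
    raw_record = {
        "ID": "1",
        "FIRSTNAME": "jane",
        "LASTNAME": "DOE",
        "STATE": "CA",
        "company_name": "Acme",
        "EMAIL": "jane.doe@example.com",
        "COMMENTS": "this field is dropped",
    }
    print(restructure_fields(raw_record))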

Results

Your pipeline is being executed: the data is filtered according to the semantic types you have selected, and the output flows are sent to the destinations you have indicated.
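
If you want to picture what "valid" means here outside of the product, the following Python sketch shows a simplified equivalent of the Semantic filter and of the match / doesn't-match split toward the two destinations. The regular expressions below are rough stand-ins chosen for this example; they are not the semantic type definitions that Talend Cloud Pipeline Designer actually applies.

    import re

    # Illustrative only: simplified stand-ins for the Email and Phone number
    # semantic types (NOT the product's definitions).
    EMAIL_PATTERN = re.compile(r"^[^@\s]+@[^@\s]+\.[^@\s]+$")
    US_PHONE_PATTERN = re.compile(r"^\(?\d{3}\)?[ .-]?\d{3}[ .-]?\d{4}$")

    def is_valid(record: dict) -> bool:
        """Keep only records whose Email and PhoneNumber values look valid."""
        email_ok = bool(EMAIL_PATTERN.match(record.get("Email") or ""))
        phone_ok = bool(US_PHONE_PATTERN.match(record.get("PhoneNumber") or ""))
        return email_ok and phone_ok

    def split_by_validity(records):
        """Route records to two outputs, like the match / doesn't-match flows."""
        valid, invalid = [], []
        for record in records:
            (valid if is_valid(record) else invalid).append(record)
        return valid, invalid

    # Made-up sample records: one valid, one with a malformed email,
    # one with a phone number that is missing digits.
    customers = [
        {"Email": "jane.doe@example.com", "PhoneNumber": "555-123-4567"},
        {"Email": "john.doe-example.com", "PhoneNumber": "555-987-6543"},
        {"Email": "ada@example.com", "PhoneNumber": "555-12"},
    ]
    valid_customers, invalid_customers = split_by_validity(customers)
    print(len(valid_customers), "valid /", len(invalid_customers), "invalid")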

What to do next

Alternatively, you can send your invalid records to a Data Stewardship campaign destination. This will allow data stewards to review and correct invalid data.
