Filtering customer data based on valid and invalid semantic types

A pipeline with a source dataset, a Field selector processor, a Semantic filter processor, and two destinations.

Before you begin

  • You have previously created a connection to the system storing your source data.

    Here, a Test connection.

  • You have previously added the dataset holding your source data.

    Download and extract the file attached to this document: semantic_filter-customers.zip. It contains a list of customers with raw data.

  • You have also created the connection and the related datasets that will hold the processed data.

    Here, the processed data will also be stored in two Test datasets.

Procedure

  1. Click Add pipeline on the Pipelines page. Your new pipeline opens.
  2. Give the pipeline a meaningful name.

    Example

    Filtering customer data based on semantic type
  3. Click ADD SOURCE to open the panel allowing you to select your source data, here a list of customers with raw data (inconsistent field case, empty fields, etc.) and pre-discovered semantic types.

    Example

    Preview of a data sample about customers with pre-discovered semantic types.
  4. Select your dataset and click Select in order to add it to the pipeline.
    Rename it if needed.
  5. Click Plus and add a Field selector processor to the pipeline. The configuration panel opens.
  6. Give a meaningful name to the processor.

    Example

    restructure fields
  7. From the Configuration tab:
    1. Click the Edit icon in the Simple selection mode to open the tree view that will allow you to select and rename the fields you want to keep.
    2. Select the following fields in the tree view: ID, FIRSTNAME, LASTNAME, STATE, company_name and EMAIL.
    3. Click the Rename icon next to them and rename them respectively: ID, Firstname, Lastname, State, CompanyName and Email.
  8. Click Save to save your configuration.

    Look at the preview of the processor to compare your data before and after the selection and renaming operation (an illustrative sketch of equivalent select-and-rename logic follows this procedure).

    Preview of the Field selector processor after reorganizing the customer records.
  9. Click Plus and add a Semantic filter processor to the pipeline. The Configuration panel opens.
  10. Give a meaningful name to the processor.

    Example

    filter on valid US phones and emails
  11. In the Filters area:
    1. Select .PhoneNumber in the Input list, as you want to filter this field according to the semantic type associated with it: Phone numbers.
    2. Select Valid in the Keep only list, as you want to keep the valid values after matching them against phone number semantic types.
    3. Add another filter and select .Email in the Input list, as you want to filter this field according to the semantic type associated with it: Email.
    4. Select Valid in the Keep only list, as you want to keep the valid values after matching them against Email semantic types.
  12. Click Save to save your configuration.

    Look at the preview of the processor to compare your data before and after the filtering operation: when matched against their semantic types, one record has an invalid email value (the @ character is missing from the address) and two records have invalid phone number values (missing digits).

    Preview of the Semantic filter processor after filtering on valid phone and email records.
  13. Click the ADD DESTINATION item after the Semantic filter processor and select the dataset that will hold the data that matches the filter criteria: the data with valid values.
    Rename it if needed.
  14. Click the Doesn't match filter button on the Semantic filter processor and click the ADD DESTINATION item to select the dataset that will hold your rejected data: the data with invalid values.
  15. Give a meaningful name to the Destination.

    Example

    invalid customer data
  16. On the top toolbar of Talend Cloud Pipeline Designer, click the Run button to open the panel allowing you to select your run profile.
  17. Select your run profile in the list (for more information, see Run profiles), then click Run to run your pipeline.
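
To make the select-and-rename operation of step 7 easier to reason about, here is a minimal Python sketch of equivalent logic. It is purely illustrative: the field names come from the sample customer dataset, the mapping mirrors the renaming done in the Field selector processor, and none of this code is part of Talend Cloud Pipeline Designer.

    # Illustrative only: an equivalent of the Field selector step (select + rename).
    # Keys are the raw field names from the sample dataset; values are the renamed
    # fields used in the rest of this pipeline.
    FIELD_MAPPING = {
        "ID": "ID",
        "FIRSTNAME": "Firstname",
        "LASTNAME": "Lastname",
        "STATE": "State",
        "company_name": "CompanyName",
        "EMAIL": "Email",
    }

    def restructure_fields(record: dict) -> dict:
        """Keep only the mapped fields, renamed; drop everything else."""
        return {new: record.get(old) for old, new in FIELD_MAPPING.items()}

    # Made-up example record, before and after restructuring.
    raw_record = {
        "ID": "1",
        "FIRSTNAME": "jane",
        "LASTNAME": "DOE",
        "STATE": "CA",
        "company_name": "Acme",
        "EMAIL": "jane.doe@example.com",
        "COMMENTS": "this field is dropped",
    }
    print(restructure_fields(raw_record))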

Results

Your pipeline is being executed: the data is filtered according to the semantic types you have selected, and the output flows are sent to the destinations you have indicated.
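
If you want to picture what "valid" means here outside of the product, the following Python sketch shows a simplified equivalent of the Semantic filter and of the match / doesn't-match split toward the two destinations. The regular expressions below are rough stand-ins chosen for this example; they are not the semantic type definitions that Talend Cloud Pipeline Designer actually applies.

    import re

    # Illustrative only: simplified stand-ins for the Email and Phone number
    # semantic types (NOT the product's definitions).
    EMAIL_PATTERN = re.compile(r"^[^@\s]+@[^@\s]+\.[^@\s]+$")
    US_PHONE_PATTERN = re.compile(r"^\(?\d{3}\)?[ .-]?\d{3}[ .-]?\d{4}$")

    def is_valid(record: dict) -> bool:
        """Keep only records whose Email and PhoneNumber values look valid."""
        email_ok = bool(EMAIL_PATTERN.match(record.get("Email") or ""))
        phone_ok = bool(US_PHONE_PATTERN.match(record.get("PhoneNumber") or ""))
        return email_ok and phone_ok

    def split_by_validity(records):
        """Route records to two outputs, like the match / doesn't-match flows."""
        valid, invalid = [], []
        for record in records:
            (valid if is_valid(record) else invalid).append(record)
        return valid, invalid

    # Made-up sample records: one valid, one with a malformed email,
    # one with a phone number that is missing digits.
    customers = [
        {"Email": "jane.doe@example.com", "PhoneNumber": "555-123-4567"},
        {"Email": "john.doe-example.com", "PhoneNumber": "555-987-6543"},
        {"Email": "ada@example.com", "PhoneNumber": "555-12"},
    ]
    valid_customers, invalid_customers = split_by_validity(customers)
    print(len(valid_customers), "valid /", len(invalid_customers), "invalid")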

What to do next

Alternatively, you can send your invalid records to a Data Stewardship campaign destination. This will allow data stewards to review and correct invalid data.
