
Selecting records of deduplicated Tate gallery artists

A pipeline with a source, a Field selector processor, and a destination.

Before you begin

  • You have previously added the dataset holding your source data.

    Download and extract the file: field_selector-artists.zip. It contains a dataset of artists of the Tate galleries in London (including their name, date of birth, URL of their Tate page, etc.) with some duplicate names.

  • You have also created the connection and the related dataset that will hold the processed data.

    Here, a file stored on a Test connection.

Procedure

  1. Click Add pipeline on the Pipelines page. Your new pipeline opens.
  2. Give the pipeline a meaningful name.

    Example

    Select deduplicated artists
  3. Click ADD SOURCE to open the panel allowing you to select your source data, here a list of Tate artists with some duplicates.
    Preview of a data sample with Tate artist records.
  4. Select your dataset and click Select in order to add it to the pipeline.
    Rename it if needed.
  5. Click Plus and add a Field selector processor to the pipeline. The configuration panel opens.
  6. Give a meaningful name to the processor.

    Example

    select fields with distinct
  7. Enable the Distinct option to return only records whose selected fields have different values, which removes the duplicates.
  8. Click the Edit icon in the Simple mode to open the Select fields window:
    1. Select name in the Input list and enter full_name in the Output list, as you want to select and rename the field holding the artists' names.
    2. Select yearOfBirth in the Input list and enter year_of_birth in the Output list, as you want to select and rename the field holding the artists' years of birth.
    3. Select yearOfDeath in the Input list and enter year_of_death in the Output list, as you want to select and rename the field holding the artists' years of death.
      The Field selector configuration panel shows 3 selected fields with the Distinct option enabled.
  9. Click Save to save your configuration.

    Look at the preview of the processor to compare your data before and after the select and distinct operations. The artists' names are deduplicated and only the records with different values in the selected fields are returned.

    Preview of the Field selector processor after deduplicating records.
  10. Click ADD DESTINATION and select the dataset that will hold your reorganized data.
    Rename it if needed.
  11. On the top toolbar of Talend Cloud Pipeline Designer, click the Run button to open the panel allowing you to select your run profile.
  12. Select your run profile in the list (for more information, see Run profiles), then click Run to run your pipeline.

Results

Your pipeline is being executed: the data is reorganized according to the conditions you have stated, and the output is sent to the target system you have indicated.
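
To make the effect of the Field selector configuration easier to picture, here is a minimal Python sketch of the select, rename, and distinct operations described in the procedure. It is not how the processor is implemented: the select_fields function, the sample records, and the example.com URLs are hypothetical, and the sketch assumes that Distinct compares the selected output fields of each record as a whole. Only the field names (name, yearOfBirth, yearOfDeath, full_name, year_of_birth, year_of_death) come from the procedure.

```python
# Input field -> output field, as entered in the Select fields window.
FIELD_MAPPING = {
    "name": "full_name",
    "yearOfBirth": "year_of_birth",
    "yearOfDeath": "year_of_death",
}

# Hypothetical sample records; the real dataset has more fields and rows.
records = [
    {"name": "Doe, Jane", "yearOfBirth": 1900, "yearOfDeath": 1980,
     "url": "https://example.com/jane-1"},
    {"name": "Doe, Jane", "yearOfBirth": 1900, "yearOfDeath": 1980,
     "url": "https://example.com/jane-2"},
    {"name": "Roe, Richard", "yearOfBirth": 1920, "yearOfDeath": None,
     "url": "https://example.com/richard"},
]


def select_fields(rows, mapping, distinct=False):
    """Keep only the mapped fields, rename them, and optionally drop duplicates.

    Assumption: two records are duplicates when all of their selected
    output values are equal.
    """
    seen = set()
    result = []
    for row in rows:
        # Select and rename in one pass; unmapped fields (such as url) are dropped.
        selected = {out_name: row.get(in_name) for in_name, out_name in mapping.items()}
        if distinct:
            key = tuple(selected.values())
            if key in seen:
                continue  # an identical record was already emitted
            seen.add(key)
        result.append(selected)
    return result


deduplicated = select_fields(records, FIELD_MAPPING, distinct=True)
for row in deduplicated:
    print(row)
```

In this sketch, the two "Doe, Jane" records differ only in a field that is not selected, so they collapse into a single output record once Distinct is enabled. With distinct=False, all three records would be kept.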
