Hashing fields to compare data safely - Cloud

Talend Cloud Pipeline Designer Processors Guide

Version: Cloud
Language: English
Product: Talend Cloud
Module: Talend Pipeline Designer
Content: Design and Development > Designing Pipelines
Last publication date: 2024-02-26

A pipeline with an S3 source, a Data hashing processor, a Field selector processor, and an S3 destination.

Before you begin

  • You have previously created a connection to the system storing your source data.

    Here, an Amazon S3 connection.

  • You have previously added the dataset holding your source data.

    Download the file: string-crops.csv. It contains a dataset with data about harvested crops in Mali with crop types, value of production, harvested areas, etc.

  • You also have created the connection and the related dataset that will hold the processed data.

    Here, a dataset stored in the same S3 bucket.

Procedure

  1. Click Add pipeline on the Pipelines page. Your new pipeline opens.
  2. Give the pipeline a meaningful name.

    Example

    Hash fields to compare data safely
  3. Click ADD SOURCE to open the panel allowing you to select your source data, here data about harvested crops in Mali in the year 2005.

  4. Select your dataset and click Select in order to add it to the pipeline.
    Rename it if needed.
  5. Click Plus and add a Data hashing processor to the pipeline. The configuration panel opens.
  6. Give a meaningful name to the processor.

    Example

    hash fields
  7. In the Configuration area:
    1. Select Hash data in the Function name list.
    2. Click the Open dialog icon next to the Fields to process list to select all fields, as you want to hash all values at once.
      The Data hashing dialog showing the fields to process.
  8. Click Save to save your configuration.

    Look at the preview of the processor to compare your data before and after the operation.

    All fields are now hashed and secured. The crop and id fields show the same output value, which means that the original value is the same in both fields.

    Preview of the Data hashing processor after hashing the crop and ID records.
  9. Click Plus and add a Field selector processor to the pipeline. The configuration panel opens.
  10. Give a meaningful name to the processor.

    Example

    merge identical hashed values
  11. In the Selectors area:
    1. Select .crop in the Input list and enter crop_id in the Output list, because the .crop and .id fields are identical and you want to merge them into a single field.
    2. Click the + sign to add a new element and select .crop_parent in the Input list and enter crop_type in the Output list, as you want to keep this field and rename it.
    3. Click the + sign to add a new element and select .harvested_area in the Input list and enter harvested_area in the Output list, as you want to keep this field in the output.
    4. Click the + sign to add a new element and select .value_of_production in the Input list and enter production_value in the Output list, as you want to keep this field and rename it.
  12. Click Save to save your configuration.

    Look at the preview of the processor to compare your data before and after the operation.

    Preview of the Field selector processor after renaming and reorganizing the crop records.
  13. Click ADD DESTINATION and select the dataset that will hold your processed data.
    Rename it if needed.
  14. On the top toolbar of Talend Cloud Pipeline Designer, click the Run button to open the panel allowing you to select your run profile.
  15. Select your run profile in the list (for more information, see Run profiles), then click Run to run your pipeline.
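
The reason hashed fields can still be compared, as in the preview of the Data hashing processor, is that a hash function is deterministic: identical inputs always produce identical digests. The following Python sketch illustrates the idea outside of Talend Cloud Pipeline Designer. It assumes SHA-256, because this page does not state which algorithm the Hash data function uses, and the sample row and the helper names hash_value and hash_record are hypothetical, not taken from string-crops.csv or from the product.

```python
import hashlib


def hash_value(value: str) -> str:
    """Return the hexadecimal SHA-256 digest of a string.

    SHA-256 is an assumption made for this sketch only.
    """
    return hashlib.sha256(value.encode("utf-8")).hexdigest()


def hash_record(record: dict) -> dict:
    """Hash every field value of a record, keeping the field names."""
    return {field: hash_value(str(value)) for field, value in record.items()}


# Hypothetical sample row: crop and id hold the same original value.
record = {
    "crop": "Maize",
    "id": "Maize",
    "crop_parent": "Cereals",
    "harvested_area": "12345",
    "value_of_production": "67890",
}

hashed = hash_record(record)

# Equal digests reveal equal original values without exposing them.
print(hashed["crop"] == hashed["id"])           # True for this sample row
print(hashed["crop"] == hashed["crop_parent"])  # False for this sample row
```

The comparison only works because no random salt is mixed into each value. A salted hash would hide the fact that two fields hold the same value.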

Results

Your pipeline is being executed. The data is hashed, the identical fields are merged, the remaining fields are renamed and reorganized according to the selectors you defined, and the output is sent to the target system you indicated.
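
As an illustration of the shape of the output records, the following Python sketch applies the four Input and Output pairs from the Selectors area to one record. It is not how the Field selector processor is implemented. The select_fields helper is hypothetical, and the field values are stand-ins for hashed values rather than real digests.

```python
def select_fields(record: dict, selectors: list) -> dict:
    """Keep only the selected input fields and rename them.

    Each selector is an (input_path, output_name) pair. The leading dot
    of the input path is removed to get the field name.
    """
    return {
        output_name: record[input_path.lstrip(".")]
        for input_path, output_name in selectors
    }


# Input and Output pairs as entered in the Selectors area.
SELECTORS = [
    (".crop", "crop_id"),
    (".crop_parent", "crop_type"),
    (".harvested_area", "harvested_area"),
    (".value_of_production", "production_value"),
]

# Hypothetical hashed record; the values are placeholders, not real digests.
hashed_record = {
    "crop": "h_maize",
    "id": "h_maize",
    "crop_parent": "h_cereals",
    "harvested_area": "h_area",
    "value_of_production": "h_value",
}

output_record = select_fields(hashed_record, SELECTORS)
print(output_record)
```

Because .id is not listed as an input, it is dropped from the output, which is how the two identical fields end up as the single crop_id field.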