Before you begin
- You have previously created a connection to the system storing your source data. Here, an Amazon S3 connection.
- You have previously added the dataset holding your source data. Download the file: string-crops.csv. It contains data about crops harvested in Mali, including crop types, value of production, harvested areas, etc.
- You have also created the connection and the related dataset that will hold the processed data. Here, a dataset stored in the same S3 bucket.
Procedure
- Click Add pipeline on the Pipelines page. Your new pipeline opens.
- Give the pipeline a meaningful name.
Example
Hash fields to compare data safely
- Click ADD SOURCE to open the panel allowing you to select your source data, here data about harvested crops in Mali in the year 2005.
- Select your dataset and click Select to add it to the pipeline. Rename it if needed.
- Click and add a Data hashing processor to the pipeline. The configuration panel opens.
- Give a meaningful name to the processor.
Example
hash fields
- In the Configuration area:
- Select Hash data in the Function name list.
- Click the icon next to the Fields to process list to select all fields, as you want to hash all values at once.
- Click Save to save your configuration.
Look at the preview of the processor to compare your data before and after the operation.
All fields are now hashed and secured, and you can see that the crop and id fields have the same output value, which means the original value is the same in both fields.
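The behavior of the Data hashing processor can be illustrated with a short standalone sketch. This is a hypothetical illustration, not the processor's actual implementation (the documentation does not state which hash function Talend uses); SHA-256 stands in for it here. The key property is that identical input values always produce identical digests, which is why matching hashes in the crop and id fields prove the original values match.

```python
import hashlib

def hash_value(value: str) -> str:
    """Return a hex digest for one field value (SHA-256 as a stand-in)."""
    return hashlib.sha256(value.encode("utf-8")).hexdigest()

def hash_record(record: dict) -> dict:
    """Hash every field of a record, mimicking the 'select all fields' option."""
    return {field: hash_value(value) for field, value in record.items()}

# Hypothetical record: two fields holding the same original value
# produce the same hash, so they can be compared without exposing raw data.
record = {"crop": "Maize", "id": "Maize", "crop_parent": "Cereals"}
hashed = hash_record(record)
print(hashed["crop"] == hashed["id"])  # identical originals -> identical hashes
```

Because the raw values never appear in the output, the hashed fields can be compared or joined safely downstream.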
- Click and add a Field selector processor to the pipeline. The configuration panel opens.
- Give a meaningful name to the processor.
Example
merge identical hashed values
- In the Selectors area:
- Select .crop in the Input list and enter crop_id in the Output list, as you know both the .crop and .id fields are identical and you want to merge the two fields.
- Click the + sign to add a new element and select .crop_parent in the Input list and enter crop_type in the Output list, as you want to keep this field and rename it.
- Click the + sign to add a new element and select .harvested_area in the Input list and enter harvested_area in the Output list, as you want to keep this field in the output.
- Click the + sign to add a new element and select .value_of_production in the Input list and enter production_value in the Output list, as you want to keep this field and rename it.
- Click Save to save your configuration.
Look at the preview of the processor to compare your data before and after the operation.
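Conceptually, the Field selector applies an input-to-output mapping and drops every field that is not listed. A minimal sketch of the four selectors above (the field names come from this procedure; the code itself is illustrative, not Talend's implementation, and the hash values are placeholders):

```python
# Each selector maps an input field to an output field name.
# Fields not listed (here, the .id duplicate) are dropped from the output.
SELECTORS = {
    "crop": "crop_id",
    "crop_parent": "crop_type",
    "harvested_area": "harvested_area",
    "value_of_production": "production_value",
}

def select_fields(record: dict) -> dict:
    """Keep and rename only the mapped fields, in selector order."""
    return {out: record[src] for src, out in SELECTORS.items()}

# Placeholder hashed values: .crop and .id are identical after hashing,
# so keeping .crop as crop_id effectively merges the two fields.
record = {
    "crop": "hash_a", "id": "hash_a",
    "crop_parent": "hash_b", "harvested_area": "hash_c",
    "value_of_production": "hash_d",
}
print(select_fields(record))
```

Keeping only one of the two identical hashed fields is what "merging" means here: no information is lost, since both columns carried the same value.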
- Click ADD DESTINATION and select the dataset that will hold your processed data. Rename it if needed.
- On the top toolbar of Talend Cloud Pipeline Designer, click the Run button to open the panel allowing you to select your run profile.
- Select your run profile in the list (for more information, see Run profiles), then click Run to run your pipeline.
Results
Your pipeline is being executed. The data is hashed, the identical fields are merged, the remaining fields are renamed and reorganized according to the selectors you defined, and the output is sent to the target system you indicated.
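End to end, the pipeline amounts to a hash-then-select transformation applied row by row. The following self-contained sketch chains the two steps over a couple of hypothetical rows standing in for string-crops.csv (SHA-256 again stands in for the unspecified hash function; the column names are the ones used in this procedure):

```python
import csv
import hashlib
import io

# Output mapping used by the Field selector step of this procedure.
SELECTORS = {
    "crop": "crop_id",
    "crop_parent": "crop_type",
    "harvested_area": "harvested_area",
    "value_of_production": "production_value",
}

def transform(row: dict) -> dict:
    """Hash every field, then keep and rename only the selected fields."""
    hashed = {f: hashlib.sha256(v.encode("utf-8")).hexdigest() for f, v in row.items()}
    return {out: hashed[src] for src, out in SELECTORS.items()}

# Hypothetical sample rows; the real dataset is string-crops.csv.
source = io.StringIO(
    "crop,id,crop_parent,harvested_area,value_of_production\n"
    "Maize,Maize,Cereals,100,2000\n"
    "Rice,Rice,Cereals,50,1500\n"
)
rows = [transform(row) for row in csv.DictReader(source)]
print(list(rows[0].keys()))
```

In the real pipeline, reading the source and writing the result are handled by the S3 source and destination datasets rather than by code.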