Talend Data Preparation concepts - 7.3

Talend Data Preparation User Guide

Talend Documentation Team
Talend Big Data
Talend Big Data Platform
Talend Data Fabric
Talend Data Integration
Talend Data Management Platform
Talend Data Services Platform
Talend ESB
Talend MDM Platform
Talend Real-Time Big Data Platform
Talend Data Preparation
These definitions will help you understand the main concepts in Talend Data Preparation.
  • Dataset: A dataset holds the raw data that serves as the starting material for one or more preparations. It is presented as a table on which you can apply recipe steps without affecting the original data. A dataset can be reused across preparations.
  • Preparation: A preparation is what links a dataset and a recipe together: it is the final outcome that you want to achieve with your data. You can export this outcome as a file or connect it to data targets. A preparation takes one dataset and applies a recipe to produce an outcome. The original dataset is never modified.
  • Recipe: A recipe is literally defined as "a set of directions with a list of ingredients for making or preparing something". In Talend Data Preparation, the ingredients are the raw data, called datasets, and the directions are the set of functions applied to the dataset. Visually, the recipe is the top-down sequence of functions in the left collapsible panel. A recipe is linked to the dataset through a preparation. Every update to the recipe is automatically saved in the preparation.
  • Function: A function is an action applied to a row, a column, or the whole dataset, such as removing empty rows. Because functions are applied as part of a preparation, they do not modify the original data. Applied functions are recorded, in sequence, in the recipe.
  • Semantic type: The semantic type of a column or record corresponds to the type of data that can be found in it, such as names, zip codes, phone numbers, or coordinates. All Talend applications benefit from semantic awareness: when you look at your sample data, it is automatically categorized using the default semantic types or the ones that you have created yourself.
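
The relationship between these concepts can be sketched in a few lines of Python. This is only an illustrative model, not Talend Data Preparation's API or internal implementation: the class names `Dataset` and `Preparation`, the helper functions `remove_empty_rows` and `uppercase_column`, and the sample rows are all hypothetical. It shows a preparation linking one dataset to a recipe, a recipe as an ordered list of functions, and the original dataset staying unmodified when the recipe runs.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Tuple

Row = Dict[str, str]
# A "function" in this sketch takes a list of rows and returns a new list.
Function = Callable[[List[Row]], List[Row]]


@dataclass(frozen=True)
class Dataset:
    """Raw data (hypothetical model); never changed by a preparation."""
    name: str
    rows: Tuple[Row, ...]


def remove_empty_rows(rows: List[Row]) -> List[Row]:
    """Dataset-level function: drop rows whose values are all blank."""
    return [r for r in rows if any(v.strip() for v in r.values())]


def uppercase_column(column: str) -> Function:
    """Column-level function: build a step that uppercases one column."""
    def step(rows: List[Row]) -> List[Row]:
        return [{**r, column: r[column].upper()} for r in rows]
    return step


@dataclass
class Preparation:
    """Links one dataset to a recipe (an ordered list of functions)."""
    dataset: Dataset
    recipe: List[Function] = field(default_factory=list)

    def add_step(self, fn: Function) -> None:
        # Each applied function is recorded, in sequence, in the recipe.
        self.recipe.append(fn)

    def run(self) -> List[Row]:
        # Work on copies so the original dataset is left untouched.
        rows = [dict(r) for r in self.dataset.rows]
        for fn in self.recipe:
            rows = fn(rows)
        return rows


# Hypothetical sample data.
customers = Dataset(
    "customers",
    (
        {"name": "ada", "zip": "75001"},
        {"name": "", "zip": ""},
        {"name": "grace", "zip": "10001"},
    ),
)

prep = Preparation(customers)
prep.add_step(remove_empty_rows)
prep.add_step(uppercase_column("name"))
result = prep.run()

# The same dataset can back a second, independent preparation.
other_prep = Preparation(customers)
other_prep.add_step(remove_empty_rows)
other_result = other_prep.run()
```

In this sketch the outcome of a preparation is computed from the dataset and the recipe each time `run()` is called, which is one simple way to guarantee that the raw data is never altered and that several preparations can share a dataset.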