Talend Data Preparation concepts

These definitions will help you understand the main concepts in Talend Data Preparation.
  • Connection: Connections are environments or systems where datasets are stored, including databases, file systems, distributed systems or platforms, etc. The connection information for these systems only needs to be set up once, since connections are reusable.
  • Dataset: A dataset holds the raw data that can be used as the raw material for one or more preparations. It is presented as a table on which you can apply recipe steps without affecting the original data. A dataset can be reused across preparations.
  • Sample: Your data will be visible in the form of a sample, retrieved from the dataset metadata.
  • Preparation: A preparation is what links a dataset and a recipe together: it is the final outcome that you want to achieve with your data. You can export this outcome as a file or connect it to data targets. A preparation takes one dataset and applies a recipe to produce an outcome. The original dataset is never modified.
  • Recipe: A recipe is literally defined as "a set of directions with a list of ingredients for making or preparing something". In Talend Cloud Data Preparation, the ingredients are the raw data, called datasets, and the directions are the set of functions applied to the dataset. Visually, the recipe is the top-down sequence of functions in the left collapsible panel. A recipe is linked to the dataset through a preparation. Every update to the recipe is automatically saved in the preparation.
  • Function: A function is an action applied to a row, a column, or the whole dataset, such as removing empty rows. Because functions are applied as part of a preparation, they do not modify the original data. Applied functions are recorded, in sequence, in recipes.
  • Semantic type: The semantic type of a column or record corresponds to the type of data that can be found in it, such as names, zip codes, phone numbers, coordinates, etc. The Talend Cloud applications all benefit from semantic awareness, meaning that when you look at your sample data, it will be automatically categorized using the default semantic types, or the ones that you have created yourself.
  • Cloud Engine for Design: The Cloud Engine for Design is a built-in runner that allows users to easily process data without having to set up any processing engines. With this engine, you can run two objects in parallel. For advanced data processing, it is recommended to install the secure Remote Engine Gen2.
  • Remote Engine Gen2: A Remote Engine Gen2 is a secure execution engine on which you can safely run objects. It allows you to have control over your execution environment and resources as you are able to create and configure the engine in your own environment (Virtual Private Cloud or on premises).

    A Remote Engine Gen2 ensures:

    • Data processing in a safe and secure environment as Talend never has access to your data and resources.
    • Optimal performance and security by increasing data locality instead of moving large amounts of data to the computation.
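
The way datasets, functions, recipes, and preparations fit together can be illustrated with a small conceptual sketch. This is not Talend's API: the class names (`Dataset`, `Recipe`, `Preparation`), the helper functions, and the sample data are all hypothetical and exist only to show the idea that a preparation applies an ordered recipe of functions to a copy of the dataset, leaving the original untouched.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Tuple

Row = Dict[str, str]
# A "function" in the glossary sense: takes rows, returns transformed rows.
Function = Callable[[List[Row]], List[Row]]


@dataclass(frozen=True)
class Dataset:
    """Hypothetical stand-in for a dataset: raw rows that are never modified."""
    name: str
    rows: Tuple[Row, ...]


@dataclass
class Recipe:
    """Hypothetical stand-in for a recipe: functions recorded in sequence."""
    steps: List[Function] = field(default_factory=list)

    def add(self, step: Function) -> "Recipe":
        self.steps.append(step)
        return self


@dataclass
class Preparation:
    """Links one dataset to one recipe and produces an outcome."""
    dataset: Dataset
    recipe: Recipe

    def run(self) -> List[Row]:
        # Work on copies so the original dataset is left as it was.
        rows = [dict(row) for row in self.dataset.rows]
        for step in self.recipe.steps:
            rows = step(rows)
        return rows


def remove_empty_rows(rows: List[Row]) -> List[Row]:
    """Drop rows in which every value is blank."""
    return [row for row in rows if any(value.strip() for value in row.values())]


def uppercase_column(column: str) -> Function:
    """Build a function that upper-cases one column."""
    def step(rows: List[Row]) -> List[Row]:
        return [{**row, column: row[column].upper()} for row in rows]
    return step


# Made-up sample data for illustration only.
customers = Dataset(
    name="customers",
    rows=(
        {"name": "ada", "city": "paris"},
        {"name": "", "city": ""},
        {"name": "grace", "city": "nyc"},
    ),
)

recipe = Recipe().add(remove_empty_rows).add(uppercase_column("name"))
prep = Preparation(dataset=customers, recipe=recipe)
result = prep.run()
```

In this sketch, `result` holds the outcome of the preparation, while `customers.rows` still contains all three original rows, including the empty one. The same `customers` dataset could be passed to a second `Preparation` with a different recipe, mirroring how a dataset can be reused across preparations.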

Relationship between connections, datasets, and preparations:

[Figure: diagram illustrating the relationship between connections, datasets, and preparations]
