Glossary

Talend Real-time Big Data Platform Getting Started Guide

To work with Talend Studio and understand how it operates, it is important to know some basic vocabulary.

component

A component is an executable part of a Job or Route used to connect to an external source or perform a specific data integration operation, no matter what data sources you are integrating: databases, applications, flat files, Web services, etc. A component can minimize the amount of hand-coding required to work on data from multiple, heterogeneous sources.

Components are grouped in families according to their usage and displayed in the Palette of the Integration perspective of Talend Studio.

For detailed information about component types and what they can be used for, see the Talend Components Reference Guide.

item

An item is the fundamental technical unit in a project. Items are grouped, according to their types, as: Job Design, Business model, Context, Code, Metadata, etc. One item can include other items. For example, the business models and the Jobs you design are items, and the metadata and routines you use inside your Jobs are items as well.

Job

A Job is a graphical design, of one or more components connected together, that allows you to set up and run dataflow management processes. It translates business needs into code, routines and programs. Jobs address all of the different sources and targets that you need for data integration processes and all other related processes.

Joblet

A Joblet is a specific component that replaces Job component groups. It factors out recurrent processing or complex transformation steps to make a complex Job easier to read. Joblets can be reused in different Jobs or several times in the same Job.

metadata

Metadata is information that describes the characteristics of any data object, such as its name, type, location, author, date created, size, and so on, together with relationships with other data objects that the enterprise has to manage or that an IT tool may generate. Metadata can be created manually or automatically by a system.

project

Projects are structured collections of items and their associated metadata. All of the Jobs and business models you design are organized in Projects.

repository

A repository is the storage location Talend Studio uses to gather data related to all of the technical items that you use either to describe business models or to design Jobs.

Talend Studio can connect to as many local or remote repositories as needed.

Route

A Camel Route is a graphical design, based on the Apache Camel framework, of two or more components connected together that allows you to set up and run routing and mediation rules. A routing rule defines how messages are moved from one service (or endpoint) to another.

Service

A Service is a graphical design, of several WSDL objects (service, binding, port type and so on) linked together, that allows you to set up and implement Web services. A Service is associated with one or more data service Jobs as the service provider and can be consumed by consumer Jobs.

service Job

A data service Job is a graphical design, of one or more components connected together, that allows you to set up and run data service processes. It translates business needs into code, routines and programs. Jobs address all of the different sources and targets that you need for data integration processes and combine them with Web services.

Note

Data service Jobs will simply be referred to as Jobs in the following documentation.

workspace

A workspace is the directory where you store all your project folders. You need to have one workspace directory per connection (repository connection). Talend Studio enables you to connect to different workspace directories, if you do not want to use the default one.

Terms in Talend Big Data

Big Data Batch Job

A Big Data Batch Job can be a Talend MapReduce Job or a Talend Spark Job depending on the framework you choose to use when you are creating a Job.

This type of Job is available only if you have subscribed to one of the Talend solutions with Big Data.

Big Data Streaming Job

A Big Data Streaming Job can be a Talend Spark Streaming Job or a Talend Storm Job depending on the framework you choose to use when you are creating a Job.

This type of Job is available only if you have subscribed to Talend Real-time Big Data Platform or Talend Data Fabric.

Standard Job

A Talend Standard Job is a traditional Talend data integration Job that runs a classic ETL or ELT process.

Hadoop cluster metadata

Hadoop cluster metadata is information that describes the characteristics of the connection to a given Hadoop cluster.

Spark Job

A Spark Job is a Talend Job that runs on top of Spark to create and process RDDs.

MapReduce Job

A MapReduce Job is a Talend Job that runs on top of the MapReduce framework. The number of mappers and reducers that a MapReduce Job generates depends on how you design this MapReduce Job.

Terms in Talend Data Quality

advanced statistics

Indicators which determine the most probable and the most frequent values and build frequency tables.
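As a rough illustration of what a frequency table contains, the following Python sketch counts how often each value occurs and lists the values from most to least frequent. The function name frequency_table and the sample data are hypothetical and are not part of Talend Studio.

```python
from collections import Counter

def frequency_table(values):
    """Return (value, count, share) rows, most frequent value first."""
    # Null values are left out of the table in this sketch.
    counts = Counter(v for v in values if v is not None)
    total = sum(counts.values())
    return [(value, n, n / total) for value, n in counts.most_common()]

# Hypothetical sample column
countries = ["FR", "US", "FR", "DE", "FR", "US", None]
table = frequency_table(countries)
most_frequent_value = table[0][0]
```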

Benford Law Frequency

An indicator based on examining the actual frequency of the leading digits 1 through 9 in numerical data. It is usually used as an indicator of accounting and expense fraud in lists or tables.
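The idea behind this indicator can be sketched in Python as follows. Under Benford's law, the expected share of values whose first significant digit is d equals log10(1 + 1/d), so roughly 30% of values are expected to start with 1. The function names and the sample amounts below are hypothetical; this is a minimal sketch of the principle, not the Studio's implementation.

```python
import math
from collections import Counter

def leading_digit(value):
    """Return the first non-zero digit of a number, or None if there is none."""
    for ch in str(abs(value)):
        if ch in "123456789":
            return int(ch)
    return None

def benford_frequencies(values):
    """Map each digit 1-9 to (observed share, share expected by Benford's law)."""
    digits = [d for d in map(leading_digit, values) if d is not None]
    counts = Counter(digits)
    total = len(digits)
    result = {}
    for d in range(1, 10):
        observed = counts[d] / total if total else 0.0
        expected = math.log10(1 + 1 / d)
        result[d] = (observed, expected)
    return result

# Hypothetical expense amounts
amounts = [123.5, 187, 1020, 245, 19.99, 310, 1750, 96, 1.2, 2400]
table = benford_frequencies(amounts)
```

A large gap between the observed and expected shares for several digits is a hint that the data deserves a closer look; it is not proof of fraud on its own.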

data profiling

The process of examining the data available in different data sources and collecting statistics and information about this data. Data profiling helps to assess the quality level of the data according to a defined goal.

Data Quality Portal

A web-based platform that shares the results of the analyses and further exploits them.

It provides advanced reporting and allows you to compare current and historical statistics to determine the improvement or degradation of your data.

indicators

Results achieved through the implementation of complex analyses about data matching and other data-related operations.

They fall into two categories: "system indicators" and "user defined indicators".

patterns

Sets of strings against which you can define the content, structure and quality of highly complex data.

They fall into two categories: "regular expressions" and "SQL patterns".

pattern frequency statistics

Indicators which determine the most and least frequent patterns in a data set.
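One way to picture a pattern frequency table is to reduce each value to its "shape" and then count the shapes. The Python sketch below assumes a convention in which uppercase letters become A, lowercase letters become a, and digits become 9, while every other character is kept as is. Both this convention and the function names are assumptions made for illustration.

```python
from collections import Counter

def to_pattern(value):
    """Replace letters and digits with placeholder characters."""
    out = []
    for ch in value:
        if ch.isdigit():
            out.append("9")
        elif ch.isalpha():
            out.append("A" if ch.isupper() else "a")
        else:
            out.append(ch)  # punctuation and spaces are kept unchanged
    return "".join(out)

def pattern_frequencies(values):
    """Count how many values share each pattern."""
    return Counter(to_pattern(v) for v in values)

# Hypothetical product codes
codes = ["AB-123", "CD-456", "ef-789", "GH-12"]
frequencies = pattern_frequencies(codes)
most_frequent_pattern = frequencies.most_common(1)[0]
```

In this sample, the rare patterns (a lowercase prefix, a two-digit suffix) are the ones most likely to point to data entry errors.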

phone number statistics

Indicators which count phone numbers. They return the count for each phone number format. They validate the phone formats using the org.talend.libraries.google.libphonumber library.

regular expressions (regex)

Predefined patterns that you can use to search and manipulate data in databases.
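For illustration, the Python sketch below checks a list of values against a regular expression and separates matching from non-matching values. The email pattern is a deliberately simplified, hypothetical example and is not one of the patterns shipped with the Studio.

```python
import re

# Simplified, hypothetical email pattern: local part, "@", then a domain
# made of at least two dot-separated labels.
EMAIL_PATTERN = re.compile(r"[\w.+-]+@[\w-]+(\.[\w-]+)+")

def split_by_pattern(values, pattern):
    """Return (matching, non_matching) lists for a compiled pattern."""
    matching, non_matching = [], []
    for value in values:
        # fullmatch requires the whole value to fit the pattern
        if pattern.fullmatch(value):
            matching.append(value)
        else:
            non_matching.append(value)
    return matching, non_matching

# Hypothetical sample column
emails = ["ann@example.com", "bob@example", "not-an-email"]
valid, invalid = split_by_pattern(emails, EMAIL_PATTERN)
```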

report

A document you can generate on one or more analyses from the Profiling perspective of the Studio to provide the statistics collected by the analyses. You can generate reports in different formats.

simple statistics

Indicators which provide simple statistics on the number of records falling in certain categories including the number of rows, the number of null values, the number of distinct and unique values, the number of duplicates, or the number of blank fields.
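These counts can be sketched in a few lines of Python. The sketch assumes that a distinct value is any value present at least once, a unique value occurs exactly once, and a duplicate value occurs more than once, so that unique plus duplicate equals distinct. These definitions, the function name, and the sample column are assumptions made for illustration.

```python
from collections import Counter

def simple_statistics(column):
    """Compute basic counts for one column given as a list of values."""
    non_null = [v for v in column if v is not None]
    counts = Counter(non_null)
    return {
        "row_count": len(column),
        "null_count": len(column) - len(non_null),
        # A blank field is a string that is empty or holds only whitespace.
        "blank_count": sum(
            1 for v in non_null if isinstance(v, str) and v.strip() == ""
        ),
        "distinct_count": len(counts),
        "unique_count": sum(1 for n in counts.values() if n == 1),
        "duplicate_count": sum(1 for n in counts.values() if n > 1),
    }

# Hypothetical sample column
cities = ["Paris", "paris", "Paris", None, "", "Lyon", "  "]
stats = simple_statistics(cities)
```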

soundex frequency statistics

Indicators which use the Soundex algorithm built into the DBMS. They index records by sounds. This way, records with the same pronunciation (only English pronunciation) are encoded to the same representation so that they can be matched despite minor differences in spelling.
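To show how such an encoding works, here is a minimal Python sketch of the classic American Soundex rules: keep the first letter, encode the following consonants as digits, collapse adjacent identical codes, and pad the result to four characters. The Soundex function of a given DBMS may differ in details, so treat this as an illustration rather than a reproduction of any particular database's behavior.

```python
# Consonant groups and their Soundex digits
CODES = {}
for letters, digit in (
    ("BFPV", "1"), ("CGJKQSXZ", "2"), ("DT", "3"),
    ("L", "4"), ("MN", "5"), ("R", "6"),
):
    for letter in letters:
        CODES[letter] = digit

def soundex(name):
    """Return the four-character Soundex code of a name."""
    letters = [c for c in name.upper() if c.isalpha()]
    if not letters:
        return ""
    encoded = letters[0]
    previous = CODES.get(letters[0], "")
    for ch in letters[1:]:
        if ch in "HW":
            continue  # H and W do not separate two identical codes
        code = CODES.get(ch, "")  # vowels and Y have no code
        if code and code != previous:
            encoded += code
        previous = code  # a vowel resets the comparison
    return (encoded + "000")[:4]

same_code = soundex("Smith") == soundex("Smyth")
```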

SQL patterns

Personalized patterns which you can use in SQL queries. These patterns usually contain the percent sign (%).
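As an illustration, the Python sketch below runs a LIKE query against an in-memory SQLite database to count the values that match a pattern containing the percent sign. The table, column, and pattern are hypothetical, and SQLite is used here only because it ships with Python; LIKE semantics such as case sensitivity vary between database systems.

```python
import sqlite3

# Hypothetical table with a single text column
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customers (email TEXT)")
conn.executemany(
    "INSERT INTO customers VALUES (?)",
    [("ann@example.com",), ("bob@example.org",), ("not-an-email",)],
)

# "%" matches any sequence of characters, including an empty one
sql_pattern = "%@%.com"
matching = conn.execute(
    "SELECT COUNT(*) FROM customers WHERE email LIKE ?", (sql_pattern,)
).fetchone()[0]
total = conn.execute("SELECT COUNT(*) FROM customers").fetchone()[0]
not_matching = total - matching
conn.close()
```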

summary statistics

Indicators which perform statistical analyses on numeric data, including the computation of location measures such as the median and the average, and the computation of statistical dispersions such as the interquartile range and the range.
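The measures named above can be computed with the Python standard library, as in the sketch below. The function name and sample values are hypothetical. Note that several conventions exist for computing quartiles; this sketch uses the "inclusive" method of statistics.quantiles, and a database or the Studio may use another one and return slightly different quartile values.

```python
import statistics

def summary_statistics(values):
    """Compute location and dispersion measures for numeric values."""
    # quantiles(n=4) returns the three cut points Q1, Q2 (median), Q3
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    return {
        "mean": statistics.mean(values),
        "median": statistics.median(values),
        "range": max(values) - min(values),
        "iqr": q3 - q1,  # interquartile range
    }

# Hypothetical sample column
quantities = [2, 4, 4, 5, 7, 9, 11]
summary = summary_statistics(quantities)
```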

text statistics

Indicators which analyze the characteristics of textual fields in the columns, including minimum, maximum and average length.
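A minimal Python sketch of these length measures follows. It assumes that null values are skipped and that length is counted in characters; the function name and sample data are hypothetical.

```python
def text_statistics(column):
    """Return minimum, maximum and average length of the non-null values."""
    lengths = [len(v) for v in column if v is not None]
    if not lengths:
        return None  # nothing to measure
    return {
        "min_length": min(lengths),
        "max_length": max(lengths),
        "average_length": sum(lengths) / len(lengths),
    }

# Hypothetical sample column
first_names = ["Anna", "Bob", None, "Charlotte"]
lengths_summary = text_statistics(first_names)
```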