When using Talend Data Preparation in a big data context,
you can access data stored on HDFS (Hadoop Distributed File System).
In this example, you work for a worldwide online video streaming company. You will retrieve customer information stored on a cluster, create a dataset in Talend Data Preparation, apply various preparation steps to cleanse and enrich the data, and then export it back to the cluster in a new format.
Through the Components Catalog service, the data is not physically stored on the Talend Data Preparation server, but rather fetched on demand from the cluster. Only a sample is retrieved and displayed in the Talend Data Preparation interface for you to work on.
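The sampling idea can be pictured with a minimal sketch. This is not Talend's actual API; the `fetch_sample` helper and the simulated record source are hypothetical, and stand in for the Components Catalog pulling only the first records of a large file rather than copying it in full:

```python
from itertools import islice

def fetch_sample(records, sample_size=100):
    """Hypothetical helper: pull only the first sample_size records
    from a (possibly huge) remote source, leaving the rest untouched."""
    return list(islice(records, sample_size))

# Simulate a large customer file sitting on the cluster.
cluster_records = (f"customer_{i},US,premium" for i in range(1_000_000))

# Only a small sample travels to the interface for you to work on.
sample = fetch_sample(cluster_records, sample_size=50)
print(len(sample))
```

Because the source is consumed lazily, the full dataset is never materialized on the Talend Data Preparation server; the preparation you build on the sample is later applied to the complete data on the cluster.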
To use Talend Data Preparation in a big data context, you must fulfill the following prerequisites:
- The Components Catalog service is installed and running on a Windows or Linux machine.
- The Spark Job Server is installed and running on a Linux machine.
- The Streams Runner is installed and running on a Linux machine.