Preparing an HDFS-based dataset

When using Talend Data Preparation in a big data context, you can access data stored on HDFS (Hadoop Distributed File System).

In this example, you work for a worldwide online video streaming company. You will retrieve some customer information stored on a cluster, create a dataset in Talend Data Preparation, apply various preparation steps to cleanse and enrich this data, and then export it back to the cluster in a new format.

Through the use of the Components Catalog service, the data is not physically stored on the Talend Data Preparation server, but rather fetched on demand from the cluster. Only a sample is retrieved and displayed in the Talend Data Preparation interface for you to work on.
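To illustrate the idea of working on a sample rather than the full file, here is a minimal sketch that reads only the first rows of a file stored on HDFS. It assumes a pyarrow installation with HDFS support (libhdfs available); the NameNode host, port, and file path are hypothetical placeholders, not values from this documentation.

```python
from pyarrow import fs

# Hypothetical connection details -- replace with your cluster's NameNode
# host/port and the actual path of the customer data file.
hdfs = fs.HadoopFileSystem(host="namenode.example.com", port=8020)

SAMPLE_SIZE = 10000  # number of rows kept for the working sample

with hdfs.open_input_stream("/data/customers/customers.csv") as stream:
    # Read only a bounded chunk of bytes; the full dataset never leaves the cluster.
    chunk = stream.read(1024 * 1024).decode("utf-8", errors="replace")
    sample_rows = chunk.splitlines()[:SAMPLE_SIZE]

# Preview the first few rows of the sample.
for row in sample_rows[:5]:
    print(row)
```

This mirrors the on-demand model described above: the sample is small enough to inspect interactively, while the source of truth stays on the cluster.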

To use Talend Data Preparation in a big data context, you must fulfill the following prerequisites (a sketch for checking them follows the list):

  • The Components Catalog service is installed and running on a Windows or Linux machine.
  • The Spark Job Server is installed and running on a Linux machine.
  • The Streams Runner is installed and running on a Linux machine.
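As a quick sanity check before starting, you can verify that each of these services is reachable. The sketch below simply tests whether a TCP connection succeeds; the hostnames and ports are hypothetical placeholders, so substitute the machines and ports your services actually use.

```python
import socket

# Hypothetical host/port pairs -- substitute the machines and ports where
# your Components Catalog, Spark Job Server, and Streams Runner actually run.
SERVICES = {
    "Components Catalog": ("catalog.example.com", 8989),
    "Spark Job Server": ("sjs.example.com", 8090),
    "Streams Runner": ("runner.example.com", 8888),
}

def is_listening(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

for name, (host, port) in SERVICES.items():
    status = "reachable" if is_listening(host, port) else "NOT reachable"
    print(f"{name} on {host}:{port} is {status}")
```

A successful connection only shows that something is listening on the port; confirm in each service's logs that it started correctly.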
