Preparing an HDFS-based dataset

When using Talend Data Preparation in a big data context, you can access data stored on HDFS (Hadoop Distributed File System).

In this example, you work for a worldwide online video streaming company. You will retrieve customer information stored on a cluster, create a dataset in Talend Data Preparation, apply various preparation steps to cleanse and enrich this data, and then export it back to the cluster in a new format.

Thanks to the Components Catalog service, the data is not physically stored on the Talend Data Preparation server but is fetched on demand from the cluster. Only a sample is retrieved and displayed in the Talend Data Preparation interface for you to work on.

To use Talend Data Preparation in a big data context, you must fulfill the following prerequisites:

  • The Components Catalog service is installed and running on a Windows or Linux machine.
  • The Spark Job Server is installed and running on a Linux machine.
  • The Streams Runner is installed and running on a Linux machine.
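
Before creating the dataset, it can help to confirm that the three services listed above are actually reachable over the network. The following is a minimal sketch in Python, not part of the Talend tooling: the host names and port numbers in SERVICES are placeholders invented for illustration, and the helper names is_reachable and check_prerequisites are hypothetical. Replace the endpoints with the values from your own installation. The check only verifies that a TCP connection can be opened, which does not guarantee that the service behind the port is healthy.

```python
import socket


def is_reachable(host: str, port: int, timeout: float = 2.0) -> bool:
    """Return True if a TCP connection to host:port can be opened."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and DNS resolution failures.
        return False


# Placeholder endpoints (assumptions, not Talend defaults):
# substitute the hosts and ports configured in your environment.
SERVICES = {
    "Components Catalog": ("catalog.example.com", 10001),
    "Spark Job Server": ("sparkjobserver.example.com", 10002),
    "Streams Runner": ("streamsrunner.example.com", 10003),
}


def check_prerequisites(services: dict) -> dict:
    """Map each service name to True or False depending on TCP reachability."""
    return {
        name: is_reachable(host, port)
        for name, (host, port) in services.items()
    }
```

Calling check_prerequisites(SERVICES) returns a dictionary such as one entry per service name with a boolean value, so any service reported as False can be investigated before you start working in the Talend Data Preparation interface.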