Preparing an HDFS-based dataset

Talend Data Preparation Quick Examples

author
Talend Documentation Team
EnrichVersion
6.4
2.1
EnrichProdName
Talend MDM Platform
Talend Real-Time Big Data Platform
Talend Data Services Platform
Talend Big Data
Talend Data Management Platform
Talend Data Fabric
Talend ESB
Talend Data Integration
Talend Big Data Platform
task
Data Quality and Preparation > Cleansing data
EnrichPlatform
Talend Data Preparation
When using Talend Data Preparation in a big data context, you can access data stored on HDFS (Hadoop Distributed File System).

In this example, you work for a worldwide online video streaming company. You will retrieve some customer information stored on a cluster, create a dataset in Talend Data Preparation, apply various preparation steps to cleanse and enrich this data, and then export it back to the cluster in a new format.

Thanks to the Components Catalog service, the data is not physically stored on the Talend Data Preparation server, but rather fetched on demand from the cluster. Only a sample is retrieved and displayed in the Talend Data Preparation interface for you to work on.

To use Talend Data Preparation in a big data context, you must fulfill the following prerequisites:

  • The Components Catalog service is installed and running on a Windows or Linux machine.
  • The Spark Job Server is installed and running on a Linux machine.
  • The Streams Runner is installed and running on a Linux machine.
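Before going further, you may want to confirm that the prerequisite services above actually respond. The sketch below is a minimal, hypothetical smoke test: the host names and ports are placeholders only (they are not documented here), so substitute the values from your own installation.

```shell
#!/bin/sh
# Hypothetical smoke test for the Talend Data Preparation big data prerequisites.
# All host names and ports below are placeholders — replace them with the
# values configured for your own installation.

check_service() {
  # Prints whether an HTTP endpoint answers within 5 seconds.
  name=$1
  url=$2
  if curl -sf --max-time 5 -o /dev/null "$url"; then
    echo "$name: reachable"
  else
    echo "$name: NOT reachable"
  fi
}

check_service "Components Catalog" "http://catalog-host:8989"
check_service "Spark Job Server"   "http://sjs-host:8090"
check_service "Streams Runner"     "http://runner-host:8888"
```

Run the script from a machine that can reach the cluster; any service reported as not reachable must be started (or its host/port corrected) before creating an HDFS-based dataset.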