Preparing an HDFS-based dataset

Talend Data Preparation Quick Examples

author
Talend Documentation Team
EnrichVersion
6.5
2.3
EnrichProdName
Talend Data Services Platform
Talend Big Data
Talend Real-Time Big Data Platform
Talend Data Integration
Talend Data Fabric
Talend MDM Platform
Talend Big Data Platform
Talend ESB
Talend Data Management Platform
task
Data Quality and Preparation > Cleansing data
EnrichPlatform
Talend Data Preparation
When using Talend Data Preparation in a big data context, you can access data stored on HDFS (Hadoop Distributed File System).

In this example, you work for a worldwide online video streaming company. You will retrieve customer information stored on a cluster, create a dataset in Talend Data Preparation, apply various preparation steps to cleanse and enrich this data, and then export it back to the cluster in a new format.

Through the use of the Components Catalog service, the data is not physically stored on the Talend Data Preparation server, but rather fetched on demand from the cluster. Only a sample is retrieved and displayed in the Talend Data Preparation interface for you to work on.

To use Talend Data Preparation in a big data context, you must fulfill the following prerequisites:

  • The Components Catalog service is installed and running on a Windows or Linux machine.
  • The Spark Job Server is installed and running on a Linux machine.
  • The Streams Runner is installed and running on a Linux machine.
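
Before starting, it can help to confirm that the three services listed above are reachable over the network. The following is a minimal sketch, not part of the Talend tooling: it only tests whether a TCP connection can be opened, which does not prove that a service is healthy. The function names, host names, and port numbers are all hypothetical placeholders; replace them with the values from your own installation.

```python
import socket
from typing import Dict, Tuple


def is_reachable(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port opens within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and name resolution failures.
        return False


def check_services(
    services: Dict[str, Tuple[str, int]], timeout: float = 3.0
) -> Dict[str, bool]:
    """Map each service name to whether its host:port accepted a connection."""
    return {
        name: is_reachable(host, port, timeout)
        for name, (host, port) in services.items()
    }


if __name__ == "__main__":
    # Placeholder hosts and ports (assumptions, not Talend defaults).
    services = {
        "Components Catalog": ("catalog.example.com", 10001),
        "Spark Job Server": ("spark-job-server.example.com", 10002),
        "Streams Runner": ("streams-runner.example.com", 10003),
    }
    for name, ok in check_services(services).items():
        print(f"{name}: {'reachable' if ok else 'NOT reachable'}")
```

A service reported as not reachable may be stopped, or may be listening on a different host or port, or may be blocked by a firewall, so check the installation settings before drawing conclusions.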