Preparing an HDFS-based dataset
In this example, you work for a worldwide online video streaming company. You will retrieve customer information stored on a cluster, create a dataset in Talend Data Preparation, apply various preparation steps to cleanse and enrich the data, and then export it back to the cluster in a new format.
Through the use of the Components Catalog service, the data is not physically stored on the Talend Data Preparation server, but rather fetched on demand from the cluster. Only a sample is retrieved and displayed in the Talend Data Preparation interface for you to work on.
To use Talend Data Preparation in a Big Data context, you must fulfill the following prerequisites:
- The Components Catalog service is installed and running on a Windows or Linux machine.
- The Spark Job Server is installed and running on a Linux machine.
- The Streams Runner is installed and running on a Linux machine.
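Before creating the dataset, it can be useful to confirm that the three required services are reachable from the machine you are working on. The sketch below is a minimal TCP reachability check; the hostnames and ports are placeholders for illustration only, not Talend defaults, so substitute the values from your own deployment.

```python
import socket

def service_reachable(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

# Hypothetical hosts and ports -- replace with your deployment's actual values.
services = {
    "Components Catalog": ("catalog.example.com", 8989),
    "Spark Job Server": ("sjs.example.com", 8098),
    "Streams Runner": ("runner.example.com", 8099),
}

for name, (host, port) in services.items():
    status = "up" if service_reachable(host, port) else "UNREACHABLE"
    print(f"{name}: {status}")
```

A failed connection here usually points to a stopped service or a firewall rule, which is worth resolving before attempting to create the HDFS-based dataset.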