When using Talend Data Preparation in a big data context,
you can access data stored on HDFS (Hadoop Distributed File System).
In this example, you work for a worldwide online video streaming company. You will retrieve customer information stored on a cluster, create a dataset in Talend Data Preparation, apply various preparation steps to cleanse and enrich the data, and then export it back to the cluster in a new format.
Through the Components Catalog service, the data is not physically stored on the Talend Data Preparation server, but rather fetched on demand from the cluster. Only a sample is retrieved and displayed in the Talend Data Preparation interface for you to work on.
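The sampling idea can be pictured with a minimal sketch. This is not Talend's actual API; the `fetch_sample` helper and the simulated record source are hypothetical, and stand in for the Components Catalog pulling only the first records of a large file rather than copying it in full:

```python
from itertools import islice

def fetch_sample(records, sample_size=100):
    """Hypothetical helper: pull only the first sample_size records
    from a (possibly huge) remote source, leaving the rest untouched."""
    return list(islice(records, sample_size))

# Simulate a large customer file sitting on the cluster.
cluster_records = (f"customer_{i},US,premium" for i in range(1_000_000))

# Only a small sample travels to the interface for you to work on.
sample = fetch_sample(cluster_records, sample_size=50)
print(len(sample))
```

Because the source is consumed lazily, the full dataset is never materialized on the Talend Data Preparation server; the preparation you build on the sample is later applied to the complete data on the cluster.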
To use Talend Data Preparation in a big data context, you must fulfill the following prerequisites:
- The Components Catalog service is installed and running on a Windows or Linux machine.
- The Spark Job Server is installed and running on a Linux machine.
- The Streams Runner is installed and running on a Linux machine.