Operationalizing a recipe in a Talend Spark Batch or Streaming Job

Talend Data Preparation User Guide

author
Talend Documentation Team
EnrichVersion
6.4
2.1
EnrichProdName
Talend Real-Time Big Data Platform
Talend Big Data Platform
Talend MDM Platform
Talend Data Fabric
Talend ESB
Talend Big Data
Talend Data Services Platform
Talend Data Integration
Talend Data Management Platform
task
Data Quality and Preparation > Cleansing data
EnrichPlatform
Talend Data Preparation

The tDataprepRun component allows you to reuse an existing preparation made in Talend Data Preparation, directly in a Big Data Job.

In other words, you can operationalize the process of applying a preparation to input data with the same model, in a Spark Streaming or Spark Batch Job.

Let's take the example of a simple Job that:

  • reads customer data from a .csv file on HDFS,
  • applies an existing preparation to this data,
  • outputs the result to a Hive database.

This assumes that a preparation has been created beforehand, on a dataset with the same schema as your input data for the Job. In this case, the existing preparation is called datapreprun_spark. This preparation was made on a dataset containing data about customers from around the world, including their names, email addresses, a subscription date, and the country they live in. This simple preparation applies a filter on the data to only keep customers from China and Russia, harmonizes the date format, and extracts the email parts.
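
To make the three steps of this preparation concrete, here is a minimal Python sketch of equivalent logic. It is not what tDataprepRun executes internally. The input column names (country, email, subscription_date), the output column names, the accepted input date formats, and the target date format are all assumptions made for illustration.

```python
from datetime import datetime

# Assumed input date formats; the formats in the real dataset are not stated.
KNOWN_DATE_FORMATS = ("%Y-%m-%d", "%d/%m/%Y", "%m-%d-%Y")
# Countries kept by the filter step of the preparation.
KEPT_COUNTRIES = {"China", "Russia"}


def harmonize_date(raw):
    """Return the date as YYYY-MM-DD, or the raw value if no format matches."""
    for fmt in KNOWN_DATE_FORMATS:
        try:
            return datetime.strptime(raw, fmt).strftime("%Y-%m-%d")
        except ValueError:
            continue
    return raw


def apply_preparation(rows):
    """Apply the filter, date, and email steps to an iterable of dict rows."""
    prepared = []
    for row in rows:
        # Step 1: keep only customers from the selected countries.
        if row["country"] not in KEPT_COUNTRIES:
            continue
        # Step 3: split the email address into its local part and domain.
        local, _, domain = row["email"].partition("@")
        prepared.append({
            **row,
            # Step 2: rewrite the subscription date in a single format.
            "subscription_date": harmonize_date(row["subscription_date"]),
            "email_local": local,
            "email_domain": domain,
        })
    return prepared
```

Note that the output rows carry two extra columns. This mirrors the way the output schema of a preparation can differ from its input schema.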

Note that if a preparation contains actions that affect only a single row or individual cells, the tDataprepRun component skips them when the Job runs. For example, the Make as header and Delete Row functions do not work in a Big Data context.
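
As an illustration of this skipping behavior, the sketch below separates actions that apply to whole columns from actions that target a single row or cell. The action dictionaries, the scope field and its values, and the "Filter on country" name are hypothetical and do not reflect the component's actual data model.

```python
# Hypothetical scope values for actions that target one row or one cell.
ROW_LEVEL_SCOPES = {"line", "cell"}


def split_actions(actions):
    """Return (applied, skipped) lists, preserving the original step order."""
    applied = [a for a in actions if a["scope"] not in ROW_LEVEL_SCOPES]
    skipped = [a for a in actions if a["scope"] in ROW_LEVEL_SCOPES]
    return applied, skipped


steps = [
    {"name": "Make as header", "scope": "line"},
    {"name": "Filter on country", "scope": "column"},
    {"name": "Delete Row", "scope": "line"},
]
applied, skipped = split_actions(steps)
```

With these assumed steps, only the column-level filter would be applied, and the two row-level actions would be skipped.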

Procedure

  1. In Talend Studio, create a new Spark Batch or Spark Streaming Job.
  2. In the design workspace, add a tHDFSConfiguration, a tFileInputDelimited, a tDataprepRun and a tHiveOutput component.
  3. Link the tFileInputDelimited, tDataprepRun and tHiveOutput components together using two Row > Main links.
  4. Select the tHDFSConfiguration component and click the Run tab to configure the Spark Configuration tab.

    For more information on how to configure the tHDFSConfiguration component, see tHDFSConfiguration properties for Apache Spark Batch or tHDFSConfiguration properties for Apache Spark Streaming.

  5. Select the tFileInputDelimited component and click the Component tab to configure its basic settings.

    Make sure that the schema of the tFileInputDelimited component matches the schema expected by the tDataprepRun component. In other words, the input schema must be the same as the dataset upon which the datapreprun_spark preparation was made in the first place.

  6. Select the tDataprepRun component and click the Component tab to define its basic settings.
  7. In the URL field, type the URL of the Talend Data Preparation Web application.

    Port 9999 is the default port for Talend Data Preparation.

  8. In the Username and Password fields, enter your Talend Data Preparation connection information, between double quotes.
  9. Click Choose an existing preparation to display a list of the preparations available in Talend Data Preparation, and select datapreprun_spark.

    A warning is displayed next to preparations that contain incompatible actions, that is, actions that affect only a single row or cell.

  10. Click Fetch Schema to retrieve the schema of the preparation, datapreprun_spark in this example.

    The output schema of the tDataprepRun component now reflects the changes made by each preparation step. For example, the schema takes into account columns that were added or removed.

  11. Select the tHiveOutput component and click the Component tab to define its basic settings.
  12. Click Sync columns to retrieve the new output schema, inherited from the tDataprepRun component.
  13. Save your Job and press F6 to run it.
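
The schema requirement in step 5 can be checked before running the Job. The following Python sketch compares the input schema with the schema of the dataset the preparation was built on. It assumes that each schema is available as an ordered list of (column name, type) pairs, and that column order matters; both are assumptions, and the helper name is hypothetical.

```python
def schema_mismatches(input_schema, expected_schema):
    """Compare two schemas given as ordered lists of (column_name, type) pairs.

    Returns a list of human-readable problems. An empty list means the
    schemas match in column count, names, order, and types.
    """
    problems = []
    if len(input_schema) != len(expected_schema):
        problems.append(
            f"column count differs: {len(input_schema)} vs {len(expected_schema)}"
        )
    # Compare the columns position by position over the shared length.
    for position, (actual, expected) in enumerate(zip(input_schema, expected_schema)):
        if actual != expected:
            problems.append(f"column {position}: got {actual}, expected {expected}")
    return problems


# Assumed schema of the dataset behind the datapreprun_spark preparation.
expected = [
    ("name", "String"),
    ("email", "String"),
    ("subscription_date", "String"),
    ("country", "String"),
]
```

Calling schema_mismatches with the schema of the tFileInputDelimited component as the first argument returns an empty list when the two schemas line up.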

Results

All the preparation steps of datapreprun_spark have been applied to your data, directly in the flow of your data integration Job.