Running a preparation on Google Cloud Dataflow

Talend Data Preparation User Guide

author
Talend Documentation Team
EnrichVersion
6.4
2.1
EnrichProdName
Talend Real-Time Big Data Platform
Talend Big Data Platform
Talend MDM Platform
Talend Data Fabric
Talend ESB
Talend Big Data
Talend Data Services Platform
Talend Data Integration
Talend Data Management Platform
task
Data Quality and Preparation > Cleansing data
EnrichPlatform
Talend Data Preparation

You can choose to set Google Cloud Dataflow as the Big Data export runtime for your preparations.

Warning: This is a technical preview and no support is available for this feature.

To use this runtime instead of the default one, you must configure both the Streams Runner and the Spark Job Server.

Before you begin

  1. You have a Google Cloud enterprise account and have created a Google Cloud project.
  2. You have installed Talend Data Preparation.
  3. You have installed Streams Runner and Spark Job Server on Linux machines.
  4. You have created a service account on Google Cloud and downloaded the .json file containing the credentials for this service account. This file must be stored on the machine where the Spark Job Server is installed. The service account must have the rights to run Jobs on Google Cloud Dataflow and to access the Google Cloud Storage buckets involved in your Jobs, such as your input and output buckets, as well as the bucket set for tempLocation.
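As a quick check of the last prerequisite, the following is a minimal sketch that can be run on the Spark Job Server machine. The check_sa_key function name is hypothetical and not part of any Talend or Google tool. It only verifies that the file is readable and contains the "type": "service_account" field found in Google service account key files. It does not verify the permissions granted to the account.

```shell
# Hypothetical helper (not a Talend or Google command): sanity-check a
# downloaded service account key file before pointing the Spark Job Server at it.
check_sa_key() {
  key_file="$1"

  # The file must exist as a regular file and be readable by the current user.
  if [ ! -f "$key_file" ] || [ ! -r "$key_file" ]; then
    echo "Cannot read key file: $key_file" >&2
    return 1
  fi

  # Service account key files contain a "type": "service_account" entry.
  if ! grep -q '"type"[[:space:]]*:[[:space:]]*"service_account"' "$key_file"; then
    echo "Not a service account key file: $key_file" >&2
    return 1
  fi

  echo "Key file looks usable: $key_file"
}
```

For example, check_sa_key <path_to_service_account_file> returns 0 and prints a confirmation when the file passes both checks.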

Procedure

  1. Open the <Streams_Runner_installation_path>/conf/application.conf file.
  2. To set Google Cloud Dataflow as the runner type, you can either:
    • Set DataflowRunner as the value of the runner.type parameter.
    • Use the ${?RUNNER_TYPE} environment variable by executing the following command: export RUNNER_TYPE=DataflowRunner
  3. Configure the runner properties by adding the two mandatory parameters, project and tempLocation, and their values to the configuration file.

    In addition to these two parameters, you can complete the runner configuration with other parameters of your choice. For a complete list of the available execution parameters, see the Google documentation.

  4. To configure the Spark Job Server, add the GOOGLE_APPLICATION_CREDENTIALS environment variable by executing the following command: export GOOGLE_APPLICATION_CREDENTIALS=<path_to_service_account_file>

    The variable must point to the .json file that contains the credentials for your Google Cloud service account. This .json file must be located on the machine where the Spark Job Server is installed.

  5. Restart the services.
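The two export commands from the procedure can be grouped as in the sketch below. The configure_dataflow_env function name is hypothetical, and the sketch assumes that the Streams Runner and the Spark Job Server are started from the same shell environment. If they run on different machines, set each variable on the relevant machine instead. Exported variables are only visible to the current shell and to processes started from it, so the services must be restarted from that environment.

```shell
# Hypothetical helper: set the environment variables described in the
# procedure. Assumes both services are launched from this shell.
configure_dataflow_env() {
  key_file="$1"

  # Refuse to export a path that does not point to an existing file.
  if [ ! -f "$key_file" ]; then
    echo "Service account file not found: $key_file" >&2
    return 1
  fi

  # Runner type for the Streams Runner.
  export RUNNER_TYPE=DataflowRunner

  # Credentials of the Google Cloud service account for the Spark Job Server.
  export GOOGLE_APPLICATION_CREDENTIALS="$key_file"
}
```

Call configure_dataflow_env <path_to_service_account_file>, then restart the services. The project and tempLocation parameters still need to be added to the application.conf file as described in the procedure.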

Results

When you export a preparation, the Google Cloud Dataflow runtime is used instead of the default Big Data runtime, depending on the input and output of your data. For more information on which runtime is used according to your input and output, see Export options and runtimes matrix.