Running a preparation on Google Data Flow - 7.3

Talend Data Preparation User Guide

Version
7.3
Language
English
Product
Talend Big Data
Talend Big Data Platform
Talend Data Fabric
Talend Data Integration
Talend Data Management Platform
Talend Data Services Platform
Talend ESB
Talend MDM Platform
Talend Real-Time Big Data Platform
Module
Talend Data Preparation
Content
Data Quality and Preparation > Cleansing data
Last publication date
2023-11-28

You can choose to set Google Cloud Dataflow as the Big Data export runtime for your preparations.

Warning: This is a beta feature. No support is available for it.

To use this runtime instead of the default one, you must adjust the Streams Runner and Spark Job Server configurations.

Before you begin

  1. You have a Google Cloud enterprise account and have created a Google Cloud project.
  2. You have installed Talend Data Preparation.
  3. You have installed Streams Runner and Spark Job Server on Linux machines.
  4. You have created a service account on Google Cloud and downloaded the .json file containing the credentials for this service account. This file must be stored on the machine where the Spark Job Server is installed. The service account must have the rights to run Jobs on Google Cloud Dataflow and to access the Google Cloud Storage buckets involved in your Jobs, such as your input and output buckets, as well as the bucket set for tempLocation.
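
    As an illustration, the following gcloud commands sketch one possible way to create such a service account, grant it rights, and download its credentials file. The account name, project ID, key file path, and roles (roles/dataflow.developer and roles/storage.objectAdmin) are placeholder assumptions; adapt them to your own project and security policy.

      # Create the service account (placeholder names).
      gcloud iam service-accounts create dataprep-dataflow --project my-gcp-project

      # Allow it to run Jobs on Google Cloud Dataflow and to access the buckets used by your Jobs.
      gcloud projects add-iam-policy-binding my-gcp-project \
          --member "serviceAccount:dataprep-dataflow@my-gcp-project.iam.gserviceaccount.com" \
          --role "roles/dataflow.developer"
      gcloud projects add-iam-policy-binding my-gcp-project \
          --member "serviceAccount:dataprep-dataflow@my-gcp-project.iam.gserviceaccount.com" \
          --role "roles/storage.objectAdmin"

      # Download the credentials file on the machine hosting the Spark Job Server.
      gcloud iam service-accounts keys create /opt/talend/credentials/dataflow-sa.json \
          --iam-account dataprep-dataflow@my-gcp-project.iam.gserviceaccount.com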

Procedure

  1. Open the <Streams_Runner_installation_path>/conf/application.conf file.
  2. To set Google Cloud Dataflow as the runner type, do either of the following:
    • Set DataflowRunner as the value of the runner.type parameter.
    • Use the ${?RUNNER_TYPE} environment variable by executing the following command: export RUNNER_TYPE=DataflowRunner
  3. Configure the runner properties by adding the two mandatory parameters, project and tempLocation, and their values to the configuration file.

    In addition to these two parameters, you can complete the runner configuration with other parameters of your choice. For a complete list of the available execution parameters, see the Google documentation. An illustrative configuration sketch is shown after this procedure.

  4. To configure the Spark Job Server, add the GOOGLE_APPLICATION_CREDENTIALS environment variable by executing the following command: export GOOGLE_APPLICATION_CREDENTIALS=<path_to_service_account_file>

    The variable must point to the .json file that contains the credentials for your Google Cloud service account. This .json file must be located on the machine where the Spark Job Server is installed.

  5. Restart the Streams Runner and Spark Job Server services.
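
For reference, here is a minimal sketch of what the runner section of the application.conf file could look like after steps 2 and 3. The key nesting and all values are illustrative assumptions; only runner.type, project, and tempLocation are mandated by this procedure, so keep the structure your file already uses.

  # Excerpt of <Streams_Runner_installation_path>/conf/application.conf (illustrative values)
  runner {
    # Runner type, optionally overridden through the RUNNER_TYPE environment variable
    type = "DataflowRunner"
    type = ${?RUNNER_TYPE}

    # Mandatory Google Cloud Dataflow execution parameters (placeholder project and bucket)
    project = "my-gcp-project"
    tempLocation = "gs://my-bucket/dataflow-temp"
  }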

Results

When exporting a preparation, the Google Cloud Dataflow runtime is used instead of the default Big Data runtime, depending on the input and output of your data. For more information on which runtime is used for each combination of input and output, see Export options and runtimes matrix.