tGoogleDataprocManage Standard properties - 7.1

Google Dataproc


These properties are used to configure tGoogleDataprocManage running in the Standard Job framework.

The Standard tGoogleDataprocManage component belongs to the Cloud family.

The component in this framework is available in all Talend products with Big Data and in Talend Data Fabric.

Basic settings

Project identifier

Enter the ID of your Google Cloud Platform project.

If you are not certain about your project ID, check it in the Manage Resources page of your Google Cloud Platform services.

Cluster identifier

Enter the ID of the Dataproc cluster to be used.

Provide Google Credentials in file

Leave this check box clear when you launch your Job from a machine on which the Google Cloud SDK has been installed and authorized to use your user account credentials to access Google Cloud Platform. In this situation, this machine is often your local machine.

When you launch your Job from a remote machine, such as a Jobserver, select this check box and, in the Path to Google Credentials file field that is displayed, enter the path at which the Google credentials JSON file is stored on the Jobserver machine.

For further information about this Google credentials file, ask the administrator of your Google Cloud Platform project or visit Google Cloud Platform Auth Guide.
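For illustration, the file referred to above is a standard Google service-account key file in JSON format. The helper below is a hypothetical sketch (not part of the component) of reading the project ID from such a file; the field names are those of a standard key file:

```python
import json

# Hypothetical helper: when "Provide Google Credentials in file" is selected,
# the Job reads a Google service-account key file (a JSON document) from the
# path you provide. This helper is illustrative only.
def read_project_id(credentials_path):
    """Return the project_id recorded in a service-account key file."""
    with open(credentials_path) as f:
        key = json.load(f)
    if key.get("type") != "service_account":
        raise ValueError("not a service-account key file")
    return key["project_id"]
```

When the check box is left clear, credentials are instead typically resolved by the Google Cloud SDK on the machine running the Job (for example, after running gcloud auth application-default login).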

Action

Select the action you want tGoogleDataprocManage to perform on your cluster:
  • Start to create a cluster

  • Stop to destroy a cluster
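As a rough sketch, these two actions correspond to cluster creation and deletion requests in the Dataproc v1 REST API. The helper below is hypothetical and only illustrates the request shapes; the exact calls made by the component are an assumption:

```python
# Hypothetical sketch: Start maps to a clusters.create request and Stop to a
# clusters.delete request in the Dataproc v1 API. Shapes are illustrative only.
def build_request(action, project_id, region, cluster_id):
    if action == "Start":   # create a cluster
        return {
            "project_id": project_id,
            "region": region,
            "cluster": {"project_id": project_id, "cluster_name": cluster_id},
        }
    if action == "Stop":    # destroy a cluster
        return {
            "project_id": project_id,
            "region": region,
            "cluster_name": cluster_id,
        }
    raise ValueError("unknown action: " + action)
```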

Version

Select the version of the image to be used to create a Dataproc cluster.

Zone

Select the geographic zone in which the computing resources are used and your data is stored and processed.

A zone, in Google Cloud terms, is an isolated location within a region, another geographical term used by Google Cloud. As for the regions on Google Cloud Platform, the Studio supports only the Global region.

Instance configuration

Enter the parameters that determine the number of master and worker instances to be used by the Dataproc cluster to be created, as well as the machine types that determine the performance of these masters and workers.
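As an illustration, these settings roughly correspond to the master_config and worker_config fields of the Dataproc v1 API's ClusterConfig (num_instances and machine_type_uri are real API fields; the mapping performed by the component is an assumption):

```python
# Hypothetical sketch of the instance settings as Dataproc v1 ClusterConfig
# fields. The helper name and exact mapping are illustrative only.
def instance_config(num_masters, num_workers, machine_type):
    def group(n):
        return {"num_instances": n, "machine_type_uri": machine_type}
    return {
        "master_config": group(num_masters),
        "worker_config": group(num_workers),
    }
```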

Advanced settings

Wait for cluster ready

Select this check box to keep this component running until the cluster is completely set up.

When you clear this check box, this component stops running immediately after sending the cluster creation command.

Master disk size

Enter a number without quotation marks to determine the size (in GB) of the disk of each master instance.

Master local SSD

Enter a number without quotation marks to determine the number of local solid-state drive (SSD) storage devices to be added to each master instance.

According to Google, these local SSDs are suitable only for temporary storage such as caches, processing space, or low-value data. Google recommends storing important data in its durable storage options. For further information about the Google storage options, see Durable storage options.

Worker disk size

Enter a number without quotation marks to determine the size (in GB) of the disk of each worker instance.

Worker local SSD

Enter a number without quotation marks to determine the number of local solid-state drive (SSD) storage devices to be added to each worker instance.

According to Google, these local SSDs are suitable only for temporary storage such as caches, processing space, or low-value data. Google recommends storing important data in its durable storage options. For further information about the Google storage options, see Durable storage options.
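For illustration, the disk size and local SSD count correspond to the boot_disk_size_gb and num_local_ssds fields of the Dataproc v1 API's DiskConfig (both are real API fields; the helper itself is a hypothetical sketch):

```python
# Hypothetical sketch: build the DiskConfig portion of a Dataproc v1
# instance group, as the disk settings above roughly map onto it.
def disk_config(boot_disk_size_gb, num_local_ssds=0):
    cfg = {"boot_disk_size_gb": boot_disk_size_gb}
    if num_local_ssds > 0:
        cfg["num_local_ssds"] = num_local_ssds
    return cfg
```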

Network or Subnetwork

Select either check box to have the cluster to be created use a Google Compute Engine network or subnetwork for intra-cluster communications.

As Google does not allow network and subnetwork to be used concurrently, selecting one check box hides the other check box.

For further information about Google Dataproc cluster network configuration, see Dataproc Network.

Initialization action

In this table, select the initialization actions, available in the shared bucket on Google Cloud Storage, to be run on every node of your Dataproc cluster immediately after the cluster is set up.

If you need to use custom initialization scripts, upload them to this shared Google bucket so that tGoogleDataprocManage can read them.

  • In the Executable file column, enter the Google Cloud Storage URI of each script to be used, for example gs://dataproc-initialization-actions/MyScript.

  • In the Executable timeout column, enter the amount of time, within double quotation marks, that determines the maximum duration of the execution. If the executable has not completed by the end of this timeout, an explanatory error message is returned. The value is a string with up to nine fractional digits, for example, "3.5s" for 3.5 seconds.

For further information about this shared bucket and the initialization actions, see Initialization actions.
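As a sketch, each row of the table corresponds to a NodeInitializationAction entry in the Dataproc v1 API, with an executable_file URI and an execution_timeout duration string (both are real API fields; the helper below and its default timeout are illustrative assumptions):

```python
# Hypothetical sketch: one table row as a Dataproc v1 NodeInitializationAction.
# The timeout is a duration string with up to nine fractional digits, e.g. "3.5s".
def init_action(executable_file, execution_timeout="300s"):
    return {
        "executable_file": executable_file,
        "execution_timeout": execution_timeout,
    }

actions = [init_action("gs://dataproc-initialization-actions/MyScript", "3.5s")]
```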

tStatCatcher Statistics

Select this check box to collect log data at the component level.

Usage

Usage rule

This component is used standalone in a subJob.