tDataprepRun properties for Apache Spark Streaming - 7.3

These properties are used to configure tDataprepRun running in the Spark Streaming Job framework.

The Spark Streaming tDataprepRun component belongs to the Talend Data Preparation family.

This component is available in Talend Real-Time Big Data Platform and Talend Data Fabric.

Basic settings

URL

Type the URL of the Talend Data Preparation web application, between double quotes.

If you are working with Talend Cloud Data Preparation, use the URL for the corresponding data center to access the application, for example, https://tdp.us.cloud.talend.com for the AWS US data center.

For the URLs of available data centers, see Talend Cloud regions and URLs.
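For example, the field can contain either a hardcoded value or a context variable; the variable name tdp_url below is a hypothetical example:

  "https://tdp.us.cloud.talend.com"
  context.tdp_url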

Email

Type the email address that you use to log in to the Talend Data Preparation web application, between double quotes.

Password

Click the [...] button and type your user password for the Talend Data Preparation web application, between double quotes.

If you are working with Talend Cloud Data Preparation and if:

  • SSO is enabled, enter an access token in the field.
  • SSO is not enabled, enter either an access token or your password in the field.

When using the default preparation selection properties:

Preparation

To complete the Preparation field, click Choose an existing preparation and select one of the previously created preparations in a pop-up dialog box. This dialog box shows the name, path, author, and last modification date of each preparation.

Click the button next to the Preparation field to edit, in Talend Data Preparation, the preparation that corresponds to the ID defined in the field.

Version

If you have created several versions of your preparation, you can choose which one you want to use in the Job. To complete the Version field, click Choose a Version to select from the list of existing versions, including the current version of the preparation.

Schema and Edit Schema

A schema is a row description. It defines the number of fields (columns) to be processed and passed on to the next component. When you create a Spark Job, avoid the reserved word line when naming the fields.

Click Edit schema to make changes to the schema. If the current schema is of the Repository type, three options are available:

  • View schema: choose this option to view the schema only.

  • Change to built-in property: choose this option to change the schema to Built-in for local changes.

  • Update repository connection: choose this option to change the schema stored in the repository and decide whether to propagate the changes to all the Jobs upon completion. If you just want to propagate the changes to the current Job, you can select No upon completion and choose this schema metadata again in the Repository Content window.

Click Sync columns to retrieve the schema from the previous component connected in the Job.

Fetch Schema

Click this button to retrieve the schema from the preparation defined in the Preparation field.

When using the Dynamic preparation selection:

Dynamic preparation selection

Select this checkbox to define a preparation path and version using context variables. The preparation will be dynamically selected at runtime.

Preparation path

Use a context variable to define a preparation path. Paths with or without the initial / are supported.

Preparation version

Use a context variable to define the version of the preparation to use. Preparation versions are referenced by their number. For example, to execute version #2 of a preparation, the expected value is "2". To use the current version of the preparation, the expected value is "Current state".
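As a minimal sketch, assuming two context variables named prep_path and prep_version (the names and the sample path are hypothetical), the two fields and the corresponding context defaults could be set as follows:

  Preparation path:    context.prep_path
  Preparation version: context.prep_version

  Context defaults (Contexts tab):
  prep_path    = /marketing/clean_customers
  prep_version = 2

To run the current version of the preparation instead, set prep_version to Current state.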

Schema and Edit Schema

A schema is a row description. It defines the number of fields (columns) to be processed and passed on to the next component. When you create a Spark Job, avoid the reserved word line when naming the fields.

Click Edit schema to make changes to the schema. If the current schema is of the Repository type, three options are available:

  • View schema: choose this option to view the schema only.

  • Change to built-in property: choose this option to change the schema to Built-in for local changes.

  • Update repository connection: choose this option to change the schema stored in the repository and decide whether to propagate the changes to all the Jobs upon completion. If you just want to propagate the changes to the current Job, you can select No upon completion and choose this schema metadata again in the Repository Content window.

Click Sync columns to retrieve the schema from the previous component connected in the Job.

Fetch Schema

Click this button to dynamically retrieve the schema from the preparations defined by the context variable in the Preparation path field. If the fetch is successful, any previously configured schema will be overwritten. If the fetch fails, the current schema is kept.

Advanced settings

Encoding

Select an encoding mode from this list. You can select Custom from the list to enter an encoding method in the field that appears.
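For example, to process data encoded in Latin-9, you could select Custom from the list and type the following value, between double quotes, in the field that appears (the encoding name is given as an illustration):

  "ISO-8859-15"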

Global Variables

Global Variables

ERROR_MESSAGE: the error message generated by the component when an error occurs. This is an After variable and it returns a string. This variable functions only if the Die on error check box is cleared, if the component has this check box.

A Flow variable functions during the execution of a component while an After variable functions after the execution of the component.

To fill in a field or expression with a variable, press Ctrl + Space to access the variable list and choose the variable to use from it.
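For example, selecting the ERROR_MESSAGE variable of this component from that list inserts a Java reference of the following form; the component name tDataprepRun_1 is an assumption that depends on your Job:

  ((String)globalMap.get("tDataprepRun_1_ERROR_MESSAGE"))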

For further information about variables, see Talend Studio User Guide.

Usage

Usage rule

This component is an intermediary step. It requires an input flow as well as an output flow.

Limitations

  • If the dataset is updated after the tDataprepRun component has been configured, the schema needs to be fetched again.

  • If a context variable was used in the URL of the dataset, you cannot use the button to edit the preparation directly in Talend Data Preparation.

  • The Make as header and Delete row functions, as well as any modification of a single cell, are ignored by the tDataprepRun component. These functions affect only a single row or cell and are thus not compatible with a Big Data context. In the list of existing preparations to choose from, a warning is displayed next to preparations that include incompatible actions.

  • With the 7.0 version of Talend Data Fabric, when using Spark 1.6, the tDataprepRun component only works with Cloudera version 5.12 or 5.13. There is no Cloudera version restriction with Spark 2.0.

Yarn cluster mode

When the Yarn cluster mode is selected, the Job driver is not executed on a local machine, but on a machine from the Hadoop cluster. Because it is not possible to know in advance on which node of the cluster the Job will be executed, you must make sure that the Talend Data Preparation server is accessible from all the cluster nodes.