tStandardizeRow properties for Apache Spark Streaming - 7.3

Standardization

Version
7.3
Language
English
Product
Talend Big Data
Talend Big Data Platform
Talend Data Fabric
Talend Data Integration
Talend Data Management Platform
Talend Data Services Platform
Talend ESB
Talend MDM Platform
Talend Real-Time Big Data Platform
Module
Talend Studio
Last publication date
2024-02-21

These properties are used to configure tStandardizeRow running in the Spark Streaming Job framework.

The Spark Streaming tStandardizeRow component belongs to the Data Quality family.

This component is available in Talend Real-Time Big Data Platform and Talend Data Fabric.

Basic settings

Schema and Edit schema

A schema is a row description. It defines the number of fields (columns) to be processed and passed on to the next component. When you create a Spark Job, avoid the reserved word "line" when naming the fields.

Built-In: You create and store the schema locally for this component only.

Repository: You have already created the schema and stored it in the Repository. You can reuse it in various projects and Job designs.

Column to parse

Select the column to be parsed from the received data flow.

Standardize this field

Select this check box to standardize the rule-compliant data identified, that is, to replace the duplicates of the identified data with the corresponding standardized data from a given index.

For further information about this index providing standardized data, see tSynonymOutput.

Every time you select or clear this check box, the schema of this component changes automatically. In a given Job, click the activated Sync columns button to fix the resulting inconsistencies in the schema.

Generate analyzer code as routine

Click this button to enable the data parser of your Studio to recognize the rules defined in the Conversion rules table.

In a given Job, this operation is required before a newly created rule can be executed. For an existing rule that you have modified, it is required only when the modified rule is of type Enumeration, Format or Combination. For further information about all of the rule types, see Rule types.
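The condition above can be restated as a small decision function. This is only a sketch of the rule as worded in this section, not code from the Studio; the function name and parameters are hypothetical.

```python
# Rule types that, per the text above, require regenerating the analyzer
# code after an existing rule is modified.
TYPES_NEEDING_REGENERATION = {"Enumeration", "Format", "Combination"}


def needs_regeneration(is_new_rule: bool, rule_type: str) -> bool:
    """Return True when 'Generate analyzer code as routine' must be clicked.

    Hypothetical helper: a new rule always needs it, a modified rule
    needs it only for the three types listed above.
    """
    if is_new_rule:
        return True
    return rule_type in TYPES_NEEDING_REGENERATION


if __name__ == "__main__":
    print(needs_regeneration(True, "Index"))
    print(needs_regeneration(False, "Index"))
    print(needs_regeneration(False, "Format"))
```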

Import and export buttons

Click the import or export button to exchange a given standardization rule set with the DQ Repository.

- When you click the export button, your Studio switches to the Profiling perspective and the Parser rule Settings view opens in the workspace with the relevant contents filled in automatically. If need be, you can then edit the exported rule set and save it to the Libraries > Rules > Parser folder in the DQ Repository tree view.

- When you click the import button, an import wizard opens to help you import the standardization rule of interest.

For further information, see Talend Studio User Guide.

Conversion rules

Define the rules you need to apply as follows:

- In the Name column, type in a name of the rule you want to use. This name is used as the XML tag or the JSON attribute name and the token name to label the incoming data identified by this rule.

- In the Type column, select the type of the rule you need to apply. For further information about available rule types, see Rule types.

- In the Value column, type in the syntax of the rule.

- In the Search mode column, select a search mode from the list. The search modes can be used only with the Index rule type. For further information about available search modes, see Search modes for Index rules.
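As a rough illustration of the four columns described above, the sketch below models one row of the Conversion rules table as a data class. It is not part of the component. The class name, field names and sample values are hypothetical, and only the rule types named in this section are used. The one constraint it enforces comes from the text: a search mode can be set only on an Index rule.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class ConversionRule:
    """One row of the Conversion rules table (hypothetical model)."""
    name: str                          # Name column: becomes the XML tag / JSON attribute
    rule_type: str                     # Type column, for example "Index" or "Format"
    value: str                         # Value column: the rule syntax
    search_mode: Optional[str] = None  # Search mode column: Index rules only

    def __post_init__(self) -> None:
        if not self.name:
            raise ValueError("a rule needs a name")
        if self.search_mode is not None and self.rule_type != "Index":
            raise ValueError("search modes can be used only with the Index rule type")


if __name__ == "__main__":
    # Placeholder values; the real Value syntax depends on the rule type.
    rule = ConversionRule("Brand", "Index", "<index reference>", "Match partial")
    print(rule)
```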

A test view is provided to help you create the parser rules of interest. For further information, see Talend Studio User Guide.

Advanced settings

Advanced options for INDEX rules

- Search UNDEFINED fields: select this check box if you want the component to search for undefined tokens in the index run results.

- Word distance for partial match (available for the Match partial mode): set the maximum number of words allowed to appear inside a sequence of words that may be found in the index. The default value is 1.

- Max edits for fuzzy match (based on the Levenshtein algorithm and available for fuzzy modes): select an edit distance, 1 or 2, from the list. Any terms within the edit distance from the input data are matched. With a max edit distance of 2, for example, you can have up to two insertions, deletions or substitutions. The score for each match is based on the edit distance of that term.

Using Max edits for fuzzy match considerably improves the performance of fuzzy matching.

Note:

Jobs migrated in the Studio from older releases run correctly, but results might be slightly different because Max edits for fuzzy match is now used in place of Minimum similarity for fuzzy match.
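To make the Max edits for fuzzy match option more concrete, here is a minimal sketch of Levenshtein-based matching. It is a plain textbook implementation and does not reproduce how the component queries its index. The function names, the case-insensitive comparison and the use of the raw distance as the score are all assumptions made for illustration.

```python
from typing import List, Tuple


def levenshtein(a: str, b: str) -> int:
    """Edit distance counting insertions, deletions and substitutions."""
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(min(
                previous[j] + 1,         # delete ch_a
                current[j - 1] + 1,      # insert ch_b
                previous[j - 1] + cost,  # substitute, or keep on a match
            ))
        previous = current
    return previous[-1]


def fuzzy_matches(term: str, index_terms: List[str],
                  max_edits: int = 1) -> List[Tuple[str, int]]:
    """Return index terms within max_edits of term, closest first.

    Hypothetical helper: the distance itself stands in for the score.
    """
    if max_edits not in (1, 2):
        raise ValueError("max_edits must be 1 or 2")
    scored = [(c, levenshtein(term.lower(), c.lower())) for c in index_terms]
    return sorted((p for p in scored if p[1] <= max_edits), key=lambda p: p[1])


if __name__ == "__main__":
    index = ["Talend", "Tableau"]
    print(fuzzy_matches("Talned", index, max_edits=1))
    print(fuzzy_matches("Talned", index, max_edits=2))
```

In this sketch a swapped pair of letters counts as two substitutions, so "Talned" reaches "Talend" only with a max edit distance of 2.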

Output format

- XML: this option is selected by default. It outputs normalized data in XML format.

- JSON: select this option to output normalized data in JSON format.
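The following sketch illustrates the difference between the two formats, using rule names as XML tags or JSON attribute names as described for the Name column of the Conversion rules table. The exact layout produced by the component is not specified in this section, so the root element name `record`, the flat JSON object, and the sample rule names and values are assumptions.

```python
import json
import xml.etree.ElementTree as ET
from typing import List, Tuple

# (rule name, identified text) pairs; sample data for illustration only.
Tokens = List[Tuple[str, str]]


def to_xml(tokens: Tokens, root_tag: str = "record") -> str:
    """Wrap each token in an element named after its rule (assumed layout)."""
    root = ET.Element(root_tag)
    for rule_name, text in tokens:
        ET.SubElement(root, rule_name).text = text
    return ET.tostring(root, encoding="unicode")


def to_json(tokens: Tokens) -> str:
    """Use each rule name as a JSON attribute name (assumed layout)."""
    return json.dumps(dict(tokens))


if __name__ == "__main__":
    sample = [("Brand", "Acme"), ("Size", "10 mm")]
    print(to_xml(sample))
    print(to_json(sample))
```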

Usage

Usage rule

This component, along with the Spark Streaming component Palette it belongs to, appears only when you are creating a Spark Streaming Job.

This component is used as an intermediate step.

You need to use the Spark Configuration tab in the Run view to define the connection to a given Spark cluster for the whole Job.

This connection is effective on a per-Job basis.

For further information about a Talend Spark Streaming Job, see the sections describing how to create, convert and configure a Talend Spark Streaming Job in the Talend Big Data Getting Started Guide.

Note that in this documentation, unless otherwise explicitly stated, a scenario presents only Standard Jobs, that is to say traditional Talend data integration Jobs.

Connections

Outgoing links (from this component to another):

Row: Main; Reject

Incoming links (from one component to this one):

Row: Main; Reject

For further information regarding connections, see Talend Studio User Guide.

Spark Connection

In the Spark Configuration tab in the Run view, define the connection to a given Spark cluster for the whole Job. In addition, since the Job expects its dependent jar files for execution, you must specify the directory in the file system to which these jar files are transferred so that Spark can access these files:
  • Yarn mode (Yarn client or Yarn cluster):
    • When using Google Dataproc, specify a bucket in the Google Storage staging bucket field in the Spark configuration tab.

    • When using HDInsight, specify the blob to be used for Job deployment in the Windows Azure Storage configuration area in the Spark configuration tab.

    • When using Altus, specify the S3 bucket or the Azure Data Lake Storage for Job deployment in the Spark configuration tab.
    • When using Qubole, add a tS3Configuration to your Job to write your actual business data in the S3 system with Qubole. Without tS3Configuration, this business data is written in the Qubole HDFS system and destroyed once you shut down your cluster.
    • When using on-premises distributions, use the configuration component corresponding to the file system your cluster is using. Typically, this system is HDFS, in which case you use tHDFSConfiguration.

  • Standalone mode: use the configuration component corresponding to the file system your cluster is using, such as tHDFSConfiguration or tS3Configuration.

    If you are using Databricks without any configuration component present in your Job, your business data is written directly in DBFS (Databricks Filesystem).

This connection is effective on a per-Job basis.