Big Data: new features - 7.3

Talend Data Fabric Release Notes

author: Talend Documentation Team
EnrichVersion: 7.3
EnrichProdName: Talend Data Fabric
task: Installation and Upgrade, Release Notes

Spark Job designer enhancements

Feature

Description

ADLS Gen2
Azure Data Lake Storage Gen2 is now supported with the following Big Data platforms:
  • Databricks V5.5 LTS
  • Cloudera CDH V6.1
  • Hortonworks Data Platform V3.1
Snowflake
The Snowflake components for Spark Batch are now officially supported and are no longer in technical preview.
Native Datasets
In Spark Batch Jobs, support for native Spark Datasets has been added to more components to improve performance. To benefit from this enhancement, use Spark V2.0 or later with the following components:
  • tFileInputParquet and tFileOutputParquet
  • tFileInputDelimited and tFileOutputDelimited
  • tFileInputFullRow
  • tFileInputPositional and tFileInputRegex
  • tSortRow, tExtractDelimitedFields, tExtractPositionalFields, tExtractRegexFields, tExtractXMLField, tExtractJSONFields, tNormalize, tReplace, tReplicate, tSample, tUnite and tSchemaComplianceCheck.
The following components require Spark V2.1 or later to support Spark Datasets:
  • tAggregateRow
  • Left Outer Join in tMap, in addition to the tMap features that have had support for Datasets since Talend Studio V7.2.
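As a reminder of the semantics behind the Left Outer Join listed above, every row of the main flow is kept, and lookup columns are set to null when no lookup row matches. The following plain Python sketch illustrates that behavior outside of Talend. The function name left_outer_join and the sample rows are hypothetical, and the sketch does not reflect the code that tMap generates.

```python
def left_outer_join(main_rows, lookup_rows, key):
    """Join two lists of dict rows on `key`, keeping every main row."""
    # Index the lookup rows by join key so each main row is matched quickly.
    index = {}
    for row in lookup_rows:
        index.setdefault(row[key], []).append(row)

    # Columns contributed by the lookup side (everything except the key).
    lookup_cols = sorted({c for r in lookup_rows for c in r if c != key})

    joined = []
    for row in main_rows:
        matches = index.get(row[key])
        if matches:
            # One output row per matching lookup row.
            for m in matches:
                joined.append({**row, **{c: m.get(c) for c in lookup_cols}})
        else:
            # No match: keep the main row and fill lookup columns with None.
            joined.append({**row, **{c: None for c in lookup_cols}})
    return joined


# Hypothetical sample data.
orders = [{"customer_id": 1, "amount": 50}, {"customer_id": 2, "amount": 75}]
customers = [{"customer_id": 1, "name": "Ada"}]

result = left_outer_join(orders, customers, "customer_id")
for r in result:
    print(r)
```

The second order has no matching customer, so it still appears in the result with its name column set to None. An inner join would drop that row instead.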
Delta Lake
The tDeltaLakeInput and tDeltaLakeOutput components are no longer in technical preview.
Apache Spark V2.4
This Apache Spark version is now supported with more Big Data platforms in Spark Batch and Spark Streaming Jobs. The platforms that now support Spark V2.4 are:
  • Cloudera CDH V6.1.1
  • Databricks V5.5
  • Google Cloud Dataproc V1.4
Job status
With Databricks, users can now configure how often the Studio polls the Spark cluster for Job status.
tS3Configuration
With Amazon EMR, users can now apply an S3 bucket policy.
tAggregateRow
In Spark Batch Jobs, the Count (distinct) function and the Sample Standard Deviation Algorithm function have been added.
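To make the two new tAggregateRow functions concrete, the following plain Python sketch shows what a distinct count and a sample standard deviation compute. It is an illustration only and not the component's implementation. The function names, the sample values, and the choice to ignore null values are assumptions made for this example.

```python
import math


def count_distinct(values):
    """Count the distinct non-null values (null handling is an assumption)."""
    return len({v for v in values if v is not None})


def sample_std_dev(values):
    """Sample standard deviation: divides by n - 1 rather than n."""
    xs = [float(v) for v in values if v is not None]
    n = len(xs)
    if n < 2:
        # Undefined for fewer than two observations.
        return None
    mean = sum(xs) / n
    return math.sqrt(sum((x - mean) ** 2 for x in xs) / (n - 1))


# Hypothetical column values for one aggregation group.
amounts = [10, 20, 20, 30, None]

print(count_distinct(amounts))
print(sample_std_dev(amounts))
```

The sample form divides the sum of squared deviations by n - 1, which is what distinguishes it from the population standard deviation, which divides by n.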
New driver versions
Support for the following driver versions has been added to the related components:
  • Redshift JDBC driver V1.23.7.106
  • MySQL driver V8.0.18
  • Teradata JDBC driver V16.20.00.13
  • MariaDB JDBC driver V2.5.3 in JDBC components
  • Snowflake JDBC driver V3.11.x

New components available

Two new components are now available: tAzureAdlsGen2Input and tAzureAdlsGen2Output.

Support for Big Data platforms

Feature

Description

Databricks
  • Databricks V5.5 LTS is now supported by Spark Jobs.
  • Support for transient clusters of Azure Databricks has been added.
Hortonworks Data Platform
  • Hortonworks Data Platform V3.1 is supported.
  • The Hortonworks Data Platform V3.x series is now officially available among the Dynamic Distributions. It is no longer in technical preview.

Google Cloud Dataproc
  • Google Cloud Dataproc V1.4 is supported.
  • In Standard Jobs, tGoogleDataprocManage supports all regions.
Custom Hadoop configuration
When defining connections to Cloudera or Hortonworks in the Repository, users can now specify a custom JAR file that provides the connection parameters of the Hadoop environment to be used.

Other components

Feature

Description

Kafka
Kafka V2.2.1 is now officially supported with:
  • Cloudera CDH V6.1
  • Hortonworks Data Platform V3.1
  • Kafka components in Standard Jobs
Google BigQuery
  • In tBigQueryBulkExec, users can now drop tables with either a service account or their OAuth 2.0 credentials.
  • The BigQuery components now support Google Cloud client API 1.25.10.
Couchbase
  • tCouchbaseOutput now allows users to perform N1QL queries with parameters.
  • Non-JSON documents are supported.

CXF

CXF V3.3.4 is now supported in the following components:

  • tDBFSConnection, tDBFSGet, tDBFSPut
  • tHCatalogInput, tHCatalogLoad, tHCatalogOperation, tHCatalogOutput

MongoDB

Support for MongoDB V4.2.x has been added to the MongoDB components in Standard Jobs.