What you need to know about some databases - 7.1

Talend Real-time Big Data Platform Studio User Guide

author
Talend Documentation Team
EnrichVersion
7.1
EnrichProdName
Talend Real-Time Big Data Platform
task
Design and Development
EnrichPlatform
Talend Studio

MySQL

When creating a connection to MySQL via JDBC, it is not mandatory to include the database name to the JDBC URL. Regardless of whether the database connection URL specified in the JDBC URL field includes the database name, all databases are retrieved.

For example, if you specify jdbc:mysql://192.168.33.41:3306/tbi?noDatetimeStringSync=true, where tbi is the database name, or jdbc:mysql://192.168.33.41:3306/?noDatetimeStringSync=true, all databases are retrieved.

To support surrogate pairs, you need to edit the following properties in the MySQL server configuration file:

[client]
default-character-set=utf8mb4
[mysql]
default-character-set=utf8mb4
character-set-server=utf8mb4

Microsoft SQL Server

Microsoft SQL Server 2012 and later are supported.

If you select to connect to the Microsoft SQL Server database with Windows Authentication Mode, you can select Microsoft or JTDS open source from the Db Version list.

When using a Microsoft SQL Server database to store report results, JTDS open source is the only driver supported. Then, it is recommended to select JTDS open source from the Db Version list to avoid conflicts between Microsoft and jTDS drivers.

Before creating a connection to the Microsoft SQL Server database using the jTDS driver:
  • Download the jTDS driver version 1.3.1 from http://jtds.sourceforge.net/.
  • Extract the files from the archive and copy the ntlmauth.dll file under x64/SSO or x86/SSO, according to your operating system.
  • Paste the ntlmauth.dll file to %SYSTEMROOT%/system32

If you encounter the following error: SSO Failed: Native SSPI library not loaded, paste the ntlmauth.dll to the bin folder of the JRE used by Talend Studio.

The collation used by the Microsoft SQL Server database must be case-insensitive, otherwise the report generation may not succeed. You may encounter errors like java.sql.SQLException: Invalid column name 'rep_runtime'. For more information about collation rules, see https://docs.microsoft.com/en-us/sql/t-sql/statements/windows-collation-name-transact-sql?view=sql-server-2017.

Teradata

If you select to connect to the Teradata database, please select the Yes option next to USE SQL Mode to enable the studio to use the SQL queries to retrieve metadata. The JDBC driver is not recommended with this database because of possible bad performance.

In the Teradata database, the regular expression function is installed by default only starting from version 14. If you want to use regular expressions with older versions of this database, you must install a User Defined Function in Teradata and add the indicator definition for Teradata in Talend Studio. To learn more about using regular expressions with Teradata, see the documentation on Using regular expressions with Teradata (https://help.talend.com).

Netezza

The Netezza database does not support regular expressions. If you want to use regular expressions with this database, you must:

  • Install the SQL Extensions Toolkit package on a Netezza system. Use the regex_like function provided in this toolkit in the SQL template as documented in http://pic.dhe.ibm.com/infocenter/ntz/v7r0m3/topic/com.ibm.nz.sqltk.doc/r_sqlext_regexp_like.html.
  • Add the indicator definition for Netezza in the Pattern Matching folder in Talend Studio under Libraries > Indicators > System Indicators.

    The query template you need to define for Netezza is as the following: SELECT COUNT(CASE WHEN REGEXP_LIKE(<%=COLUMN_NAMES%>,<%=PATTERN_EXPR%>) THEN 1 END), COUNT FROM <%=TABLE_NAME%> <%=WHERE_CLAUSE%>. For a detail procedure about how to add an indicator definition for a specific database, see Defining a query template for a specific database.

Hive

If you select to connect to the Hive database, you will be able to create and execute different analyses as with the other database types.

In the connection wizard, you must select from the Distribution list the platform that hosts Hive. You must also set the Hive version and model. For further information, see Centralizing database metadata and http://hadoop.apache.org/.

If you decide to change the user name in an embedded mode of a Hive connection, you must restart the studio before being able to successfully run the profiling analyses that use the connection. For further information about the use of user credentials with Hive, see the documentation on How user credentials work in Hive embedded mode in Talend Studio (https://help.talend.com).

If the Hadoop distribution to be used is Hortonworks Data Platform V1.2 or Hortonworks Data Platform V1.3, you must set proper memory allocations for the map and reduce computations to be performed by the Hadoop system. In the second step in the connection wizard:
  1. Click the button next to Hadoop Properties and in the open dialog box click the [+] button to add two lines to the table.
  2. Enter the parameters names as mapred.job.map.memory.mb and mapred.job.reduce.memory.mb.
  3. Set their values to the by-default value 1000.

    This value is normally appropriate for running the computations.

Note that one analysis type and few indicators and functions are still not supported for Hive, see the table below for more detail:
Unsupported indicators Unsupported functions Unsupported analyses
With SQL engine:

Soundex Low Frequency

Pattern(Low) Frequency

Upper Quartile and Lower Quartile

Median

All Date Frequency indicators

The View rows contextual menu for column analyses with unique, duplicate and all textual indicators. For further information on the View rows menu, see Viewing and exporting analyzed data.

The View match rows contextual menu for column analyses with unique, duplicate and all textual indicators. For further information on View match rows, see Comparing identical columns in different tables.

All contextual menus on the analysis results of functional dependency analysis. For further information on this analysis, see Detecting anomalies in columns (Functional Dependency Analysis).

The only analysis that is not supported for Hive is Time Correlation Analysis as the Date data type does not exist in Hive. For further information on this analysis type, see Time correlation analyses.

Also, all right-click options on analysis results that generate Jobs to validate, standardize or deduplicate data are still not supported for Hive. For further information about these Jobs, see Validating data.

Hive and HBase

When you select to connect to a Hive or a HBase database to create and execute different analyses, then in the connection wizard, you must, as explained above, select from the Distribution list the platform that hosts Hive or HBase.

If the Hadoop distribution to be used is Hortonworks Data Platform V2.0 (YARN), you must set the following parameter in the Hadoop Properties table:
  • The parameter is
    
                         yarn.application.classpath
                      
  • The value is
    
                         /etc/hadoop/conf,/usr/lib/hadoop/,/usr/lib/hadoop/lib/,/usr/lib/hadoop-hdfs/,/usr/lib/hadoop-hdfs/lib/,/usr/lib/hadoop-yarn/,/usr/lib/hadoop-yarn/lib/,/usr/lib/hadoop-mapreduce/,/usr/lib/hadoop-mapreduce/lib/
                      

Oracle

To support surrogate pairs, the NLS_CHARACTERSET parameter of the database must be set to UTF8 or AL32UTF8.

The default NLS_CHARACTERSET parameters are:

  • NLS_CHARACTERSET=WE8ISO8859P15

  • NLS_NCHAR_CHARACTERSET=AL16UTF16

Note:

To check the database parameters, you can run the following SQL query:

SQL> SELECT * FROM NLS_DATABASE_PARAMETERS;