Connecting to a database - 6.2

Talend Data Management Platform Studio User Guide

EnrichVersion
6.2
EnrichProdName
Talend Data Management Platform
task
Data Quality and Preparation
Design and Development
EnrichPlatform
Talend Studio

Before proceeding to analyze data in a specific database, you must first set up the connection to this database. From the Profiling perspective of the studio, you can create a connection on the DataBase Management System (DBMS) and show database content.

The databases you can analyze from the Studio include Hive and Amazon Redshift. For further information about what databases you can profile, see What databases are supported from the Profiling perspective and the section about supported databases in the Talend Installation Guide.

Connections to different databases are reflected by different tree levels and different icons in the DQ Repository tree view because the logical and physical structure of data differs from one relational database to another. The highest level structure "Catalog" followed by "Schema" and finally by "Table" is not applicable to all database types. For further information, see Catalogs and schemas in database systems.

How to create a connection

Prerequisite(s): You have selected the Profiling perspective of the studio.

To create a database connection, do the following:

  1. In the DQ Repository tree view, expand Metadata, right-click DB Connections and select Create DB Connection.

    The [Database Connection] wizard opens.

  2. In the Name field, enter a name for this new database connection.

    Do not use spaces in the connection name.

    Note

    Avoid using special characters in the item names including:

    "~", "!", "`", "#", "^", "&", "*", "\\", "/", "?", ":", ";", "\"", ".", "(", ")", "'", "¥", "'", """, "«", "»", "<", ">".

    These characters are all replaced with "_" in the file system and you may end up creating duplicate items.

  3. If required, set other connection metadata (purpose, description and author name) in the corresponding fields and click Next to proceed to the next step.

  4. In the DB Type field and from the drop-down list, select the type of database to which you want to connect. For example, MySQL.A warning message pops up if the database type you have selected cannot be used in the Profiling perspective.

    For further information about supported databases, see the Talend Installation Guide.

    Note

    If you select to connect to a database that is not supported in the studio (using the ODBC or JDBC methods), it is recommended to use the Java engine to execute the column analyses created on the selected database. For more information on column analyses, see Defining the columns to be analyzed and setting indicators, and for more information on the Java engine, see Using the Java or the SQL engine.

  5. In the DB Version field, select the version of the database to which you are creating the connection.

  6. Enter your login, password, server and port information in their corresponding fields.

    If needed, click the button to show all connection fields without having to scroll down in the wizard.

  7. In the Database field, enter the database name you are connecting to. If you need to connect to all of the catalogs within one connection, if the database allows you to, leave this field empty.

  8. Click the Check button to verify if your connection is successful.

    If you have not already installed the database driver (.jar file) necessary to use the database, you will have a wizard prompting you to install the relative third-party module, click Download and Install and then close the wizard.

    For further information about identifying and installing external modules, see the Talend Installation Guide or click the How to install a driver link in the wizard.

    For further information about the Module view, see Displaying the Module view.

  9. If needed, export your connection as a context and centralize it under the Context node in the Integration perspective of your Studio. This enables you to reuse this context in the data quality analyses that use the current connection. You can also create different context parameters for the same connection and later choose to execute an analysis on one specific context. For further information, see Using context variables to connect to data sources.

  10. Click Finish to close the [Database Connection] wizard.

    A folder for the created database connection is displayed under DB Connection in the DQ Repository tree view and under the Metadata node in the Integration perspective. The connection editor opens with the defined metadata in the studio.

    Once you create the connection, you can open in the studio a preview of the data in a specific database table. For further information, see Previewing data in the SQL editor.

From the connection editor, you can:

  • Click Connection information to show the connection parameters for the relevant database.

  • Click the Check button to check the status of your current connection.

  • Click the Edit... button to open the connection wizard and modify any of the connection information.

For information on how to set up a connection to a file, see Connecting to a file.

How to create a connection from a catalog or a schema

You can create a connection on a database catalog or schema directly from a database connection.

Prerequisite(s): At least one database connection is set in the Profiling perspective of the studio. For further information, see Connecting to a database

  1. In the DQ Repository tree view, expand Metadata > DB Connections and browse to the catalog or schema on which you want to create the connection.

  2. Right-click a catalog or schema and select Create a new connection.

    A confirmation message is displayed.

  3. Click OK.

    A new connection named after the selected connection and catalog is created under DB Connections.

How to create a connection to a custom database

The database connection wizard in the studio lists the databases to which you can create a connection and do profiling and integration processes.

Currently, fewer databases in the list are supported for profiling data. For further information, see What databases are supported from the Profiling perspective and the section about supported databases in the Talend Installation Guide.

You can still use the studio to connect to a custom "unsupported" database. To do this, you need to choose General JDBC as the database type in the connection wizard and then fill in the connection parameters. For further information, see Centralizing JDBC metadata.

After creating the connection to a custom database, you can profile and monitor data in this database by using different analyses and indicators, as you do with supported databases. But you may need to change, in the Indicator Settings editor, the SQL query template for some indicators, such as the regex indicator where each database has a different function to call. For further information, see How to edit a system indicator and How to edit a user-defined indicator.

Note

If you have a problem profiling a custom database even though you use a JDBC connection, the reason could be that some JDBC functions are not implemented by the JDBC driver library. Please raise an issue or ask support via the Talend Community Website at:

https://community.talend.com.

What databases are supported from the Profiling perspective

You can set a connection to a database either from the Profiling perspective or from the Integration perspective. The database connection you create are listed under the Metadata node in both perspectives. You can then use this connection in your data quality analyses and data integration Jobs.

However, fewer databases are supported for profiling data than those supported for integrating data, this is why the database list in the connection wizard is much longer in the Integration perspective. Also, in the Profiling perspective, the connections created on unsupported databases are labeled as Unsupported.

For further information about what databases are supported for profiling data, see the section about supported databases in the Talend Installation Guide.

What you need to know about some databases

Teradata:

If you select to connect to the Teradata database, please select the Yes option next to USE SQL Mode to enable the studio to use the SQL queries to retrieve metadata. The JDBC driver is not recommended with this database because of possible bad performance.

In the Teradata database, the regular expression function is installed by default only starting from version 14. If you want to use regular expressions with older versions of this database, you must install a User Defined Function in Teradata and add the indicator definition for Teradata in the studio. For further information, check the article How to configure regular expressions on Teradata.

Netezza:

The Netezza database does not support regular expressions. If you want to use regular expressions with this database, you must:

Hive:

If you select to connect to the Hive database, you will be able to create and execute different analyses as with the other database types.

In the connection wizard, you must select from the Distribution list the platform that hosts Hive. You must also set the Hive version and model. For further information, see Managing Metadata and http://hadoop.apache.org/.

-If you decide to change the user name in an embedded mode of a Hive connection, you must restart the studio before being able to successfully run the profiling analyses that use the connection. For more information about the use of user credentials with Hive see the article How user credentials work in Hive embedded mode in studio.

-If the Hadoop distribution to be used is Hortonworks Data Platform V1.2 or Hortonworks Data Platform V1.3, you must set proper memory allocations for the map and reduce computations to be performed by the Hadoop system. In the second step in the connection wizard:

  1. Click the button next to Hadoop Properties and in the open dialog box click the [+] button to add two lines to the table.

  2. Enter the parameters names as mapred.job.map.memory.mb and mapred.job.reduce.memory.mb.

  3. Set their values to the by-default value 1000.

    This value is normally appropriate for running the computations.

Note that one analysis type and few indicators and functions are still not supported for Hive, see the table below for more detail:

Unsupported indicators

Unsupported functionsUnsupported analyses

with SQL engine:

-Soundex Low Frequency.

-Pattern(Low) Frequency.

-Upper Quartile and Lower Quartile.

-Median.

- All Date Frequency indicators.

-the View rows contextual menu for column analyses with unique, duplicate and all textual indicators.

For further information on the View rows menu, see Viewing and exporting analyzed data.

-the View match rows contextual menu for column analyses with unique, duplicate and all textual indicators.

For further information on View match rows, see Comparing identical columns in different tables.

-all contextual menus on the analysis results of functional dependency analysis.

For further information on this analysis, see Detecting anomalies in columns (Functional Dependency Analysis).

-the only analysis that is not supported for Hive is Time Correlation Analysis as the Date data type does not exist in Hive. For further information on this analysis type, see Time correlation analyses.

Also, all right-click options on analysis results that generate Jobs to validate, standardize or deduplicate data are still not supported for Hive. For further information about these Jobs, see Data Cleansing.

Hive and HBase:

When you select to connect to a Hive or a HBase database to create and execute different analyses, then in the connection wizard, you must, as explained above, select from the Distribution list the platform that hosts Hive or HBase.

If the Hadoop distribution to be used is Hortonworks Data Platform V2.0 (YARN), you must set the following parameter in the Hadoop Properties table:

  • The parameter is

    yarn.application.classpath
  • The value is

    /etc/hadoop/conf,/usr/lib/hadoop/,/usr/lib/hadoop/lib/,/usr/lib/hadoop-hdfs/,/usr/lib/hadoop-hdfs/lib/,/usr/lib/hadoop-yarn/,/usr/lib/hadoop-yarn/lib/,/usr/lib/hadoop-mapreduce/,/usr/lib/hadoop-mapreduce/lib/

Catalogs and schemas in database systems

The structure of a database defines how objects are organized in the database. Different data storage structures are used to store objects in databases. For example, the highest-level structure (such as "Catalog" followed by "Schema" and finally by "Table") is not applicable to all database types.

The table below describes the structure of some databases in terms of catalog and schemas:

Database name

VersionCatalogSchema

Oracle

 noyes

MySQL

 yesno

SQLServer

2000/2005/2008

yesyes

DB2

 noyes

DB2 ZOS

 noyes

Sybase

 yesyes

Informix

 yesyes

PointBase

 noyes

PostgreSQL

 yesyes

AS/400

V5R4

yesyes

Ingres

 noyes

Teradata

 noyes

Netezza

 yesyes

SQLite

 nono