Connecting to a database - 6.5

Talend Open Studio for Data Quality User Guide

EnrichVersion
6.5
EnrichProdName
Talend Open Studio for Data Quality
task
Data Quality and Preparation
EnrichPlatform
Talend Studio

Before proceeding to analyze data in a specific database, you must first set up the connection to this database. From the Profiling perspective of the studio, you can create a connection on the DataBase Management System (DBMS) and show database content.

The databases you can analyze from the Studio include Hive and Amazon Redshift. For further information about what databases you can profile, see the section about supported databases in the Talend Installation and Upgrade Guide.

Connections to different databases are reflected by different tree levels and different icons in the DQ Repository tree view because the logical and physical structure of data differs from one relational database to another. The highest level structure "Catalog" followed by "Schema" and finally by "Table" is not applicable to all database types. For further information, see Catalogs and schemas in database systems.

How to create a connection

To create a database connection, do the following:

  1. In the DQ Repository tree view, expand Metadata, right-click DB Connections and select Create DB Connection.

    The [Database Connection] wizard opens.

  2. In the Name field, enter a name for this new database connection.

    Do not use spaces in the connection name.

    Note

    Avoid using special characters in the item names including:

    "~", "!", "`", "#", "^", "&", "*", "\\", "/", "?", ":", ";", "\"", ".", "(", ")", "'", "¥", "'", """, "«", "»", "<", ">".

    These characters are all replaced with "_" in the file system and you may end up creating duplicate items.

  3. If required, set other connection metadata (purpose, description and author name) in the corresponding fields and click Next to proceed to the next step.

  4. In the DB Type field and from the drop-down list, select the type of database to which you want to connect. For example, MySQL.

    For further information about supported databases, see the Talend Installation and Upgrade Guide.

    Note

    If you select to connect to a database that is not supported in the studio (using the ODBC or JDBC methods), it is recommended to use the Java engine to execute the column analyses created on the selected database. For more information on column analyses, see Defining the columns to be analyzed and setting indicators, and for more information on the Java engine, see Using the Java or the SQL engine.

  5. In the DB Version field, select the version of the database to which you are creating the connection.

  6. Enter your login, password, server and port information in their corresponding fields.

  7. In the Database field, enter the database name you are connecting to. If you need to connect to all of the catalogs within one connection, if the database allows you to, leave this field empty.

  8. Click the Check button to verify if your connection is successful.

    If you have not already installed the database driver (.jar file) necessary to use the database, you will have a wizard prompting you to install the relative third-party module, click Download and Install and then close the wizard.

    For further information about identifying and installing external modules, see the Talend Installation and Upgrade Guide or click the How to install a driver link in the wizard.

    For further information about the Module view, see Displaying the Module view.

  9. Click Finish to close the [Database Connection] wizard.

    A folder for the created database connection is displayed under DB Connection in the DQ Repository tree view . The connection editor opens with the defined metadata in the studio.

    Once you create the connection, you can open in the studio a preview of the data in a specific database table. For further information, see Previewing data in the SQL editor.

From the connection editor, you can:

  • Click Connection information to show the connection parameters for the relevant database.

  • Click the Check button to check the status of your current connection.

  • Click the Edit... button to open the connection wizard and modify any of the connection information.

For information on how to set up a connection to a file, see Connecting to a file.

How to create a connection from a catalog or a schema

You can create a connection on a database catalog or schema directly from a database connection.

Prerequisite(s): At least one database connection is set in the Profiling perspective of the studio. For further information, see Connecting to a database

  1. In the DQ Repository tree view, expand Metadata > DB Connections and browse to the catalog or schema on which you want to create the connection.

  2. Right-click a catalog or schema and select Create a new connection.

    A confirmation message is displayed.

  3. Click OK.

    A new connection named after the selected connection and catalog is created under DB Connections.

How to create a connection to a custom database

The database connection wizard in the studio lists the databases to which you can create a connection and do profiling processes.

You can still use the studio to connect to a custom "unsupported" database. To do this, you need to choose General JDBC as the database type in the connection wizard and then fill in the connection parameters.

After creating the connection to a custom database, you can profile and monitor data in this database by using different analyses and indicators, as you do with supported databases. But you may need to change, in the Indicator Settings editor, the SQL query template for some indicators, such as the regex indicator where each database has a different function to call. For further information, see How to edit a system indicator and How to edit a user-defined indicator.

Note

If you have a problem profiling a custom database even though you use a JDBC connection, the reason could be that some JDBC functions are not implemented by the JDBC driver library. Please raise an issue or ask support via Talend Community at:

https://community.talend.com/

What you need to know about some databases

Teradata:

In the Teradata database, the regular expression function is installed by default only starting from version 14. If you want to use regular expressions with older versions of this database, you must install a User Defined Function in Teradata and add the indicator definition for Teradata in the studio. To learn more about using regular expressions with Teradata, see the documentation on Talend Help Center (https://help.talend.com).

Netezza:

The Netezza database does not support regular expressions. If you want to use regular expressions with this database, you must:

Hive:

If you select to connect to the Hive database, you will be able to create and execute different analyses as with the other database types.

In the connection wizard, you must select from the Distribution list the platform that hosts Hive. You must also set the Hive version and model. For further information, see http://hadoop.apache.org/.

-If you decide to change the user name in an embedded mode of a Hive connection, you must restart the studio before being able to successfully run the profiling analyses that use the connection. For further information about the use of user credentials with Hive, see the documentation on Talend Help Center (https://help.talend.com).

-If the Hadoop distribution to be used is Hortonworks Data Platform V1.2 or Hortonworks Data Platform V1.3, you must set proper memory allocations for the map and reduce computations to be performed by the Hadoop system. In the second step in the connection wizard:

  1. Click the button next to Hadoop Properties and in the open dialog box click the [+] button to add two lines to the table.

  2. Enter the parameters names as mapred.job.map.memory.mb and mapred.job.reduce.memory.mb.

  3. Set their values to the by-default value 1000.

    This value is normally appropriate for running the computations.

-If the Hadoop distribution to be used is Hortonworks Data Platform V2.0 (YARN), you must set the following parameter in the Hadoop Properties table:

  • The parameter is

    yarn.application.classpath
  • The value is

    /etc/hadoop/conf,/usr/lib/hadoop/,/usr/lib/hadoop/lib/,/usr/lib/hadoop-hdfs/,/usr/lib/hadoop-hdfs/lib/,/usr/lib/hadoop-yarn/,/usr/lib/hadoop-yarn/lib/,/usr/lib/hadoop-mapreduce/,/usr/lib/hadoop-mapreduce/lib/

Note that one analysis type and few indicators and functions are still not supported for Hive, see the table below for more detail:

Unsupported indicators

Unsupported functionsUnsupported analyses

with SQL engine:

-Soundex Low Frequency.

-Pattern(Low) Frequency.

-Upper Quartile and Lower Quartile.

-Median.

- All Date Frequency indicators.

-the View rows contextual menu for column analyses with unique, duplicate and all textual indicators.

For further information on the View rows menu, see Viewing and exporting analyzed data.

-the View match rows contextual menu for column analyses with unique, duplicate and all textual indicators.

For further information on View match rows, see Comparing identical columns in different tables.

-all contextual menus on the analysis results of functional dependency analysis.

For further information on this analysis, see Detecting anomalies in columns (Functional Dependency Analysis).

-the only analysis that is not supported for Hive is Time Correlation Analysis as the Date data type does not exist in Hive. For further information on this analysis type, see Time correlation analyses.

Catalogs and schemas in database systems

The structure of a database defines how objects are organized in the database. Different data storage structures are used to store objects in databases. For example, the highest-level structure (such as "Catalog" followed by "Schema" and finally by "Table") is not applicable to all database types.

The table below describes the structure of some databases in terms of catalog and schemas:

Database name

VersionCatalogSchema

Oracle

 noyes

MySQL

 yesno

SQLServer

2000/2005/2008

yesyes

DB2

 noyes

DB2 ZOS

 noyes

Sybase

 yesyes

Informix

 yesyes

PointBase

 noyes

PostgreSQL

 yesyes

AS/400

V5R4

yesyes

Ingres

 noyes

Teradata

 noyes

Netezza

 yesyes

SQLite

 nono