MDM SQL Storage

EnrichVersion
6.4
6.3
6.2
6.1
6.0
5.6
EnrichProdName
Talend Open Studio for MDM
Talend Data Fabric
Talend MDM Platform
task
Installation and Upgrade
EnrichPlatform
Talend MDM Server

MDM SQL Storage

A new storage for MDM based on SQL allows MDM to store user data on a Relational Database Management System (RDBMS).

MDM SQL storage provides the same services as the Talend XML database (deprecated since version 5.3), while allowing better performance and scalability.

To use MDM SQL storage, you do not need to change Jobs or data models because MDM still communicates with the client using XML documents. The MDM-related components such as tMDMInput, tMDMOuput, and tMDMBulkload are still supported when using a SQL database.

However, MDM stored procedures need to be written using SQL statements since the query is directly executed against the underlying database (when using an XML database, stored procedures were written using XQuery).

For suggestions on how to handle any issues you may encounter with MDM SQL Storage, see Troubleshooting for MDM SQL Storage.

This article applies to Talend MDM v5.2.0 or higher.

Working principles of MDM SQL storage

In MDM, user containers are used to host master and staging data, while system containers are used to store MDM-specific data, for example, the information of users and view definitions.

Starting from version 5.3, there is no difference between system containers and user containers since both can use SQL storage.

All non-user defined data containers are considered as system containers except UpdateReport. Since this container uses a well-defined format, it is stored in SQL databases when SQL storage is enabled.

When MDM receives a request that may be a query, a "create record" request, or an "update record" request, it first identifies whether the request is on a system container or a user container, and then forwards this request to the right storage.

Each SQL storage is independent: it has its own connection pool and configuration. This is important as it directly impacts the number of connections MDM establishes with the SQL databases. If you have two data containers and a minimum of 20 connections, it means MDM will create 40 (20*2) connections during startup.

To support different types of RDBMS, SQL storage uses Hibernate internally to support multiple database types.

For more information about Hibernate, see http://www.hibernate.org/.

Common configurations for using MDM SQL storage

Once the MDM server installation is done, you need to carry out some configuration tasks before using MDM SQL storage.

Procedure

  1. Go to the directory where you installed Talend MDM and open the file mdm.conf under the directory <mdm_jboss_dir>\jboss-4.2.2.GA\server\default\conf.
  2. Make sure that the configuration is correct, as shown below.
    xmlserver.class=com.amalto.core.storage.SQLWrapper
     
    db.autoPrepare=true
     
    db.datasources=datasources.xml
     
    db.default.datasource=MySQL-Default

    These properties are explained in the following table:

    Property

    Description

    db.autoPrepare

    Indicates to MDM whether it should try (or not) to create database on its own. By default, it is true.

    db.datasources

    Points to a file that provides connection information to the database. There is no default value for this property. The installer deploys a file named “datasources.xml”. MDM looks for the file in the following order:

    1. In the folder that system property “jboss.server.home.dir” points to.
    2. At the root of the classpath of MDM application.

    db.default.datasource

    Tells MDM what default datasource in the db.datasources file should be used. You may have different datasources such as “MySQL-Dev” and “MySQL-Prod” and use this property to switch from an environment to another.

    In this example, mdm.conf sets autoPrepare to true, uses the file datasources.xml as datasource configuration and the defaultDataSource is MySQL-Default.

  3. Under the same directory, open the file datasources.xml to check the datasource configuration.
    <datasources xmlns=”http://www.talend.com/mdm/datasources"”>
        <datasource name=”MySQL-DS1> <!-- (1) -->  
            <master>
                <type>RDBMS</type> <!-- (2) -->
                <rdbms-configuration>
                    <dialect>MySQL</dialect> <!-- (3) -->
                    <connection-driver-class>com.mysql.jdbc.Driver</connection-driver-class> <!-- (4) -->
                    <connection-url>jdbc:mysql://10.42.150.15:3306/mdm_dev1</connection-url> <!-- (5) -->
                    <connection-username>root</connection-username> <!-- (6) -->
                    <connection-password>toor</connection-password> <!-- (7) -->
                    <fulltext-index-directory>/var/lucene/indexes/DS1</fulltext-index-directory> <!-- (8) -->
                    <cache-directory>/var/cache/DS1</cache-directory> <!-- (9) -->
                    <contains-optimization>fulltext|disabled|like</contains-optimization> <!-- (10) -->
                    <schema-generation>update</schema-generation>  <!-- (11) -->
                    <properties>
                        <property name="hibernate.show_sql">false</property> <!-- (12)-->
                    </properties>
                    <init>
                        <database-name>mdm_dev1</database-name> <!-- (13) -->
                        <connection-url>jdbc:mysql://10.42.150.15:3306/</connection-url> <!-- (14) -->
                        <connection-username>root</connection-username> <!-- (15) -->
                        <connection-password>toor</connection-password> <!-- (16) -->
                    </init>
                </rdbms-configuration>
            </master>
            <staging>
                <!-- same content: this configures database for the staging area -->
            </staging>
            <system> <!-- New for 5.3 -->
                <!-- same content: this configures database for the system storage -->
            </system>
        </datasource>
    </datasources>

    See below for the explanation of each property in the datasource configuration file:

    • (1) A unique name that identifies the datasource.
    • (2) Tells MDM what kind of data source it is. Only RDBMS value is currently supported.
    • (3) Specifies a dialect for MDM to use specific SQL query generators for the database.
    • (4) JDBC driver class. Note that class has to be present in the MDM server classpath.
    • (5) JDBC connection URL. It is usually dependent on the JDBC driver.
    • (6) Connection user name.
    • (7) Connection password.
    • (8) This SQL storage still allows full text searches and this property tells MDM where the full text indices should be stored on the disk. If this property is missing, the storage will not be able to perform full text search.
    • (9) This property tells MDM where second level cache should store information (second level cache leverages EHCache). If this property is missing, the storage will not use second level cache (this does not bring much improvements since MDM already uses first level cache of Hibernate, this may become handy for large volumes of data with many requests for random records).
    • (10) Type of "contains" query optimization: "fulltext" (default) indicates MDM should use full-text queries for contains when applicable, "like" means MDM should use database "LIKE" statements, and "disabled" means MDM will throw an exception every time a "contains" query is processed.
    • (11) Indicates the type of actions to be taken on the database schema.

      Three values are valid for the schema-generation element:

      Value

      Effect

      update

      Update existing schema: add columns for newly added elements to an entity. This does not remove previous constraints nor previous elements you removed.

      create

      Drop existing tables before re-creating tables. You can use this mode when you are not sure of the data model yet. This removes obsolete constraints and columns as well as previous data.

      validate

      Make no schema change. Ensure that database schema will be able to store MDM records. In case of error(s), initialization of SQL storage fails. Check your server's log or console for more information.

      When <schema-generation> is set to "create", schema creation errors do not stop SQL storage initialization: Hibernate generates SQL statements that fail when you start on an empty database. To see these errors, enable Log4j category "com.amalto.core.storage" at the DEBUG level.

    • (12) This example indicates whether to show SQL statements on the console or not. If it is set to true, all the SQL statements will be written to the console. You may enable this option for troubleshooting purposes. Note that verbose information will be output after you enable this option.
    • (13) Name of the database: this is used during storage initialization when the autoPrepare option is true in the file mdm.conf. MDM will try to create a database named after this value on storage initialization.
    • (14) (15) (16) When autoPrepare is enabled, those credentials are used to create a database named after (13).

    Everything contained in the “properties” is passed “as is” to Hibernate configuration. All properties in this configuration section overrides the configuration MDM generates (no validation or control is performed on the values passed here). You may consult Hibernate documentation to get more information on the properties you can set.

Using variables in the datasource configuration file

You may use variables in the datasource configuration.

For example, use the variable ${container} in the datasources configuration.

<datasources xmlns=”http://www.talend.com/mdm/datasources"”>
    <datasource name=”MySQL-DS1?>
        <type>RDBMS</type>
        <rdbms-configuration>
            <connection-url>jdbc:mysql://10.42.150.15:3306/${container}</connection-url>
            <database-name>${container}</database-name>
            <init>
                <database-name>${container}</database-name>
                <connection-url>jdbc:mysql://10.42.150.15:3306/</connection-url>
                <connection-username>root</connection-username>
                <connection-password>toor</connection-password>
            </init>
        </rdbms-configuration>
    </datasource>
</datasources>

The variable ${container} will be replaced by the data container name when it is being created.

For example, when a container "Product" is created, the <connection-url> element will be "jdbc:mysql:10.42.150.15:3306/Product". The <database-name> element will be replaced by the value "Product" too.

The <connection-username> element also accepts the ${container} variable.

Using multiple database vendors in the datasource configuration file

You can use different database vendors for the different MDM storages.

For example, MySQL may be used for the user data, H2 for staging and Oracle for system objects.

<datasources xmlns="http://www.talend.com/mdm/datasources">
    <datasource name="RDBMS-1">
        <master>
            <type>RDBMS</type>
            <rdbms-configuration>
                <dialect>MySQL</dialect>
                <connection-driver-class>com.mysql.jdbc.Driver</connection-driver-class>
                <connection-url>jdbc:mysql://localhost:3306/${container}</connection-url>
                <connection-username>root</connection-username>
                <connection-password>toor</connection-password>
                <fulltext-index-directory>/var/lucene/indexes/DS1</fulltext-index-directory>
                <init>
                    <connection-url>jdbc:mysql://localhost:3306/</connection-url>
                    <connection-username>root</connection-username>
                    <connection-password>toor</connection-password>
                    <database-name>${container}</database-name>
                </init>
             </rdbms-configuration>
        </master>
        <staging>
            <type>RDBMS</type>
            <rdbms-configuration>
                <dialect>H2</dialect>
                <connection-driver-class>org.h2.Driver</connection-driver-class>
                <connection-url>jdbc:h2:target/data/h2_staging_ds2;MVCC=true</connection-url>
                <schema-generation>create</schema-generation>
            </rdbms-configuration>
        </staging>
        <system>
            <type>RDBMS</type>
            <rdbms-configuration>
                <dialect>Oracle11g</dialect>
                <connection-driver-class>oracle.jdbc.driver.OracleDriver</connection-driver-class>
                <connection-url>jdbc:oracle:thin:@fakehost:1521:XE</connection-url>
                <connection-username>SYSTEM</connection-username>
                <connection-password>password</connection-password>
            </rdbms-configuration>
        </system>
    </datasource>
</datasources>

Database Specific Configurations for Using MDM SQL Storage

Since MDM supports different types of databases, you can perform specific configurations for each of them.

SQL Server 2008

SQL Server 2008 can rapidly run into concurrent issues with deadlocks upon insert or updates. This kind of issue appears very rapidly, and it does not require a heavy load.

To fix this issue, you need to configure your SQL Server instance. Run the following SQL:

ALTER DATABASE <database name>   SET READ_COMMITTED_SNAPSHOT ON
  WITH ROLLBACK IMMEDIATE;

Using MDM SQL storage

Procedure

  1. Ensure that the configurations are correct depending on the database you are using.
  2. In Talend MDM studio, deploy a data model named Product.
    The name of the data container must be the same as that of the data model. To initialize a SQL storage, it needs the metadata of the entities it will manage since the database schema is derived from the entities. There is currently no mapping between a data container and the data model(s) it will manage, so the naming convention must be followed.
  3. Create a data container Product and deploy it to the server.
    When using a studio of version later than 5.3, you may select both data model and data container in the repository and deploy them to the server. MDM will then re-order the deployments in correct order (first data model and then data container).