MDM Staging Area Validation

author
Talend Documentation Team
EnrichVersion
6.4
6.3
6.2
6.1
6.0
EnrichProdName
Talend MDM Platform
Talend Data Fabric
task
Data Governance > Validating data
EnrichPlatform
Talend MDM Server

MDM Staging Area Validation

MDM staging area, a newly introduced feature in Release 5.2, is where MDM users can load data that MDM will asynchronously validate by background processes. It is only available for SQL storage and cannot be enabled for XML storage. The staging area is actually a mirror of the SQL storage used to store master data records. For each data container (a data container being an instance of a SQL storage), there is a second SQL storage being the staging area for the data container. For example, if you have a data container named "Product", there is an implicit data container named "Product#STAGING".

There are no huge differences for the database schema between Product" and "Product#STAGING". Their differences lie in that the staging area has:

  • No FK relation: All FK relations are disabled, so users can load data in the staging area without taking into account the relationships between entities. In a sense, integrity constrains for the staging area are the same as those for the XML database.
  • Additional column for source: Database schema has an additional text column where MDM users can provide additional details about the origin of the record. This column is a free-form and non-mandatory column. When loading data in the staging area, MDM users may indicate that the record Product with id "1" comes from SAP and Product with id "2" comes from another legacy system. This value could be used to create the "source" value for a Data Stewardship Console (DSC) task. Note that this is not supported in Release 5.2 or 5.3.
  • Additional column for status: This is an important column used to store actions done by MDM on this staging record. This column contains a code, whose values are intentionally similar to HTTP return codes and are described below:
    • "000" or null: indicates that the record is a "new record".
    • "2nn" values (200, 201...): indicates that the record has successfully passed the record validation process.
    • "4nn" values (400, 401...): indicates that the record failed to pass the record validation process.

From Release 5.4, MDM components include a list box where you can choose between "Staging" and "Master".

Before Release 5.4, you can suffix a container name with "#STAGING". Besides, the suffix "#STAGING" is case sensitive, that is, "#staging", "#sTaGiNg"... are not expected to work.

How to load data in the staging area?

MDM users can load data in the staging area in one of the following ways:

  • Use SQL components to execute INSERT statements on the database (using JDBC or DI components).
  • Use MDM components tMDMOuput or tMDMBulkload with the data container name being "data_container_name#STAGING" instead of "data_container_name”. For example, "Product#STAGING" instead of "Product". Note that tMDMOutput does not allow insertion or update of invalid data. If you need to disable the validation checking of data records, use tMDMBulkload with the “Validate” option set to false .

How to transfer data from the staging area to the master database?

MDM provides both a UI and a REST access to trigger the transfer of data from the staging area to the master database. The transfer is called "staging area validation task" because it includes a step where records from the staging area are validated against MDM validation rules (XSD, Security, Validation rules). For more information, see the following sections related to the staging area validation.

Staging area validationIntroduction

Once the staging area is filled with new records, MDM users may want to transfer data from the "staging area" to the "master data record area". This can be easily done by creating a "Staging area task". A staging area task works on a given data container and sequentially performs the following operations:

  • Identify similar records in the staging area, and group them to create a "cluster" of record. This phase is called "cluster identification".
  • Merge records of a given cluster to create a unique record. For most cases, this can easily be done, for example, the cluster size may be equal to 1. For advanced cases, this phase will mark records of the cluster as records to be merged with a DSC task.
  • Validate the merged record from the previous phase using a standard save operation and the merged record will then go through the save operation. Staging area task will mark records of the cluster either as "valid" or "invalid" depending on the result of the save operation.
  • Create DSC tasks for records that need human interaction to be merged.

For each phase, the staging area task changes the value of the "status" column:

  • 000: New record. This is the default value when a new row is inserted into the staging area.
  • 201: Record passed the "cluster identification" phase successfully.
  • 202: Record passed the "merge cluster" phase. It was grouped by other records and was used to create a golden record.
  • 203: Record passed the "merge cluster" phase successfully but automatic merge could not make a trusted golden record because the confidence score in the golden record is too low.
  • 204: Record passed the "merge cluster" phase successfully and this record is the unique (golden) record of the cluster. This record is the one used for MDM validation.
  • 205: Record passed the "MDM Validation" phase successfully. This record then also exists in the master database.
  • 206: Record was deleted.
  • 207: (internal) Record was merged using a DSC task resolution.
  • 208: (internal) Record needs a rematch.
  • 401: Record failed to pass the "cluster identification" phase.
  • 402: Record failed to pass the "merge cluster" phase.
  • 403: Record failed to pass the "MDM validation" phase, due to a validation issue against the data model.
  • 404: Record failed to pass the "MDM validation" phase, due to a constraint issue, for example, an FK constraint issue.
  • 405: Record failed to be deleted due to a constraint issue, for example, an FK constraint issue.

All of the status codes are constants in the interface com.amalto.core.storage.task.StagingConstants (org.talend.mdm.core).

The status field is an integer in the database.

You can perform a SQL query such as

SELECT x_id FROM PRODUCT WHERE x_talend_staging_status > 200 AND x_talend_staging_status < 400

, and it will return all Product records that did not fail to pass any of the phases during the staging record validation process.

A staging area task is always run in the background. Note that there is an internal API to start a task and wait for its end.

Once the staging area task is started, all staging task execution statistics (such as the data container it runs on, how many records are validated, and how many records are left) are stored in the staging area database in the table "TALEND_TASK_EXECUTION".

Since there is no scheduling inside MDM, MDM users can start a staging validation task when they need to. For example, an MDM user may use Talend Administration Center (TAC) to schedule a staging validation task.

MDM users can use the web application dedicated to the staging area or use a Job with a tREST component to call the correct REST API. See the API section for more information.

LimitationsRecursive entities

For entities that have FK to itself (for example, Person might have an FK relation 'is child of' to Person), the validation process does not ensure the correct insertion order, so the validation of such records may fail even though the data integrity is correct.

To work around this issue, it is recommended to disable FK integrity checks for the FK field in the data model editor.

The validation process still guarantees the correct insertion order when working with different entities. For instance, if "Person" has a FK 'address' to the entity "Address", all "Address" instances will be validated before all "Person" instances.

Unresolved foreign keys

In the staging area, MDM users can insert invalid values for FK. For example, the FK column "Address" in the "Person" entity may point to an incorrect id, which may be an ID that does not exist in "Address" instances.

The staging area allows invalid values for FK (users may insert later on an "Address" instance with the unresolved ID). However, during validation, invalid foreign keys are silently ignored.

So if the following record is in staging area:

  • Person (1)
    • Id: 1
    • Name: A Person name
    • Address: 9999 -> address with id 9999 does not exist

The following record will be inserted into the master database:

  • Person (2)
    • Id: 1
    • Name: A Person name
    • Address: null

In this case, Person record (1) will be marked as valid even if FK is incorrect because record (2) in master database has a valid empty FK.

If the column “Address” in the “Person” entity is defined with minOccurs=1 in the data model, an empty FK will raise an error (Address FK cannot be null). In this case, record (1) will have an invalid status and record (2) will not exist in the master database.

Order of records

The staging task validation of records will be performed in an order that guarantees no FK constraint will be broken. The task works a list of entities ordered based on dependencies between entities.

Here, a dependency refers to an FK to another type (a dependency to itself is not considered as one), where the FK has "FKIntregrity=true". MDM users can indicate in the data model that they expect a constraint to validate the FK value. A dependency is also either an FK declared in the type or an inherited dependency.

If you have the following dependencies:

A -depends-> B -depends-> C

The task will work first on records of type "A", then B and finally C ([ "A", "B", "C" ]).

If you have the following dependencies:

A-depends->B-depends->C

D -extends-> B

The order will be [ "A", "B", "D", "C" ]. Since "D" inherits from "B", "D" has an inherited dependency to "C".

Consequently, circular dependencies are not acceptable. For example, the dependencies A -depends-> B -depends-> C -depends-> A... failed to be loaded.

To resolve the issue, use the "FKIntegrity" value on an FK. The dependencies A -depends-> B -depends-> C –depends (FKIntegrity=false)-> A... works fine.

Dependency sorting is expected to run in linear time (O(n+p) where n is the number of entities in the data model and p the number of relationships between the entities). Therefore, processing records is expected to run in linear time, depending on the data model complexity.

Performance tweaks

You can tweak the staging area validation performance with the following mdm.conf properties:

mdm.conf property Type Description
staging.validation.updatereport boolean When it is set to "false", the update report creation is disabled during staging area validation. Setting it to "false" has a huge impact on performance since MDM will not look for beforeSaving processes to run. If you do not need beforeSaving process to be run during staging area validation, we recommend that you set it to "false".

By default, its value is "true".

staging.validation.pool int Indicates how many threads will perform MDM validation of records. By default, two threads are dedicated to record validation. You may increase this value if the machine running the MDM server has some unused CPU.
staging.validation.commit int (5.3 only) Tells MDM how large a MDM validation transaction can be. By default, MDM commits records every time a transaction holds 1000 validated objects. You may increase this value if you wish to use bigger transactions and this will limit commits on database.
staging.validation.buffer.threshold int

A buffer is used to transfer records from the staging area to the master database. Reading from the staging area is always faster than writing to master database, so the buffer size can be limited to avoid memory issues.

By default, the buffer will hold a maximum of 1000 records. When the threshold is reached, the reading from the staging area will be paused and the buffer will be checked every second to see if the buffer size decreased.

This property does not directly affect validation performance in terms of records/sec. Specifically, it prevents high memory usage, and this has an impact on garbage collector, so it can indirectly affect records/sec performance.

URL for REST APIIntroduction

For an example of the REST API usage, retrieve the rest_api_example.zip archive from the Downloads tab in the left panel of this page.

The MDM server exposes a REST interface that you can use to create/monitor/edit a staging task.

Depending on the "Accept" value in the HTTP request (see http://www.wikipedia.org/wiki/Hypertext_Transfer_Protocol for more information about the "Accept" value), the service might respond in different formats. Supported formats are "text/xml", "application/json", and "application/xml".

Depending on the product version you are using, the entry point is different.

All operations are described using this convention:

Operation name (1)

HTTP_Request HTTP URL sample (2)

text/xml (3)

Example of XML response

application/json (4)

Example of JSON response

(1) Quick summary of what the operation does.

(2) HTTP command (GET / POST / DELETE / PUT) and a URL sample.

(3) and (4): Samples of responses that depend on the Accept header in the HTTP request. For example, for the HTTP request with the Accept header of "Accept: text/xml", XML documents will be returned.

Staging area validation - REST API description

Note that the URLs for Release 5.X and Release 6.X are different. Make sure you use the correct URL according to the product version.

Read staging container summary (default)

GET

text/xml

<?xml version="1.0" encoding="UTF-8" 
standalone="yes"?><staging><data_container>TestDataContainer</data_container><data_model>TestDataModel</data_model><invalid_records>1000</invalid_records><total_records>10000</total_records><valid_records>8000</valid_records><waiting_validation_records>1000</waiting_validation_records></staging>

application/json

{"staging":{"data_container":"TestDataContainer","data_model":"TestDataModel","invalid_records":1000,
"total_records":10000,"valid_records":8000,"waiting_validation_records":1000}}

Read staging container summary

GET

text/xml

<?xml version="1.0" encoding="UTF-8" 
standalone="yes"?><staging><data_container>TestDataContainer</data_container><data_model>TestDataModel</data_model><invalid_records>1000</invalid_records><total_records>10000</total_records><valid_records>8000</valid_records><waiting_validation_records>1000</waiting_validation_records></staging>

application/json

{"staging":{"data_container":"TestDataContainer","data_model":"TestDataModel","invalid_records":1000,
"total_records":10000,"valid_records":8000,"waiting_validation_records":1000}}

Read list of executions - no paging

GET

text/xml

<executions><execution>fa011993-648f-48b3-9e4d-9c71de82f91a</execution><execution>4ad4e1c7-7769-45c1-90ad-16b54aa0262b</execution></executions>

application/json

 {executions:["fa011993-648f-48b3-9e4d-9c71de82f91a","4ad4e1c7-7769-45c1-90ad-16b54aa0262b"]}

Read list of executions - with paging

GET

text/xml

<executions><execution>fa011993-648f-48b3-9e4d-9c71de82f91a</execution><execution>4ad4e1c7-7769-45c1-90ad-16b54aa0262b</execution></executions>

application/json

 {executions:["fa011993-648f-48b3-9e4d-9c71de82f91a","4ad4e1c7-7769-45c1-90ad-16b54aa0262b"]}

Get execution details

GET

text/xml

<?xml version="1.0" encoding="UTF-8" 
standalone="yes"?><execution><end_date>2012-08-02T11:20:07.188+02:00</end_date><id>fa011993-648f-48b3-9e4d-9c71de82f91a</id><invalid_records>973</invalid_records><processed_records>772.0</processed_records><start_date>2012-08-02T11:20:07.188+02:00</start_date><total_record>772</total_record></execution>

application/json

  {"execution":{"end_date":"2012-08-02T11:20:07.188+02:00","id":"fa011993-648f-48b3-9e4d-9c71de82f91a","invalid_records":973,"processed_records":772,"start_date":"2012-08-02T11:20:07.188+02:00",
"total_record":772}}

Run a validation task

POST

text/xml

  1ad084c1-5f70-4b89-aeef-613e7e44f134

application/json

 1ad084c1-5f70-4b89-aeef-613e7e44f134

You can refine the validation based on a filter, which is provided as an XML document within the body request.

All nodes in the filter are optional.

Below is a filter example:

<config>
  <start-date>1447929165000</start-date>
  <end-date>1447973999000</end-date>
  <status-codes>
     <code>204</code>
     <code>404</code>
  </status-codes>
  <concepts>
    <concept>Product</concept>
    <concept>ProductFamily</concept>
  </concepts>
</config>

Get current validation task status

GET

text/xml

<?xml version="1.0" encoding="UTF-8" 
standalone="yes"?><execution><id>1ad084c1-5f70-4b89-aeef-613e7e44f134</id><invalid_records>5</invalid_records><processed_records>10.0</processed_records><start_date>2012-08-02T11:20:16.887+02:00</start_date><total_record>10000</total_record></execution>

application/json

{"execution":{"id":"1ad084c1-5f70-4b89-aeef-613e7e44f134","invalid_records":10,"processed_records":20,"start_date":"2012-08-02T11:20:16.887+02:00",
"total_record":10000}}

Cancel current validation task

DELETE

text/xml

null

application/json

null

Get current validation task status (after cancel)

GET

text/xml

null

application/json

null

Check if current data container supports staging

GET

Response:

text/xml

true  (if the current data container supports staging)

or

false  (if current data container does not support staging)