MDM Staging Area Validation
There are no major differences in database schema between "Product" and "Product#STAGING". The differences are that the staging area has:
- No FK relation: All FK relations are disabled, so users can load data in the staging area without taking into account the relationships between entities. In a sense, integrity constraints for the staging area are the same as those for the XML database.
- Additional column for source: The database schema has an additional text column where MDM users can provide details about the origin of the record. This column is free-form and optional. For example, when loading data in the staging area, MDM users may indicate that the record Product with id "1" comes from SAP and Product with id "2" comes from another legacy system. This value could be used to create the "source" value for a Talend Data Stewardship (TDS) task.
- Additional column for status: This is an important column used to store actions done by MDM on this staging record. This column contains a code, whose values are intentionally similar to HTTP return codes and are described below:
- "000" or null: indicates that the record is a "new record".
- "2nn" values (200, 201...): indicates that the record has successfully passed the record validation process.
- "4nn" values (400, 401...): indicates that the record failed to pass the record validation process.
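The status ranges above can be expressed as a small helper. The following Python sketch is illustrative only: the function name classify_staging_status and the labels it returns are assumptions and are not part of MDM.

```python
from typing import Optional


def classify_staging_status(status: Optional[int]) -> str:
    """Map a staging status code to a coarse category.

    Hypothetical helper: the name and the returned labels are illustrative.
    """
    if status is None or status == 0:
        return "new"        # "000" or null: new record
    if 200 <= status < 300:
        return "success"    # 2nn: record passed the validation process
    if 400 <= status < 500:
        return "failure"    # 4nn: record failed the validation process
    return "unknown"        # ranges not described in this document


print(classify_staging_status(None))  # new
print(classify_staging_status(205))   # success
print(classify_staging_status(403))   # failure
```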
MDM components include a list box where you can choose between "Staging" and "Master".
How to load data in the staging area?
MDM users can load data in the staging area in one of the following ways:
- Use SQL components to execute INSERT statements on the database (using JDBC or DI components).
- Use the MDM components tMDMOutput or tMDMBulkload with the data container name set to "data_container_name#STAGING" instead of "data_container_name". For example, use "Product#STAGING" instead of "Product". Note that tMDMOutput does not allow insertion or update of invalid data. If you need to disable the validation of data records, use tMDMBulkload with the "Validate" option set to false.
How to transfer data from the staging area to the master database?
MDM provides both a UI and REST access to trigger the transfer of data from the staging area to the master database. The transfer is called a "staging area validation task" because it includes a step where records from the staging area are validated against MDM validation rules (XSD, security, validation rules). For more information, see the following sections related to the staging area validation.
Staging area validation
Introduction
Once the staging area is filled with new records, MDM users may want to transfer data from the "staging area" to the "master data record area". This can be easily done by creating a "Staging area task". A staging area task works on a given data container and sequentially performs the following operations:
- Identify similar records in the staging area and group them to create a "cluster" of records. This phase is called "cluster identification".
- Merge the records of a given cluster to create a unique record. In most cases this is straightforward, for example when the cluster size is 1. In advanced cases, this phase marks the records of the cluster as records to be merged with a TDS task.
- Validate the merged record from the previous phase using a standard save operation. The staging area task marks the records of the cluster as either "valid" or "invalid" depending on the result of the save operation.
- Create TDS tasks for records that need human interaction to be merged.
For each phase, the staging area task changes the value of the "status" column:
- 000: New record. This is the default value when a new row is inserted into the staging area.
- 201: Record passed the "cluster identification" phase successfully.
- 202: Record passed the "merge cluster" phase. It was grouped with other records and was used to create a golden record.
- 203: Record passed the "merge cluster" phase successfully but automatic merge could not make a trusted golden record because the confidence score in the golden record is too low.
- 204: Record passed the "merge cluster" phase successfully and this record is the unique (golden) record of the cluster. This record is the one used for MDM validation.
- 205: Record passed the "MDM Validation" phase successfully. This record then also exists in the master database.
- 206: Record was deleted.
- 207: (internal) Record was merged using a TDS task resolution.
- 208: (internal) Record needs a rematch.
- 401: Record failed to pass the "cluster identification" phase.
- 402: Record failed to pass the "merge cluster" phase.
- 403: Record failed to pass the "MDM validation" phase, due to a validation issue against the data model.
- 404: Record failed to pass the "MDM validation" phase, due to a constraint issue, for example, an FK constraint issue.
- 405: Record failed to be deleted due to a constraint issue, for example, an FK constraint issue.
All of the status codes are constants in the interface com.amalto.core.storage.task.StagingConstants (org.talend.mdm.core).
The status field is an integer in the database.
You can perform a SQL query such as the following:
SELECT x_id FROM PRODUCT WHERE x_talend_staging_status > 200 AND x_talend_staging_status < 400
This query returns all Product records that have been processed without failing any phase of the staging record validation process.
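To try the query without an MDM database, the sketch below recreates a minimal PRODUCT table in an in-memory SQLite database. Only the x_id and x_talend_staging_status columns come from the query above. The table layout, the sample rows, and the use of SQLite are assumptions made for illustration; a real staging table has more columns.

```python
import sqlite3

# Assumption: a two-column stand-in for the real staging table.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE PRODUCT (x_id TEXT PRIMARY KEY, x_talend_staging_status INTEGER)"
)

# Illustrative rows: new (0 and NULL), passed (201, 205), failed (403).
sample_rows = [("1", 0), ("2", 201), ("3", 205), ("4", 403), ("5", None)]
conn.executemany("INSERT INTO PRODUCT VALUES (?, ?)", sample_rows)

query = (
    "SELECT x_id FROM PRODUCT "
    "WHERE x_talend_staging_status > 200 AND x_talend_staging_status < 400 "
    "ORDER BY x_id"
)
passed_ids = [row[0] for row in conn.execute(query)]
print(passed_ids)  # ['2', '3']
conn.close()
```

Rows whose status is 0 or NULL are excluded because neither satisfies the "> 200" comparison.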
A staging area task is always run in the background. Note that there is an internal API to start a task and wait for its end.
Once the staging area task is started, all staging task execution statistics (such as the data container it runs on, how many records are validated, and how many records are left) are stored in the staging area database in the table "TALEND_TASK_EXECUTION".
Since there is no scheduling inside MDM, MDM users can start a staging validation task when they need to. For example, an MDM user may use Talend Administration Center (TAC) to schedule a staging validation task.
MDM users can use the web application dedicated to the staging area or use a Job with a tREST component to call the correct REST API. See the API section for more information.
Limitations
Recursive entities
For entities that have an FK to themselves (for example, Person might have an FK relation 'is child of' to Person), the validation process does not ensure the correct insertion order, so the validation of such records may fail even though the data integrity is correct.
To work around this issue, it is recommended to disable FK integrity checks for the FK field in the data model editor.
The validation process still guarantees the correct insertion order when working with different entities. For instance, if "Person" has an FK 'address' to the entity "Address", all "Address" instances will be validated before all "Person" instances.
Unresolved foreign keys
In the staging area, MDM users can insert invalid values for an FK. For example, the FK column "Address" in the "Person" entity may point to an incorrect ID, that is, an ID that does not exist among the "Address" instances.
The staging area allows invalid values for FK (users may insert later on an "Address" instance with the unresolved ID). However, during validation, invalid foreign keys are silently ignored.
So if the following record is in staging area:
- Person (1)
- Id: 1
- Name: A Person name
- Address: 9999 -> address with id 9999 does not exist
The following record will be inserted into the master database:
- Person (2)
- Id: 1
- Name: A Person name
- Address: null
In this case, Person record (1) will be marked as valid even though its FK is incorrect, because record (2) in the master database has a valid empty FK.
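The behavior can be mimicked with a short Python sketch. This is not MDM code: the function validate_person, its parameters, and the dictionary representation of a record are assumptions used only to illustrate the rule. The address_required flag stands for an FK declared as mandatory (minOccurs=1) in the data model.

```python
from typing import Optional, Set, Tuple


def validate_person(
    record: dict, address_ids: Set[str], address_required: bool = False
) -> Tuple[bool, Optional[dict]]:
    """Return (is_valid, master_record) for a staged Person record.

    Hypothetical model of the rule described above, not the MDM implementation.
    """
    master = dict(record)
    if master.get("Address") not in address_ids:
        master["Address"] = None  # unresolved FK is silently ignored
    if address_required and master["Address"] is None:
        return False, None        # mandatory FK cannot be null: nothing is saved
    return True, master


staged = {"Id": "1", "Name": "A Person name", "Address": "9999"}

print(validate_person(staged, address_ids={"1", "2"}))
# (True, {'Id': '1', 'Name': 'A Person name', 'Address': None})

print(validate_person(staged, address_ids={"1", "2"}, address_required=True))
# (False, None)
```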
If the column "Address" in the "Person" entity is defined with minOccurs=1 in the data model, an empty FK will raise an error (Address FK cannot be null). In this case, record (1) will have an invalid status and record (2) will not exist in the master database.
Order of records
The staging task validates records in an order that guarantees that no FK constraint will be broken. The task works on a list of entities ordered according to the dependencies between entities.
Here, a dependency refers to an FK to another type (an FK from a type to itself is not considered a dependency) where the FK has "FKIntegrity=true", that is, where MDM users have indicated in the data model that they expect a constraint to validate the FK value. A dependency is either an FK declared in the type or an inherited dependency.
If you have the following dependencies:
A -depends-> B -depends-> C
The task will work first on records of type "A", then B and finally C ([ "A", "B", "C" ]).
If, in addition, you have the following inheritance:
D -extends-> B
The order will be [ "A", "B", "D", "C" ]. Since "D" inherits from "B", "D" has an inherited dependency to "C".
Consequently, circular dependencies are not acceptable. For example, the dependencies A -depends-> B -depends-> C -depends-> A... fail to load.
To resolve the issue, set "FKIntegrity" to false on one of the FKs in the cycle. For example, the dependencies A -depends-> B -depends-> C -depends (FKIntegrity=false)-> A... work fine.
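As an illustration of why disabling FK integrity on one FK breaks the cycle, the Python sketch below detects circular dependencies with Kahn's algorithm. It is not the MDM implementation: the function name, the tuple representation of an FK, and the choice of algorithm are assumptions. Only the rules it encodes (self-references and FKs with FKIntegrity=false are not dependencies) come from this document.

```python
from collections import deque
from typing import Dict, List, Tuple

# Assumed representation of an FK: (from_entity, to_entity, fk_integrity).
ForeignKey = Tuple[str, str, bool]


def has_circular_dependency(entities: List[str], foreign_keys: List[ForeignKey]) -> bool:
    """Detect a dependency cycle with Kahn's algorithm."""
    outgoing: Dict[str, List[str]] = {e: [] for e in entities}
    incoming_count: Dict[str, int] = {e: 0 for e in entities}
    for source, target, fk_integrity in foreign_keys:
        # Self-references and FKs with FKIntegrity=false are not dependencies.
        if source == target or not fk_integrity:
            continue
        outgoing[source].append(target)
        incoming_count[target] += 1

    ready = deque(e for e in entities if incoming_count[e] == 0)
    processed = 0
    while ready:
        entity = ready.popleft()
        processed += 1
        for target in outgoing[entity]:
            incoming_count[target] -= 1
            if incoming_count[target] == 0:
                ready.append(target)
    # Entities never processed belong to, or are reachable from, a cycle.
    return processed != len(entities)


entities = ["A", "B", "C"]
cycle = [("A", "B", True), ("B", "C", True), ("C", "A", True)]
relaxed = [("A", "B", True), ("B", "C", True), ("C", "A", False)]

print(has_circular_dependency(entities, cycle))    # True
print(has_circular_dependency(entities, relaxed))  # False
```

Each entity and each FK is visited a bounded number of times, so the check runs in time proportional to the number of entities plus the number of FKs.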
Dependency sorting is expected to run in linear time, O(n+p), where n is the number of entities in the data model and p is the number of relationships between the entities. Therefore, processing records is expected to run in linear time, depending on the data model complexity.
Performance tweaks
You can tweak the staging area validation performance with the following mdm.conf properties:
|staging.validation.updatereport||boolean||When it is set to "false", the update report creation is disabled during staging area validation. Setting it to "false" has a huge impact on performance since MDM will not look for beforeSaving processes to run. If you do not need beforeSaving process to be run during staging area validation, we recommend that you set it to "false". By default, its value is "true".|
|staging.validation.pool||int||Indicates how many threads will perform MDM validation of records. By default, two threads are dedicated to record validation. You may increase this value if the machine running the MDM server has some unused CPU.|
|staging.validation.commit||int||Tells MDM how large an MDM validation transaction can be. By default, MDM commits records every time a transaction holds 1000 validated objects. You may increase this value if you wish to use bigger transactions, which reduces the number of commits on the database.|
A buffer is used to transfer records from the staging area to the master database. Reading from the staging area is always faster than writing to master database, so the buffer size can be limited to avoid memory issues.
By default, the buffer will hold a maximum of 1000 records. When the threshold is reached, the reading from the staging area will be paused and the buffer will be checked every second to see if the buffer size decreased.
The buffer threshold does not directly affect validation performance in terms of records/sec. Its purpose is to prevent high memory usage. Because memory usage has an impact on the garbage collector, it can indirectly affect records/sec performance.
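The effect of a bounded buffer can be sketched in Python with a fixed-size queue. This is an analogy, not the MDM implementation: the names are hypothetical, the threshold is reduced to 5 for readability (the default described above is 1000), and the reader here blocks until space is available instead of checking the buffer every second.

```python
import queue
import threading

BUFFER_THRESHOLD = 5  # illustrative; the default described above is 1000
SENTINEL = None       # marks the end of the staging data


def read_staging(records, buffer):
    """Producer: put() blocks whenever the buffer is full."""
    for record in records:
        buffer.put(record)
    buffer.put(SENTINEL)


def write_master(buffer, master):
    """Consumer: drains the buffer into the simulated master database."""
    while True:
        record = buffer.get()
        if record is SENTINEL:
            break
        master.append(record)


staging_records = list(range(20))
master_records = []
buffer = queue.Queue(maxsize=BUFFER_THRESHOLD)

reader = threading.Thread(target=read_staging, args=(staging_records, buffer))
writer = threading.Thread(target=write_master, args=(buffer, master_records))
reader.start()
writer.start()
reader.join()
writer.join()

print(len(master_records))  # 20
```

However fast the reader is, the buffer never holds more than BUFFER_THRESHOLD records, which is what keeps memory usage bounded.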
The MDM server exposes a REST interface that you can use to create/monitor/edit a staging task.
For more information about the REST APIs for staging area management, see MDM query language and REST data access.
For an example of the REST API usage, retrieve the rest_api_example.zip archive from the Downloads tab in the left panel of this page.
Depending on the "Accept" value in the HTTP request (see http://www.wikipedia.org/wiki/Hypertext_Transfer_Protocol for more information about the "Accept" value), the service might respond in different formats. Supported formats are "text/xml", "application/json", and "application/xml".
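The following Python sketch shows how a client might select the response format through the "Accept" header. The URL is a hypothetical placeholder and not a real MDM endpoint, and build_request is an illustrative helper. The request is only built, not sent, and authentication is omitted.

```python
import urllib.request

# Hypothetical placeholder: take the real endpoint from the REST API documentation.
STAGING_TASK_URL = "http://mdm.example.invalid/placeholder/staging/Product"

# Formats listed above as supported by the service.
SUPPORTED_FORMATS = {"text/xml", "application/json", "application/xml"}


def build_request(url: str, accept: str = "application/json") -> urllib.request.Request:
    """Build a GET request that asks for the given response format."""
    if accept not in SUPPORTED_FORMATS:
        raise ValueError(f"Unsupported Accept value: {accept}")
    return urllib.request.Request(url, headers={"Accept": accept}, method="GET")


request = build_request(STAGING_TASK_URL, accept="application/xml")
print(request.get_header("Accept"))  # application/xml
```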