Best Practices: MDM Modelling - 7.0

author
Adam Pemble
EnrichVersion
7.0
EnrichProdName
Talend Data Fabric
Talend MDM Platform
Talend Open Studio for MDM
task
Data Governance > Modeling data
EnrichPlatform
Talend MDM Server
Talend Studio

Best Practices: MDM Modelling

This article defines the best practices to adopt when creating a physical MDM model with Talend.

Overview

This article serves as a guide to the techniques, standards and best practices that should be employed when creating a physical MDM model with Talend. This article is targeted at the commercial version of Talend, but a subset of the functionality can be found in the Open Source product.

The following best practices have been compiled from many years of experience in implementing Talend MDM by the Talend Professional Services team. For further assistance, please contact your Talend Account Manager.

This article does not attempt to outline how a given domain (for example, Customer) should be modelled. The physical MDM model design is driven by :

  • An organization-wide consensus of the definition of each master data entity – this is usually not just the model as defined by an individual source system. The driving force for obtaining this consensus should be the data governance group, with representation from all stakeholders (business and IT). The result of this initiative is a set of conceptual and logical models which are used to help design the MDM physical model.
  • Data profiling and analysis.
  • The business processes that the organization requires to be enacted upon the MDM model. This most likely includes business processes that do not exist in the organization before the MDM project. Thus, business process analysis (current and desired) is a critical pre-requisite to the implementation of the right MDM physical model design.

Once the organization is ready to begin creation of an MDM physical model, this guide should be employed to ensure a consistent and robust implementation.

Prerequisites

  • You have completed the Talend MDM Modelling Training class.

Definitions

This section defines the terms used when creating MDM models.
Term Definition
Domain A set of related entities that collectively describe a type of master data within an organization, for example Customer, Product or Supplier.
Multi-domain MDM tool An MDM product, such as Talend, that is not specifically created to focus on a single domain and can therefore master multiple/any domains, concurrently if required.
Entity In Talend, entity generally refers to a set of rules that a single record within Talend MDM must conform to. However, in the industry the term is often used interchangeably with what is defined as a super-entity – a set of entities with relations (often one to many) that collectively define an atomic unit of something. For example, it is common for a 'customer' (super-)entity to be physically modelled using multiple entities in the model – 'customer', 'address', 'customerXref', etc. The term entity is also sometimes used in the industry to describe a physical instance of an entity – for clarity, this article uses the term record for this.
Record An instance of an entity – a physical record as represented at the MDM application level. This is an XML document.
MDM Physical Storage A human readable, third normal form representation of the data held in the MDM hub. This is the physical storage of Talend MDM and is held in a relational database system. Normally, the data is not accessed at this layer, but rather via the MDM Application layer, which provides a layer of abstraction over the requirements of physical storage and governance rules. It must never be written to directly. Writes must go through the application layer, but read is allowed in limited circumstances.
Element An XML element is the building block of an XML document. Each XML document contains one or more elements, the scope of which is either delimited by start and end tags, or by an empty-element tag for empty elements. Thus, in Talend MDM terminology, an element may be defined as being a simple type, in which case it holds data (or a foreign key), or as being a complex type, which holds other elements.
XML Simple Type (adapted from W3Schools) A simple element is an XML element that can contain only text. It cannot contain any other elements or attributes. It can be one of the types included in the XML Schema definition: boolean, string, date, etc., or it can be a custom type that you can define yourself by extending the definition of one of the built-in types.
XML Complex Type (from W3Schools) A complex type element is an XML element that contains other elements and/or attributes.
Lower Camel Case Camel case is the practice of writing compound words or phrases so that each next word or abbreviation begins with a capital letter. With the Lower variation of the Camel Case, the first word/letter is lowercase, for example: 'iPhone', 'customer', 'customerXref' and 'refDataItem'.

Naming conventions

This section defines naming conventions for MDM models.

Naming and casing recommendations are a matter of personal preference. The approach presented in this article leverages a Lower Camel Case style. If you choose not to adopt the Talend style, it is important to define your own standards and implement them consistently.

Note: Talend is case sensitive.

Models and containers

This section defines the naming conventions for models and containers.

Talend MDM data models must always have the same name as their related data containers. You are given the option to automatically create the container when creating the model.

Lower Camel Case should be used for the model name. The data model name generally refers to the domain that is mastered. Try to avoid generic terms like 'master', 'mdm', 'hub' and so on.

Here are some examples of good model names:

  • 'customer'
  • 'xyzIncCustomer'
  • 'product'
  • 'xyzIncProduct'

Entities

This section defines the naming conventions for entities.

Entity names should be human readable, recognisable to a business data steward and non system-specific. For example, if you wanted to have an entity to represent an organization-wide view of a vendor or supplier and one of your source systems was SAP, the vendor entity in MDM should not be called 'LFA1' (the SAP name for the object that holds top level vendor information). In this scenario, 'vendor' would be a more appropriate name, and there may not even be a 1:1 mapping between a SAP vendor and the organization-wide MDM definition of a vendor. This depends on business requirements.

Again, Lower Camel Case should be used :

  • 'customer'
  • 'address'
  • 'refDataItem'
  • 'customerXref'

The entity names should be short and can be reasonably technical, since it is possible to add a more business-friendly label for users of the Talend MDM Web UI. This will be covered later in this article.

Primary Keys

This section defines the naming conventions for primary keys.

You should use the name of the entity followed by Id, in Lower Camel Case. The purpose of this syntax is to aid mapping in the integration layer.

Foreign Keys

This section defines the naming conventions for foreign keys.

You should use the name of the foreign entity followed by 'Fk', in Lower Camel Case. If it points to a foreign entity that contains multiple lists of data, such as the shared reference data entity discussed later in this article, then using the list name is reasonable. If it points to an entity called 'refDataItem' which is used to hold lists of titles, genders and colors, a foreign key to this entity could be called 'titleFk'.

The purpose of this syntax is to aid mapping in the integration layer.

Simple elements

This section defines the naming conventions for simple elements.

You should use Lower Camel Case and the name should be concise and human readable. There is no need to repeat the entity name in the element name. For example 'first' should be used for a first name field in an individual entity, rather than 'individualFirst'. Again, any field in the model can have a more user friendly label applied, so the element names are the technical names that would be used in ETL, services, etc.

Complex types

This section defines the naming conventions for complex types.

You should use Lower Camel Case and the name should give a concise description of the contents of the complex type. Three primary scenarios where complex types are used can be identified:

  • To logically encapsulate a set of related fields together. For example, use the naming convention 'name of the group of fields' + 'Details'. If you decide that a person can optionally have one and only one home, mobile and fax number, you can contain these fields in an element named 'phoneDetails'.
  • An element that contains a repeating list. Repeating lists should always be contained within a parent element, as will be discussed later in this article. Here you can use the 'name' + 'List' convention. For example, a list of email addresses can be named as follows.
  • An element that contains a complex type, that is itself repeating within a parent list element. The suggested convention here is 'name' + 'Item'. For example:

Reusable types

This section defines the naming conventions for reusable types.

It is tempting to use a convention of 'name' + 'Type' for naming your reusable types. However, this can be problematic if you need to use the word type within your element name as well. For example, if you had a field that defined a person type – perhaps constrained to being either a 'customer' or a 'prospect' – then the reusable type that defines your enumeration of values would be called 'personTypeType'. This is obviously not helpful. To counter this and make your reusable type names distinct from your element names, you can use the following syntax: 'name' + '_T', where 'name' is in Lower Camel Case. For the example above, the reusable type name would be 'personType_T'.

MDM ID Generation

This section describes the different ways to generate IDs with Talend MDM.

Every record in Talend MDM requires a unique primary key – unique within a given entity. Compound keys are not recommended. As a general best practice, this unique primary key ID should be generated by Talend MDM for master data records and can be optionally generated by Talend MDM for reference data, depending on the business processes surrounding the reference data and the model design. This is because generation of IDs by Talend MDM addresses the following issues:

  • Concurrency: with Talend MDM generating the keys, it is impossible for two records to be given the same key, even in high concurrency, real-time scenarios.
  • Generation of an organization-wide unique MDM identifier: an MDM record could potentially consist of data from records in source systems A, B and C, possibly from multiple records in each system. Therefore, it makes little sense to use the identifier generated from a single system as the primary key, especially as the life of the MDM record may persist beyond the life of that single source record. Generally, you will also master cross references to these source records, linked to your unique MDM ID, but it makes sense to be able to uniquely and consistently identify an entity that is critical to doing business by an ID that is recognised and understood by the whole organization, not just a single system at a single point in time.

Some reference data may have a unique key managed by a trusted external source. For example, if you need to hold ISO country information in your MDM hub and you decide to adopt the ISO three character format as your enterprise standard, you can trust that the ISO organization will not change the data to introduce a duplicate code, when compared to previous releases. Therefore, it may be appropriate to use the ISO three character code as your primary key in MDM, especially in this case, where the rate of change of the data is very slow.

Talend MDM provides two different simple types for ID generation: AUTO_INCREMENT and UUID.

AUTO_INCREMENT

AUTO_INCREMENT is a sequence number which increments by one every time a record is created and starts at one by default. The base type is a string and must not be changed.

The counter's current values are stored in the CONF container:

They can be reset automatically using a Job or service. Alternatively, they can be reset manually using Talend Studio, with the Manage AutoIncrements button on the data container browser.

Obviously, if they are reset to a value, that means the next generated ID will conflict with a record already existing in MDM and an overwrite will occur.

AutoIncrement IDs are not guaranteed to be complete or sequential. If you were to create records 1, 2 and 3 and then delete record 2, the next generated ID would not be 2, and it is not guaranteed to be 4, although it usually would be.

Note: The AUTO_INCREMENT type is not recommended for use in entities that require high load performance in incremental batch Jobs or real-time services.

Specifically, it should not be used:

  • When the performance requirements (volume, speed) of a batch process dictate that multiple write threads should be used concurrently via the Talend MDM components (MDM SOAP API underneath) or via the MDM REST API.
  • When using Talend ESB to create master data services that are required to be called by many concurrent clients during peak system usage.
  • When the use of the MDM REST API is required in high concurrency/volume scenarios.

Typically, the definitions above cover the core master entities in a model. Thus, the best practice is usually to use UUID and not AUTO_INCREMENT for master data entities.

UUID

UUIDs, or Universally Unique IDentifiers, in their canonical form, are represented by 32 hexadecimal digits displayed in five groups separated by hyphens, in the form 8-4-4-4-12, for a total of 36 characters. For example:

09bf989f-5b24-47bc-871e-1e824d4f4c60

According to Wikipedia, the chances of a UUID collision (generating an already existing UUID) are incredibly low:

"Only after generating 1 billion UUIDs every second for the next 100 years, the probability of creating just one duplicate would be about 50%. Or, to put it another way, the probability of one duplicate would be about 50% if every person on earth owned 600 million UUIDs.”

UUIDs have several advantages over AutoIncrement IDs:

  • No need to use Hazelcast to orchestrate key generation in an MDM cluster
  • IDs can be easily generated outside of MDM as well as inside MDM if needed (typically done when inserting via the bulk loader in an initial load process)
  • No need to reset counters when resetting the hub back to a known state (for example: an empty initial state, commonly done in testing)
Note: It is also possible to use the AUTO_INCREMENT and UUID types for non-primary key fields. This can be useful in certain circumstances, for example to allocate a unique ID to an entry in a repeating list of values.

Reusable vs Anonymous types

This section describes the differences between reusable and anonymous complex types.

There are two primary reasons to name a complex type:

  • Reusability: It will appear in the reusable types section of the modeller
  • Readability of physical storage

If a type is not named, it will be listed as being an anonymous type in the model.

Entity level

As a general rule, the complex type that defines a complete entity can be left as anonymous. There is usually no benefit to making the type reusable. In terms of the physical storage of MDM, there are no advantages since the table name is taken from the entity name.

Complex types within an entity

Whether a complex type with an entity needs to be reusable depends on how you define your cardinalities. It can of course be reusable if there are benefits to doing so, that is if you need to use this type in multiple entities.

In the following example, a reusable type is not required:

While this is a hierarchical structure, 'testComplex' and its child element 'stuff' can both only appear a maximum of one time. Thus, when they are manifested in the physical storage, the child elements are in the entity level table 'test'. Only elements that can contain data appear in the physical storage, so there is no benefit to having fields for 'testComplex' and 'stuff', since they cannot hold any actual data.

However, should you change the cardinality of 'stuff' to allow multiple instances of it within the list:

Now the physical structure has become more complex, with multiple tables required to store the data as it is modelled. The names of the new tables are not particularly helpful.

Now you can name your types.

It is clear from the physical table names which tables belong to your 'test' entity.

Finally, here is what happens when you create a second entity that reuses the type 'testComplex_T':

You can now see the reuse in action in the physical storage.

The MDM application layer operates at a higher level of abstraction than the physical storage, which is in third normal form and is not actually aware of how the data is physically stored. The application layer understands records as XML documents with relationships to other XML documents. This is a Talend extension to the base definition of XML. However, you can see from this exercise that there are benefits to the physical storage in applying some modelling best practices. Other examples of this are detailed later in this article.

Simple element typing

This section explains how to type simple elements.

Simple elements can use one of the default XML types:

Or one of the provided custom types:

For a complete list of types and the mapping to the physical storage, see MDM data model to RDBMS mapping.

By properly typing your simple elements, you define a set of validation rules to which your master data must comply. A simple example: you have a date of birth field in an entity. You could hold this date in a string field, but:

  • There would be no validation that a correct date was entered at the MDM application level. Of course, if data is ingested using ETL or services, the validation could be within the integration logic.
  • In the Talend MDM Web UI, you would not see the date picker widget.
  • The date would be stored as a string (VARCHAR) in the physical storage.
  • You would not be able to perform date comparison searches.

The obvious thing to do here is type the date of birth field as a date type, which can of course be formatted to display in your chosen format (see the MDM training). However, if you also had unknown, N/A or default placeholder dates in your source data, these would not validate as date values, with the possible exception of the placeholder dates. This will be discussed in the Cardinality section.

The following example shows a surname field, which does not need to hold string data. Initially, it would seem sensible to use the 'string' type within your model, but this approach leads to a problem that is brought by the use of an RDBMS as your physical storage layer.

You can see that in the example above, the default string type has been used, and in the physical storage the x_last field is created as a VARCHAR(255). Here is what happens when you try to enter data with more than 255 characters:

You get an error in the Talend MDM Web UI and a database error in the Talend MDM Server log.

You can create a new simple type: 'string255_T', set a length restriction for this type, and apply it to your 'last' element:

And then try your large string save test again:

Your 'string255_T' gives you a red box and exclamation mark to indicate an error before you hit save, and a tooltip giving details of the error and a more user-friendly error if you click Save.

Of course, you can use this principle to enforce length restrictions other than 255 characters, and you can see the effect this has on the physical storage.

Strings lengths greater than 255 characters are a special case.

The use of the MySQL LONGTEXT type is automatic for any value greater than 255. The exception to this is Oracle, where a VARCHAR up to length 4000 is used. In most scenarios, it is unlikely for anything that can be described as master data to have this many characters.

Having examined string lengths, you can conclude that strings in a Talend MDM model should always be given a maximum length to:

  • Enforce business rules as to the maximum length of a given field
  • Give meaningful error messages
  • Optimise the physical storage: there is no point in using 255 characters in the database if you never populate more than 10 characters, this would be a waste of physical storage resources

When defining custom simple types, you can also add rules, known as facets, for the following:

  • length: set a fixed length for the element
  • minLength: minimum length
  • pattern: the element has to conform to a regular expression pattern
  • enumeration: fixed set of values, for more information, see the Rules section
  • whitespace: preserve, replace or collapse

When setting a facet value, you may want to set a facet message to give a more business-friendly tooltip in the user's language of choice.

Foreign key fields

A foreign key field should always be a string. An optional best practice is to set a length restriction on an FK field. However, remember that an MDM foreign key is bounded by square brackets, [123] for example, so any length restriction should take the two extra characters into account.

Cardinality

This section defines the best practices for cardinality.

According to Google dictionary, cardinality is "the number of elements in a set or other grouping, as a property of that grouping".

The primary key element of a Talend MDM entity must have a cardinality of 1...1, which means it must appear once and only once in each entity. Any other element can be configured with varying cardinality as defined below.

Minimum Maximum Description
1 1 A single occurring mandatory element
0 1 A single occurring optional element
1 0 (or -1 or many) An element that must occur at least once, but with no upper limit on the number of occurrences
0 0 (or -1 or many) An optional element, but with no upper limit on the number of occurrences
1 3 (or any other integer greater than 1) An element that must occur at least once, with an upper limit of three (or any other integer greater than 1)
0 3 (or any integer greater than 1) An optional element, with an upper limit of three (or any other integer greater than 1)

Cardinality is displayed on each element of the model, except where it is set to 1...1.

At this point, you need to define the three XDS Order indicators:

  • All
  • Sequence
  • Choice

The section below is taken directly from the following page: XSD Indicators.

Order indicators are used to define the order of the elements.

All indicator

The <all> indicator specifies that the child elements can appear in any order, and that each child element must occur only once.

<xs:element name="person">
 <xs:complexType>
  <xs:all>
   <xs:element name="firstname" type="xs:string"/>
   <xs:element name="lastname" type="xs:string"/>
  </xs:all>
  </xs:complexType>
</xs:element>
Note: When using the <all> indicator, you can set the <minOccurs> indicator to 0 or 1 and the <maxOccurs> indicator can only be set to 1. The <minOccurs> and <maxOccurs> are described later.

Choice indicator

The <choice> indicator specifies that either one child element or another can occur.

<xs:element name="person">
 <xs:complexType>
  <xs:choice>
    <xs:element name="employee" type="employee"/>
    <xs:element name="member" type="member"/>
   </xs:choice>
  </xs:complexType>
</xs:element>

Sequence indicator

The <sequence> indicator specifies that the child elements must appear in a specific order.

<xs:element name="person">
 <xs:complexType>
  <xs:sequence>
    <xs:element name="firstname" type="xs:string"/>
    <xs:element name="lastname" type="xs:string"/>
   </xs:sequence>
  </xs:complexType>
</xs:element>

From a Talend MDM perspective, the order indicators that are generally used are All and Sequence. It is possible to use Choice, but it is rare to find MDM scenarios that can use it.

You should define the order indicators at this step because they are directly linked to a discussion on cardinality. By default, when you create an entity in Talend MDM, the anonymous type of the entity has an order indicator of All.

This is important because any element that you define within this entity can be sent to Talend MDM in any order. For this example, you can send the following XML to create a new 'newEntity' record in your MDM.

<newEntity> 
         <stuff>hello</stuff>
         <moreStuff>world</moreStuff>
</newEntity>

Or you could sent the XML as:

<newEntity> 
         <moreStuff>world</moreStuff>
         <stuff>hello</stuff>
</newEntity>

The order of the elements 'stuff' and 'moreStuff' is not important.

Note: There is no value set for 'newEntityId', this is of type AUTO_INCREMENT and therefore generated by MDM.

Here is what happens when you change the cardinality of the 'stuff' element from 0...1 to 0...many:

Now you can have more than one 'stuff', but all elements in the entity have to be supplied in the correct sequence.

<newEntity> 
         <stuff>hello</stuff>
         <moreStuff>world</moreStuff>
</newEntity>

The example above would be valid.

<newEntity> 
         <stuff>hello</stuff>
         <stuff>hi</stuff>
         <moreStuff>world</moreStuff>
</newEntity>

The example above would also be valid.

<newEntity> 
         <moreStuff>world</moreStuff>
         <stuff>hello</stuff>
</newEntity>

The example above would be invalid, due to the order of the elements. From an integration perspective, defining the loop (the repeating element) at the root entity level means you have to be incredibly precise in your integration logic. It can also mean that the integration logic is difficult to read and to understand, especially when you have more than one repeating element.

As defined earlier in this article, it is better to encapsulate your repeating elements within parent complex types.

Single repeating element:

Repeating complex type:

If you do this, the root anonymous type of your entity can use the order indicator All (no restriction on order) and only the elements (not their children) within the repeating loop are constrained by a sequence.

Finally, in the party/phone example above, you can examine the cardinality of the child elements of 'phoneNumberItem'. You can see that the three elements 'number', 'type', and 'marketingOk' are all configured with a cardinality of 1...1. However, examine the cardinality of the 'phoneNumberItem' and 'phoneNumberList' elements: these are optional. This means that:

  • If you supply no data for phone numbers, the entity will validate.
  • If you want to supply one or more phone numbers ('number'), the fields 'type' and 'marketingOk' would also be required, otherwise the entity would be invalid.

Loops within loops (nested loops)

This section describes the use of nested loops.

Consider that you want to model companies, the employees of a company and the email addresses of an employee, and that a company can have 0...many employees and each employee can have 0...many email addresses.

You could model this with a single MDM entity.

However, this has several disadvantages:

  • You would end up with one extremely large XML document for larger organizations, which would be hard to browse, process, search, view, etc.
  • The tXMLMap component in the Integration perspective of Talend Studio cannot easily (or at all, depending on the scenario) create such structures. You would need to use Talend Data Mapper, the Talend complex data mapping component which is more complicated to use than a tXMLMap and has potential performance implications.
  • You may be interested in a scenario where an individual works for more than one organization; this hierarchical approach cannot represent this.
  • You cannot use an MDM partial update to insert, update or delete an item in the email list. To use a partial update you need to be able to specify a pivot XPath, the XPath to the element that loops. But there are two pivots in this case:
    • company/employeeList/employeeItem: the loop for each employee
    • company/employeeList/employeeItem/emailList/email: the loop for each email of each employee
    As you can only specify one pivot, in the tMDMOutput component for example, a partial update on nested loops is not possible.

There will be more information on this in the following section. To summarize, the best practice is to avoid using loops within loops.

Relational vs Hierarchical vs Object Orientated

This section describes different modelling approaches.

As discussed in the Talend MDM modelling training course, there is no requirement to model in third normal form (3NF) in Talend MDM. You can model at a level of abstraction from the physical storage (RDBMS), at a level that suits your requirements in terms of business processes that need to be enacted upon your master or reference data. In the previous section (Loops within loops), you saw that a purely hierarchical approach is also not desirable in many scenarios. As well as being able to model relationally and hierarchically, you can also model in Talend MDM using Object Orientation principles. Some models will combine all three modelling approaches. There is no right answer as to how to model something, but there are some guiding principles:

Leverage relational modelling where it is beneficial to your processes

With many to many relationships, for example if you have two entities : 'customer' and 'address', a customer might have many addresses and an address might be related to multiple customers. By modelling relationally – two entities with foreign key relationships – you can leverage the value of mastering the relationship. In this example, there could be value to being able to do a simple query that establishes that two customers have the same address. If 'address' was part of the 'customer' entity, you would have to query based on the values of your address (or some sort of hash key). The address would be held twice within MDM and you would be matching as part of a search process. If you model relationally, the two 'customer' entities would be related to the same 'address' entity. You can achieve this by cleansing and standardising your addresses with an external address validation tool (QAS, Logate, etc.) and then matching on 'address' as part of your load processes to ensure that duplicates are not inserted. Thus, the query is a simple "give me all the Customers with a foreign key relationship to Address X".

This principle can also be extended to reference data. Take for example the ISO country code example discussed earlier in this document. If you use country codes as part of an address, you can have a simple string(3) field to hold the code. However, if you also want to have the descriptive name, cross references to other source systems, a relationship between country and language, etc., putting all this information into the address entity is needlessly complex and a wasteful repetition of data. It is also error prone, as there is no validation of the ISO data as part of the model. Instead, it is much better to have an 'isoCountry' entity, or the reusable reference data entity discussed later in this document. By doing so, data that would be repeated in the denormalised approach is now held in one single record that can have governance processes designed for it. Changing the data in this single record automatically propagates the change to related records by virtue of the relationship. The additional country data can even be displayed in the Talend MDM Web UI, in the 'address' entity, using Foreign Key Infos. From the perspective of the 'address' entity, invalid 'country' data is now impossible to enter. In the Talend MDM Web UI, this manifests as a picklist for the user to select the relationship. Free-form text is not allowed.

Use hierarchical modelling to remove the need for join tables

While the previous section argues for the use of relationships where there is value to your MDM processes to do so, it is not an argument for a fully normalized model. In fact, a normalized approach can:

  • Increase integration complexity
  • Decrease performance substantially by having to insert into many entities at the same time to represent the master data of a single 'business' entity
  • Make the use of the Talend MDM Web UI a less optimal user experience, or even lead to orphan records and data inconsistency

A classic example of this is the many to many scenario discussed in the previous section. While there is value in having the relationship explicitly defined, modelling this structure in a flat, normalized approach would require three entities in your model: 'customer', 'address', and 'customerAddress' (to hold a record per relationship that exists between the two entities).

Instead, it is usually better to collapse the join table 'customerAddress' into one of the main entities as a repeating list of relationships. In this specific case, the list is normally collapsed into 'customer', which contains an element 'addressList' with a list of 0...many (or cardinality of your choice) address relationships. The complex type used to hold the relationship can also hold elements that provide context about the relationship, for example whether an address is a mailing or shipping address.

In the integration layer, it is very simple to perform the "give me all customers with this address" query. But if you were looking at the data in the Talend MDM Web UI, you may think that it would not be possible to do this without the join table. However, there are two approaches to this:

  • Browse to the address record you are interested in and click the Relations button, this allows you to reverse-engineer the relationships, even though the address record itself does not define any relationships.
  • Set up your MDM view searchable parameters for the customer view to allow searching by an address. The picker will allow this search based on both the technical key and data values you configure as Foreign Key Infos.

Finally, when dealing with repeating lists of values, you need to consider the following:

  • Do you need to update or delete values in the list? If so, you need a way of uniquely identifying an entry in the list. If the list does not naturally have a unique key, you may need to add a surrogate key to facilitate these types of actions. As mentioned earlier in this document, AUTO_INCREMENT and UUID types can be used for this, or perhaps a unique external key if the insert action is not allowed via the Talend MDM Web UI.
  • Do you need to cross reference entries in the list back to uniquely identifiable records in one or more source systems? If so, how will you hold these cross-references without introducing nested loops? This is a relatively rare scenario, but it does occur when a source system requires you to provide it with an ID to update or delete something, as opposed to:
    • Providing the full parent 'customer' record
    • Looking up the ID to update in the source system on the fly
    • Giving a before and after data instruction such as "Change: email: a@bc.com to ab@c.com"
    How you deal with updates to repeating list elements varies greatly depending on the business process, but it is important to consider how these updates will be done when designing the model.

Use Inheritance only where it is valuable

Conceptually, the idea of using inheritance in an MDM model can be appealing. Consider a simple example with addresses:

Here you see the Address base type holding the information that is common to all addresses. You also have UK and US-specific precisions on the base address that add country-specific fields. However, consider that:

  • Having to determine the specialization type in the integration layer at runtime adds considerable complexity to the integration logic.
  • You could actually achieve the same effect through relational modelling. Although there would be a small performance hit, the logic would be simpler.
  • It might actually be better to define a global address standard into which all types of address can be input. This is often the best approach of all, as it means that an organization-wide consensus on structure and process has been reached and the integration is simple and efficient.

Inheritance does have its place in Talend MDM, and this section will be expanded in a future release.

Reference data pattern

This section defines some best practices regarding reference data.

When creating MDM entities, you will often see the same pattern emerging, especially when dealing with reference data. For example, you may want to hold a list of titles instead of having a free text box for a title:

<title>		
 <titleId> Unique identifier for this title </titleId>
 <name> The actual master reference data value, e.g. ‘Mr’, ‘Mrs’ etc. </name>
 <description> Optional description of this value </description>
 <sourceXrefList>
  <sourceXrefItem>	0..many repeating list
   <sourceSystemFk> Optional FK to a source system entity </sourceSystemFk>
   <sourcePk> Optional: the PK of this data in a source system </sourcePk>
   <sourceLabel> This value as it is in source or a synonym. E.g. for ‘Mr’ in MDM, the value in a given source system might be ‘Mister’. Or it could just be a synonym for use in lookups / synonym indexes and not linked to a particular source system. </sourceLabel>
  </sourceXrefItem>
 </sourceXrefList>
</title>

Now consider that you also want to hold a list of genders:

<gender>		
  <genderId> Unique identifier for this title </genderId>
  <name> The actual master reference data value, e.g. ‘Male, ‘Female’ etc. </name>
  <description> Optional description of this value </description>
  <sourceXrefList>
    <sourceXrefItem>	0..many repeating list
      <sourceSystemFk> Optional FK to a source system entity </sourceSystemFk>
      <sourcePk> Optional: the PK of this data in a source system </sourcePk>
      <sourceLabel> This value as it is in source or a synonym. E.g. for ‘Male’ in MDM, the value in a given source system might be ‘M’. Or it could just be a synonym for use in lookups / synonym indexes and not linked to a particular source system. </sourceLabel>
    </sourceXrefItem>
  </sourceXrefList>
</gender>

You can notice a pattern. With many reference data entities, you would end up bloating your model and building many different integrations for the same repeated pattern. Instead, it is recommended to use a set of reusable reference data entities.

refDataList: a list of lists

Here the description would be the list name, for example 'title', 'gender', etc.

refDataItem: an entry in a list

This is the same structure as your title and gender structures above, but genericized and with the addition of a foreign key to a refDataList. This allows you to easily group all reference data of the same type together and provides some governance and control over the authorship process.