How to: Talend for "Data Vaults"
This article tutorial details the Job design patterns and best practice recommendations for building data population and extraction procedures for a Data Vault using Talend.
Enterprise Data Warehouses have evolved. Big Data ecosystems have emerged as a predominant storage destination and the 'Data Lake' has become the preferred terminology for raw business data processing. Directly impacted from this significant paradigm shift are data modeling methodologies. STAR schemas, while robust and popular in their day, have become insufficient in this Brave New Data World. Fact and Dimension tables can be forced into a Hadoop environment, however they simply fall short in various ways; on the update requirements in particular. STAR schemas also generally require highly cleansed data and do not adapt well when upstream systems change. These factors make utilization of Big Data STAR schemas problematic at its best and destined for disappointments.
Data Vaults to the rescue! Dan Linstedt, author and inventor of Data Vault 1.0 and 2.0 understood that 3NF and STAR schemas simply fail to deliver on the promise of Enterprise Data Warehouses ( What is "The Data Vault" and Why Do We Need It? ). Data Modeling is the 1st essential step in the process of creating and managing Enterprise Data Warehouses and Data Lakes on Big Data ecosystems ( Beyond "The Data Vault" ). Moving data into and getting data out of these infrastructures is the next essential step. This is where Talend comes in ( Talend And "The Data Vault" ).
Please read the blog links in the paragraph above for some important background first. Then proceed for details on how to build Talend Jobs for Data Vaults.
|All of the Jobs in this DV Tutorial have been designed using best practices in canvas layout, documentation, error handling and flow control (please read the Talend Blogs on Job Design Patterns Part 1 , Part 2 , Part 3, & Part 4 for more information).|
As a start to this tutorial the data model involved is presented below. This particular tutorial is focused on a common data workflow using relational data models only. As indicated in the blogs above, Big Data data models are also quite possible. Look for subsequent versions on this tutorial that demonstrates these variants. To fully understand this tutorial the three (3) data models here show a SOURCE database, a DATA VAULT database, and a DATA MART database.
The SOURCE data model is represented by the ACCOUNTS and USERS tables where an account record can have zero to many user records and a user record must have one and only one account parent record. This is defined physically as a Foreign Key on the Account_ID column in the USERS table to the ACCOUNTS table.
The DATA VAULT data model is represented by the hub_ACCOUNT , hub_USER , lnk_ACCTUSRS , sat_ACCOUNT , sat_ADDRESS , and sat_USER tables. The HUB tables contain the identified business keys and the SAT tables contain the attributes that change over time, all derived from the SOURCE database. The lnk_ACCTUSRS table provides the zero to many relationship of the hub_ACCOUNT and hub_USER tables as defined in the foreign key between the ACCOUNTS and USERS tables in the SOURCE data model . Physical Foreign Keys are defined for the LNK and SAT tables to the HUB tables as shown in the data model.Data Vault Model using Point-In-Time (PIT Tables)
A revised DATA VAULT data model incorporating PIT and BRIDGE tables can help achieve an interesting option: A 100% INSERT ONLY Data Warehouse!
Added are the pit_ACCOUNTS, pit_USERS and brdg_ACCTUSERS tables. The PIT tables contain the Hash Keys from each of the related HUB and SAT tables plus a Snapshot Date. As rows are inserted into the HUB and SAT tables, a record is also added to the PIT table. Notice that the 'rec_load_end_date' is no longer included in the SAT tables. This design eliminates the need to UPDATE the SAT tables, thus providing a 100% INSERT ONLY option. The PIT Tables are designed to setup an EQUA-JOIN on a HUB and it's related SAT tables. When a Snapshot is desired where two or more HUB tables are involved, use a BRIDGE table in a similar way.
Note that the PIT and BRIDGE tables have a tendency to grow rapidly so an archival mechanism may be warranted when large volumes of data are involved.
The DATA MART or De-Normalized data model is represented by the dn_AccountUsers table. This table contains the versioned, historical records extracted from the DATA VAULT data model to provide consistent analytics and reporting capabilities for current values and for values 'back-in-time'.
While this tutorial is not about how to design a DATA VAULT data model, from the ERD shown above, the basic ideas are shown; here is a summary of the 5 simplified steps:
|Identify Business Keys||Relational source tables generally include fields that represent static values providing meaningful, non-mutable, business identifiers; these are your Business Keys used in HUB tables; HUB tables should only include columns that contain values that never change.|
|Identify Attributes||The remaining attributes from source tables that represent values which do change over time are placed into SAT tables; SAT tables should be appropriately organized and contain the remaining columns that contain values that can change. Sometimes these source columns are better suited for LNK tables; for example transaction details on a ORDER>LINE ITEM structure.|
|Establish Linkages||HUB tables have unique surrogate keys, usually hash-keys set as primary keys, and never use any surrogate keys from their source as the primary key (those become business keys). SAT tables have surrogate keys that are composed of the parent HUB table primary key and the rec_load_date only. This establishes a foreign key relationship to the HUB table. Notice that a unique index on the HUB hash key and rec_load_date is created to allow for the foreign key from the SAT table. LNK tables create a linkage between HUB tables. LNK tables have unique surrogate keys, also best defined as a hash key set as a primary key. LNK tables then contain the corresponding hash keys from 2 or more HUB tables being linked together.|
|Control Lineage||Tracking when and where data originates is essential to a Data Vault data model. A rec_load_date provides the date as to when the record arrived into the Data Vault, NOT the date the record was created in the source table. The rec_source then provides location and/or name of the source table as it is possible for a single Data Vault table to contain records from multiple source tables. Generally HUB tables must contain these two columns. The rec_source column may be unnecessary in LNK and SAT tables however the Data Vault specification includes it.|
|Control History||As SAT tables accumulate historical versions of a particular record the rec_load_date and the rec_load_end_date provide a mechanism to allow storage of these records over time. The rec_load_end_date is null for the most current version of a record and contains the value of the next rec_load_date record when it arrives into the table. This requires an update operation on the previous record to set the rec_load_end_date and enables queries against the SAT tables using these two dates to pull the current values and values 'back in time'. Note that for a Big Data Data Vault these are differences here as update operations are highly undesirable or possibly not available.|
The use case for this tutorial is to setup and populate the SOURCE database with interesting data , extract this data into CSV files and then bulk loading into the DATA VAULT database, transforming the data to match the HUB/SAT/LNK tables. Once processed into the DATA VAULT database the tutorial extracts the data, transforming and loading a denormalized (or flattened) representation of the data into the DATA MART database. ETL_CONTROL tables are utilized in the SOURCE and DATA VAULT databases to to support the data processing of the delta (or latest) records from the SOURCE through to the DATA MART, then the DATA MART databases.