You may have heard the term “Data Vault” being mentioned occasionally. In fact, I find that when speaking to our larger clients about the best way to organize information by developing a data model, two solutions are often brought up. The first is a relational model which we all know and love. The second, which many clients are less familiar with, is a data vault model.
Architecture of data vault provides an exceptional solution to technical issues and business problems alike. The methodology revolves complete around data integration efforts across the enterprise and is built from solid foundational concepts. During the last few years, Data Vault has increases in popularity providing a next-generation solution for tackling modern data warehouse problems.
What is a Data Vault?
A Data Vault is defined as a detail oriented, historical tracking and uniquely linked set of normalized tables that support one or more functional areas of business.
Software, data teams, business processes generally change over time. The need for a new modelling technique arose because of the ever-changing nature of this. In traditional data warehouse modelling techniques, it is quite costly to make changes on a frequent basis. Data Vault, on the other hand, was designed to solve these problems.
Data Vault was established to enable auditing and storage of historical data, permit parallel loading, and allow companies with a variety of data sources to scale without starting a whole redesign of the current solution.
What are the Advantages and Disadvantages of Data Vault?
When considering designing a data lake around the Data Vault methodology, we should consider the potential pros and cons of it.
Advantages of using Data Vault:
- Provides auditability by historical record tracking
- Insert only architecture
- Flexible to variations and changes without re-engineering
- Can be built incrementally
- Fault-tolerant ingestion pipelines
- Keeps all data
- Decoupling of rules and data (Data vault stores business rules separately, so it is easier to update)
- Allows for data loads with a high degree of parallelism
- Technology agnostic
Disadvantages of using Data Vault:
- Data Vault is a structured solution hence teams need proper training for Data Vault implementation
- Data vault Models can be more complicated/complex
- A large storage may be needed to maintain a full history of data
- When data is ingested into Data Vault, it may not be instantly user ready. A business vault and information marts need to be created in order to provide value to the business
What are Hubs, Sats and Links in Data Vault?
There are 5 different types of tables in the Data Vault model. The original Data Vault spec had only 3 of them (Hubs, Sats and Links). The additional 2 tables (PIT and Bridge) were introduced with Data Vault 2.0.
Another big change in Data Vault 2.0 is the introduction of hash keys in the model. These hash keys are mandatory because of the many advantages. Hash keys do not only speed up the loading process; they also ensure that the enterprise data warehouse can span across multiple environments: on-premise databases, Hadoop clusters and cloud storage.
A Data Vault Hub contains a list of unique business keys. These business keys are keys or codes which the business uses to identify one record. Business keys have a low chance to change.
A Hub may represent any important business entity within an organization such as a product, store, or a customer.
Hubs also contain a surrogate key for each Hub item. A surrogate key is how the business identifies an object if no direct business key is available. Hubs also include metadata which describes the origin of the business key.
The Hub contains at least the following fields:
- A surrogate key, used to connect the other structures to this table.
- A business key, the driver for this hub. The business key can consist of multiple fields.
- The record source, which can be used to see what system loaded each business key first.
- You may also have metadata fields with information about manual updates (user/time) and the extraction date.
- A hub is not allowed to contain multiple business keys, except when two systems deliver the same business key but with collisions that have different meanings.
The descriptive attributes for the information on the Hub (such as the description for the key, possibly in multiple languages) are stored in structures called Satellite tables which will be discussed below.
Hubs should normally have at least one satellite.
Links stores the intersection of business keys (HUBS). Links can be considered the glue that holds the data vault model together.
In a Data Vault model, all source data relationships (i.e., foreign keys) and events are represented as Links. An important rule in Data Vault is that no Foreign Keys should be present in hubs. So, to represent joining between hubs, we can use a Link Table.
Associations or transactions between business keys (relating for instance the hubs for customer and product with each other through the purchase transaction) are modeled using link tables. These tables are basically many-to-many join tables, with some metadata.
The purpose of the Link is to capture and record the relationship of data elements at the lowest possible grain. A Link is therefore an intersection of business keys. It contains the columns that represent the business keys from the related Hubs.
A Link Table should have the following:
- A Link must have more than one parent table
- There must be at least two Hubs
- In the case of a transaction (referred to t_links), links may be composed of many Hubs
Sats or Satellite Tables
A Satellite consists of all the contextual details regarding an entity in Data Vault architecture.
Satellites add all the color and description to the business keys (hubs) and relationships (links) in the data vault environment. When there is a change in the data, a new row must be inserted with the changed data.
The main purpose of satellites is to let one know the information in effect at any point in time. Satellites usually contain descriptive information, tracking change by start and end dates over time.
Satellites are always directly related and are subordinate to a hub or a link. They offer context and definition to business keys.
There may be multiple satellites pointing to one hub or one link. This is because there could be multiple sources or different by data types. Generally, you would design a new Satellite table for each data source and then further separate data from those sources that may have a high frequency of change.
Looking at the Data Vault Structure- What does it look like?
Parts of the structure of a data vault looks quite like the traditional data warehouse model especially when it comes to dimensional modelling and star schemas.
Let’s consider an example such as a sales order for Sephora. A simplified star schema will end up looking like this:
Now considering the Data Vault models mentioned previously and applying this to the example, we get the following:
You can notice that there are a few more tables added, and we no longer have “single” tables representing the product, store, and customer. If you have a look at the h_customer table, it has two satellite tables. One table represents customers from Sephora’s CRM and the other customers from Sephora’s website. Since these are two different sources, they would be two different satellite tables.
Now, we can see how easy it would be to ingest new source data as the structure does not change.
The Architecture of Data Vault
The multi-tier architecture of Data Vault helps to provide us with a solution that is both flexible and scales well.
In Data Vault architecture, you will often see an area dedicated for staging. This is generally a landing zone for data flowing into the Data Vault. Generally, this area does not contain history data and usually just copies the structure (or scheme) of the source data.
The main purpose of the staging area is to ingest data from source systems quickly. Due to this, hard business rules can be applied to the data (and usually comprises of anything that doesn’t affect the actual data content).
Now, data in a staging area is usually only kept there until it is successfully loaded into the data warehouse. This is where between two loads, the tables in staging are dropped and recreated before the next stage load.
A PSA (Persistent Staging Layer) is important as you can change the Data Vault model completely (for example, when business concepts initially were not well understood), create new ETL Code to load data to the modified Data Vault and do a replay of loading full history of the data captured in the PSA
Enterprise Data Warehouses in Data Marts
What are Raw Vaults?
The Raw Vault is where your main data vault entities stay. These include (Hubs, Satellites, Links and T-Links).
Data is transferred from the staging area into the raw vault. There should be no business rules applied to the data when ingestion from raw to staging is happening. Essentially, the Raw Vault is the raw, unfiltered data from the source, loaded into Hubs, Links, and Satellites based on Business Keys.
What are Business Vaults?
The Business Vault is optional and is an extension of a Raw Vault that applies selected business rules, calculations, logic, and other query assistance functions in order to assist reporting and user accessibility. Business Vault tables should be refreshed once their dependent Raw Vault tables are refreshed.
The Business Vault bares resemblance to DV style tables where it includes Hubs, Sats and Links. That being said, it is not a clone of the Raw Vault. In the Business Vault, we will be able to transform the data in some way to apply rules or functions that most of the business users will find useful (as opposed to doing these repeatedly into multiple marts). This includes things like data cleansing, data quality, accounting rules, or well-defined repeatable calculations (e.g., net profit, total compensation, etc.).
The Information Marts are where valuable information is finally delivered to the business users typically through reports or dashboards. Rules and logic of the business are all applied, and the business users will be able to see this. Information Marts are typically implemented as views but can be physicalized if needed, e.g., for performance.
All-in-all, the Data Vault methodology proves to be a front-runner in data modeling solutions. Data Vault 2.0 includes an agile methodology and offers best practices and high-quality standards that are perfect for modern business with scaling data.
I think Data Vault is a very viable approach to meet the needs of an organization where historical tracking, auditability and scaling are important factors.
Interested in more? Check out our product, Vantage Point. Vantage Point (VP) is a no-code, click & go business acceleration tool which enables data driven decisions across your business. It drives interactivity across all parts of your organization by communicating value (KPIs), autogenerating tasks with cutting-edge ML/AI technology and enabling users to combine VP’s ML/AI recommendations with their own analysis. You can finally track the exact ROI impact throughout your entire business with Vantage Point.
Sign up for a demo with the link below
Senior Technology Architect
Senior Technology Architect