Data Vault Modeling
Why Data Vault Exists
Traditional dimensional models (star schemas) work well for single-source, stable-schema data warehouses. But enterprises face:
- Multiple source systems — Each with different schemas, keys, and semantics
- Frequent source changes — Systems get replaced, APIs change, new feeds arrive
- Audit requirements — Regulators need full traceability of every data change
- Parallel development — Multiple teams need to load data simultaneously without conflicts
- Agile delivery — Business can't wait 6 months for a monolithic data warehouse design
Data Vault, created by Dan Linstedt in the early 2000s, addresses all five by separating structure (Hubs and Links) from context (Satellites), enabling parallel, additive development.
Historical Context
- 2000: Dan Linstedt develops Data Vault at the US Department of Defense
- 2013: Data Vault 2.0 introduces hash keys, multi-active satellites, and Business Vault
- 2015-present: Adoption grows in regulated industries (banking, healthcare, government)
- 2020s: dbt and modern ELT tools make Data Vault implementation more accessible
The Core Philosophy
"The only constant is change. Model for change, not for stability."
Data Vault embraces change by making the model additive — new sources and entities are added without modifying existing structures. Nothing is ever updated or deleted.
First Principles
The Three Core Components
Hubs: Business Entity Identity
A Hub represents a unique business concept identified by its business key. Nothing more.
Rules:
- Contains only the business key, hash key, load date, and record source
- One Hub per unique business concept (Customer, Product, Order, Account)
- Never updated — only inserted
- Business key is the natural key from the source system
interface Hub {
hash_key: string; // MD5/SHA-256 of business key (PK)
business_key: string; // Natural key from source
load_date: Date; // When first loaded
record_source: string; // Source system identifier
}
// Example: Customer Hub
interface HubCustomer extends Hub {
// hash_key = hash('CUST-12345')
// business_key = 'CUST-12345'
// load_date = 2026-03-18T10:00:00Z
// record_source = 'CRM_SYSTEM'
}Satellites: Descriptive Context
A Satellite holds the descriptive attributes that change over time. It's like an SCD Type 2 implementation built into the model.
Rules:
- Contains the parent Hub/Link hash key, attributes, hash diff, load date, and record source
- One satellite per source system per rate of change
- Insert-only — new row on every change
- Hash diff enables efficient change detection
interface Satellite {
hub_hash_key: string; // FK to parent Hub
load_date: Date; // When this version was loaded (part of composite PK)
hash_diff: string; // Hash of all descriptive attributes
record_source: string; // Source system
load_end_date: Date | null; // When superseded (null = current)
}
// Example: Customer attributes from CRM
interface SatCustomerCRM extends Satellite {
customer_name: string;
email: string;
phone: string;
segment: string;
}
// Example: Customer attributes from billing system
interface SatCustomerBilling extends Satellite {
billing_address: string;
payment_method: string;
credit_limit: number;
}Links: Relationships
A Link captures the relationship between two or more Hubs. It represents a business event or association.
Rules:
- Contains hash keys of all participating Hubs
- Has its own hash key (hash of all participating business keys)
- Insert-only
- Represents N:M relationships by default
interface Link {
hash_key: string; // Hash of all participating business keys (PK)
load_date: Date;
record_source: string;
}
// Example: Customer-Order relationship
interface LinkCustomerOrder extends Link {
hub_customer_hash_key: string; // FK to Hub_Customer
hub_order_hash_key: string; // FK to Hub_Order
}
// Example: Order-Product relationship (with quantity in link satellite)
interface LinkOrderProduct extends Link {
hub_order_hash_key: string;
hub_product_hash_key: string;
}Hash Keys
Data Vault 2.0 introduced hash keys as the primary mechanism for key management. This is one of the most debated aspects of Data Vault.
Why Hash Keys?
- Deterministic: Same input always produces the same hash — enables parallel, idempotent loading
- Source-independent: No need for a centralized sequence generator
- Equality joins: Hash-to-hash joins are fast (fixed-width comparison)
- Cross-system integration: Same business key from different sources produces the same hash
Hash Key Generation
import { createHash } from 'crypto';
class DataVaultHasher {
private readonly separator = '||';
private readonly nullReplacement = '^^';
/**
* Generate a hash key from one or more business key components.
* All inputs are upper-cased and trimmed for consistency.
*/
hashKey(...components: Array<string | number | null>): string {
const normalized = components.map((c) => {
if (c === null || c === undefined) return this.nullReplacement;
return String(c).trim().toUpperCase();
});
const concatenated = normalized.join(this.separator);
return createHash('md5').update(concatenated).digest('hex');
}
/**
* Generate a hash diff from satellite attributes.
* Used for change detection — if hash diff hasn't changed,
* don't insert a new satellite record.
*/
hashDiff(attributes: Record<string, unknown>): string {
const sorted = Object.keys(attributes).sort();
const values = sorted.map((key) => {
const val = attributes[key];
if (val === null || val === undefined) return this.nullReplacement;
return String(val).trim().toUpperCase();
});
const concatenated = values.join(this.separator);
return createHash('md5').update(concatenated).digest('hex');
}
}
const hasher = new DataVaultHasher();
// Hub hash key
const customerHashKey = hasher.hashKey('CUST-12345');
// Always produces the same hash for the same customer
// Link hash key (composite)
const orderProductHashKey = hasher.hashKey('ORD-789', 'PROD-456');
// Satellite hash diff (for change detection)
const hashDiff = hasher.hashDiff({
name: 'John Doe',
email: 'john@example.com',
segment: 'Premium',
});WARNING
Hash collision risk: MD5 has a collision probability of approximately
This is negligibly small. However, if you're uncomfortable with MD5, use SHA-256 at the cost of larger key sizes (32 bytes vs 16 bytes).
Loading Patterns
Hub Loading
interface HubLoadResult {
inserted: number;
skipped: number; // Already existed
}
async function loadHub(
sourceRecords: Array<{ businessKey: string; recordSource: string }>,
hasher: DataVaultHasher,
db: Database,
): Promise<HubLoadResult> {
let inserted = 0;
let skipped = 0;
for (const record of sourceRecords) {
const hashKey = hasher.hashKey(record.businessKey);
// INSERT only if not exists (idempotent)
const result = await db.query(
`INSERT INTO hub_customer (hash_key, business_key, load_date, record_source)
SELECT $1, $2, CURRENT_TIMESTAMP, $3
WHERE NOT EXISTS (
SELECT 1 FROM hub_customer WHERE hash_key = $1
)`,
[hashKey, record.businessKey, record.recordSource],
);
if (result.rowCount > 0) {
inserted++;
} else {
skipped++;
}
}
return { inserted, skipped };
}
interface Database {
query(sql: string, params: unknown[]): Promise<{ rowCount: number }>;
}Satellite Loading with Change Detection
interface SatelliteLoadResult {
inserted: number;
unchanged: number;
}
async function loadSatellite(
sourceRecords: Array<{
businessKey: string;
attributes: Record<string, unknown>;
recordSource: string;
}>,
hasher: DataVaultHasher,
db: Database,
satelliteTable: string,
): Promise<SatelliteLoadResult> {
let inserted = 0;
let unchanged = 0;
for (const record of sourceRecords) {
const hubHashKey = hasher.hashKey(record.businessKey);
const newHashDiff = hasher.hashDiff(record.attributes);
// Check if the latest satellite record has the same hash diff
const existing = await db.query(
`SELECT hash_diff FROM ${satelliteTable}
WHERE hub_hash_key = $1
AND load_end_date IS NULL
ORDER BY load_date DESC
LIMIT 1`,
[hubHashKey],
);
if (existing.rowCount > 0 && (existing as any).rows[0].hash_diff === newHashDiff) {
unchanged++;
continue; // No change — skip
}
// End-date the previous current record
if (existing.rowCount > 0) {
await db.query(
`UPDATE ${satelliteTable}
SET load_end_date = CURRENT_TIMESTAMP
WHERE hub_hash_key = $1 AND load_end_date IS NULL`,
[hubHashKey],
);
}
// Insert new satellite record
const columns = Object.keys(record.attributes);
const values = Object.values(record.attributes);
const placeholders = columns.map((_, i) => `$${i + 5}`);
await db.query(
`INSERT INTO ${satelliteTable}
(hub_hash_key, load_date, load_end_date, hash_diff, record_source, ${columns.join(', ')})
VALUES ($1, CURRENT_TIMESTAMP, NULL, $2, $3, ${placeholders.join(', ')})`,
[hubHashKey, newHashDiff, record.recordSource, ...values],
);
inserted++;
}
return { inserted, unchanged };
}Link Loading
async function loadLink(
sourceRecords: Array<{
hubKeys: Array<{ hubName: string; businessKey: string }>;
recordSource: string;
}>,
hasher: DataVaultHasher,
db: Database,
linkTable: string,
): Promise<{ inserted: number; skipped: number }> {
let inserted = 0;
let skipped = 0;
for (const record of sourceRecords) {
const businessKeys = record.hubKeys.map((h) => h.businessKey);
const linkHashKey = hasher.hashKey(...businessKeys);
const hubHashKeys = record.hubKeys.map((h) => ({
column: `hub_${h.hubName}_hash_key`,
value: hasher.hashKey(h.businessKey),
}));
const hubColumns = hubHashKeys.map((h) => h.column).join(', ');
const hubPlaceholders = hubHashKeys.map((_, i) => `$${i + 4}`).join(', ');
const hubValues = hubHashKeys.map((h) => h.value);
const result = await db.query(
`INSERT INTO ${linkTable}
(hash_key, load_date, record_source, ${hubColumns})
SELECT $1, CURRENT_TIMESTAMP, $2, ${hubPlaceholders}
WHERE NOT EXISTS (
SELECT 1 FROM ${linkTable} WHERE hash_key = $1
)`,
[linkHashKey, record.recordSource, ...hubValues],
);
if (result.rowCount > 0) {
inserted++;
} else {
skipped++;
}
}
return { inserted, skipped };
}Business Vault
The Raw Vault stores data as-is from sources. The Business Vault applies business rules and transformations on top.
Same-As Links (Entity Resolution)
When the same real-world entity has different business keys in different systems:
interface SameAsLink {
hash_key: string; // Hash of both business keys
hub_customer_master_hash_key: string; // "Master" record
hub_customer_duplicate_hash_key: string; // "Duplicate" record
load_date: Date;
record_source: string; // 'ENTITY_RESOLUTION_ENGINE'
confidence_score: number; // 0.0 to 1.0
}
// CRM says "John Doe, john@email.com" = CUST-001
// Billing says "J. Doe, john@email.com" = CUST-A001
// Same-As Link: CUST-001 <-> CUST-A001, confidence=0.95Point-in-Time (PIT) Tables
Satellites store history, but querying the current state of an entity requires joining multiple satellites and finding the latest record for each. PIT tables pre-compute this:
interface PointInTimeRow {
hub_hash_key: string;
snapshot_date: Date;
// Latest load_date from each satellite as of snapshot_date
sat_customer_crm_load_date: Date | null;
sat_customer_billing_load_date: Date | null;
sat_customer_web_load_date: Date | null;
}
// Query: get all current customer attributes
// Without PIT: 3 subqueries with MAX(load_date) GROUP BY hub_hash_key
// With PIT: simple join using pre-computed load dates-- PIT query (fast)
SELECT
h.business_key,
s1.customer_name,
s1.email,
s2.billing_address,
s3.web_preferences
FROM hub_customer h
JOIN pit_customer pit ON h.hash_key = pit.hub_hash_key
AND pit.snapshot_date = CURRENT_DATE
LEFT JOIN sat_customer_crm s1 ON h.hash_key = s1.hub_hash_key
AND s1.load_date = pit.sat_customer_crm_load_date
LEFT JOIN sat_customer_billing s2 ON h.hash_key = s2.hub_hash_key
AND s2.load_date = pit.sat_customer_billing_load_date
LEFT JOIN sat_customer_web s3 ON h.hash_key = s3.hub_hash_key
AND s3.load_date = pit.sat_customer_web_load_date;Performance Characteristics
Query Performance
Data Vault queries typically require 3-way joins minimum (Hub + Link + Satellite):
For complex queries spanning multiple hubs:
This is more expensive than star schema queries. PIT and Bridge tables mitigate this.
Loading Performance
Data Vault loading is highly parallelizable:
Since Hubs, Links, and Satellites can be loaded independently:
| Operation | Dependencies | Parallelizable |
|---|---|---|
| Hub load | None | Yes (per Hub) |
| Link load | Hub hash keys must exist | Yes (per Link, after Hubs) |
| Satellite load | Hub/Link hash keys must exist | Yes (per Satellite) |
Storage Overhead
Data Vault stores more metadata than star schemas:
Where 48B and 64B account for hash keys, load dates, and record sources per row.
For 100M customers with 5 satellites averaging 3 historical records each:
Edge Cases & Failure Modes
Multi-Active Satellites
Some satellites have multiple active records per Hub at the same time (e.g., a customer with multiple phone numbers):
interface MultiActiveSatellite extends Satellite {
// Additional discriminator key(s) to identify which sub-record
phone_type: 'home' | 'work' | 'mobile';
phone_number: string;
}
// PK: (hub_hash_key, load_date, phone_type)
// This allows multiple active phone numbers per customerEffectivity Satellites
Track when a Link relationship is active or inactive:
interface EffectivitySatellite {
link_hash_key: string;
load_date: Date;
effective_from: Date;
effective_to: Date | null; // null = currently effective
record_source: string;
}
// Employee-Department relationship:
// Active from 2025-01-01 to 2026-02-15 (transferred)
// Active from 2026-02-15 to null (current department)Reference Tables
Low-volatility lookup data (country codes, currency codes) doesn't need full Hub/Satellite treatment:
interface ReferenceTable {
// Simple reference data — no hash keys needed
code: string;
description: string;
load_date: Date;
record_source: string;
}Mathematical Foundations
Graph Theory Perspective
A Data Vault model is a bipartite graph:
Where:
= set of Hubs (business entities) = set of Links (relationships) = edges connecting Links to their participating Hubs
A Link with
Normalization Level
Data Vault achieves a form between 3NF and 6NF:
In practice, Satellites group attributes by source system and rate of change, which is a pragmatic relaxation of 6NF.
Real-World War Stories
War Story
The Banking Merger
Two banks merged, each with different customer master systems. The traditional approach would require:
- Map all customer attributes from both systems
- Define a unified customer schema
- Build complex deduplication logic
- Migrate everything at once
With Data Vault:
- Created Hub_Customer with business keys from both systems
- Created satellites for each bank's customer attributes (no schema unification needed)
- Created Same-As Links for matched customers
- Both banks' data loaded independently, in parallel
- Business Vault applied unified rules over time
Timeline: Traditional approach estimated 18 months. Data Vault approach: 3 months to initial load, ongoing refinement.
War Story
The Regulatory Audit
A healthcare company was audited for HIPAA compliance. Auditors wanted to know: "What was Patient X's address on exactly March 15, 2025, and which system was the source of that information?"
With a star schema (SCD Type 2), they could answer WHEN the address changed but not precisely WHICH SOURCE provided the data and WHEN it was loaded.
With Data Vault:
- Satellite record:
hub_hash_key, load_date=2025-03-10, record_source='EHR_SYSTEM', address='123 Main St' - Next record:
hub_hash_key, load_date=2025-03-20, record_source='BILLING_SYSTEM', address='456 Oak Ave'
On March 15, the address was "123 Main St" from the EHR system, loaded on March 10. Audit question answered in one query.
Decision Framework
When to Use Data Vault
| Scenario | Data Vault? | Why |
|---|---|---|
| Single source, simple analytics | No | Overkill — use star schema |
| 5+ source systems | Yes | Integration strength |
| Regulated industry (audit trail) | Yes | Built-in auditability |
| Agile / iterative delivery | Yes | Additive model |
| Small team (< 3 people) | No | Complexity overhead |
| Real-time analytics | No | Query complexity |
| Historical tracking required | Yes | Insert-only design |
| Frequent schema changes | Yes | Source-independent structure |
Data Vault vs. Star Schema vs. 3NF
| Aspect | 3NF | Star Schema | Data Vault |
|---|---|---|---|
| Load complexity | Simple | Medium | Medium-High |
| Query complexity | High | Low | High (without PIT) |
| Schema changes | Hard | Medium | Easy (additive) |
| Audit trail | None | Limited (SCD) | Full |
| Parallel loading | Limited | Limited | Excellent |
| Storage efficiency | Best | Good | Worst |
| Learning curve | Low | Medium | High |
Advanced Topics
Automating Data Vault with dbt
// dbt macro concept for auto-generating Hub loading SQL
interface DbtDataVaultConfig {
hub: {
tableName: string;
businessKey: string;
sourceTable: string;
sourceColumn: string;
};
satellites: Array<{
tableName: string;
hubReference: string;
attributes: string[];
sourceTable: string;
}>;
links: Array<{
tableName: string;
hubs: Array<{ hubName: string; businessKeyColumn: string }>;
sourceTable: string;
}>;
}
// This config generates the entire Data Vault DDL and loading SQL
const orderVaultConfig: DbtDataVaultConfig = {
hub: {
tableName: 'hub_order',
businessKey: 'order_id',
sourceTable: 'stg_orders',
sourceColumn: 'order_id',
},
satellites: [
{
tableName: 'sat_order_details',
hubReference: 'hub_order',
attributes: ['status', 'total_amount', 'shipping_method'],
sourceTable: 'stg_orders',
},
],
links: [
{
tableName: 'link_customer_order',
hubs: [
{ hubName: 'hub_customer', businessKeyColumn: 'customer_id' },
{ hubName: 'hub_order', businessKeyColumn: 'order_id' },
],
sourceTable: 'stg_orders',
},
],
};Data Vault on Modern Cloud Warehouses
Cloud warehouses change the Data Vault calculus:
- Snowflake/BigQuery: Columnar storage makes the join overhead of Data Vault less painful
- Semi-structured support: Satellites can store JSON payloads, reducing the need for separate satellites per attribute group
- Auto-clustering: Reduces the need for manual partitioning of Hub and Satellite tables
- Time travel: Provides some of the auditability that Data Vault's insert-only design was meant to provide
Research: Ensemble Modeling
An emerging approach that combines Data Vault for the integration layer with dimensional models for the presentation layer, using automated metadata-driven transformations.
Cross-References
- Data Modeling Overview — Paradigm comparison
- Dimensional Modeling — Star/snowflake for presentation layer
- Slowly Changing Dimensions — SCD handling in Data Vault satellites
- Schema Evolution — How Data Vault handles schema changes
- Data Lineage — Traceability through Data Vault
Key Takeaway
- Data Vault uses three entity types: Hubs (business keys), Links (relationships), and Satellites (descriptive attributes with full history) for enterprise-scale data integration.
- Hash keys on business keys enable parallel, insert-only loading from multiple sources with no coordination between teams.
- Data Vault is the integration layer, not the presentation layer -- always build star schemas (Business Vault or Information Mart) on top for analytics.
Exercise
Model a Multi-Source Customer Data Integration
You have customer data arriving from three source systems:
- CRM (Salesforce):
customer_id,name,email,segment,owner - E-commerce:
user_id,email,signup_date,last_login,preferences - Support (Zendesk):
contact_id,email,name,plan_tier,ticket_count
All three systems represent the same customers but with different keys. Design a Data Vault model that:
- Creates a Hub for the customer business entity
- Creates Satellites for each source's attributes
- Handles the key resolution problem (different IDs for the same person)
- Supports loading from all three sources in parallel
Solution
Hub: hub_customer
hub_customer_hash_key= MD5(UPPER(TRIM(email))) -- email is the business key across systemsload_date,record_sourceemail(the business key)
Satellites:
sat_customer_crm-- name, segment, owner, load_date, hash_diff, record_source='CRM'sat_customer_ecommerce-- signup_date, last_login, preferences, load_date, hash_diff, record_source='ECOM'sat_customer_support-- plan_tier, ticket_count, load_date, hash_diff, record_source='ZENDESK'
Key Resolution:
- Use email (normalized: UPPER + TRIM) as the business key for hub hash generation.
- Create a
sat_customer_source_keyssatellite mapping each source system's native ID to the hub:{crm_customer_id, ecom_user_id, zendesk_contact_id}. - If email is unreliable, add a Link (
link_customer_identity) connecting hub records that represent the same person via fuzzy matching.
Parallel Loading:
- Each source loads independently: insert into Hub (ON CONFLICT DO NOTHING), then insert new Satellite rows (only when hash_diff changes). No coordination needed because hub inserts are idempotent on the hash key.
Common Misconceptions
- "Data Vault replaces star schemas." Data Vault is an integration/historical tracking layer. You still need star schemas (or materialized views) for analytics and BI consumption.
- "Data Vault is only for large enterprises." Small teams with a single data source get little benefit from the Hub/Link/Satellite overhead. Data Vault shines with multiple source systems and audit requirements.
- "Hash keys eliminate the need for business key analysis." You still must identify the correct business key for each hub. Hashing the wrong key produces a wrong model. Hash keys just make loading faster.
- "Satellites should store all attributes from all sources." Each satellite should contain attributes from ONE source or ONE domain. Mixing sources in a satellite defeats the purpose of parallel, independent loading.
In Production
- ING Bank uses Data Vault 2.0 to integrate data from 200+ source systems across their global banking operations, with full audit trail for regulatory compliance (Basel III, GDPR).
- Uber adopted Data Vault principles for their data lake integration layer, using hash-based keys for parallel loading of ride, payment, and driver data from separate microservices.
- USAA (financial services) uses Data Vault to maintain complete history of all customer interactions across banking, insurance, and investment systems for regulatory reporting.
- AutoTrader UK uses Data Vault for their vehicle, dealer, and listing data integration, supporting 15+ source systems with automated daily loading.
Quiz
1. What are the three core entity types in Data Vault?
A) Fact, Dimension, Bridge B) Hub, Link, Satellite C) Source, Staging, Target D) Entity, Attribute, Relationship
Answer
B) Hub (unique business keys), Link (relationships between hubs), and Satellite (descriptive attributes with full change history). Together they form a flexible, auditable integration layer.
2. Why are hash keys used instead of natural keys in Data Vault?
A) Hash keys are shorter B) Hash keys enable parallel loading from multiple sources without sequence coordination, and provide consistent key width for joins C) Hash keys are more secure D) Natural keys are not supported in modern databases
Answer
B) Hash keys (MD5/SHA-256 of the business key) have consistent length, enable parallel insert-only loading from multiple sources without needing a central sequence generator, and make joins faster due to fixed-width key comparisons.
3. What is a Business Vault?
A) A physical vault for storing backup tapes B) A layer of derived tables built on top of the Raw Data Vault that applies business rules, calculations, and soft rules C) A vault that stores business documents D) The production database
Answer
B) The Business Vault adds computed attributes, business rules (e.g., "customer lifetime value"), and soft relationships that require business logic. It sits between the Raw Vault (pure source data) and the Information Mart (star schemas for analytics).
4. How does Data Vault handle schema changes from source systems?
A) It requires a full rebuild of the model B) New attributes are added as new columns in existing satellites, or as new satellites -- no existing structure changes C) Schema changes are not supported D) All data must be reloaded from scratch
Answer
B) Adding a new field from a source = add a column to the existing satellite or create a new satellite. Removing a field = stop populating it (NULLs in new rows). The insert-only pattern means historical data is never modified.
5. When would you NOT choose Data Vault?
A) When you have 200+ source systems B) When regulatory compliance requires full audit trails C) When you have a single source system and a small team with no audit requirements D) When you need parallel loading from multiple sources
Answer
C) Data Vault's Hub/Link/Satellite overhead provides little benefit when you have one source, no audit requirements, and a small team. A simple star schema or even a denormalized table is more practical in that scenario.
:::
One-Liner Summary: Data Vault separates business identity (Hubs), relationships (Links), and change history (Satellites) to enable parallel, audit-ready integration from any number of source systems.