Data Vault Modeling

Why Data Vault Exists

Traditional dimensional models (star schemas) work well for single-source, stable-schema data warehouses. But enterprises face:

Multiple source systems — Each with different schemas, keys, and semantics
Frequent source changes — Systems get replaced, APIs change, new feeds arrive
Audit requirements — Regulators need full traceability of every data change
Parallel development — Multiple teams need to load data simultaneously without conflicts
Agile delivery — Business can't wait 6 months for a monolithic data warehouse design

Data Vault, created by Dan Linstedt in the early 2000s, addresses all five by separating structure (Hubs and Links) from context (Satellites), enabling parallel, additive development.

Historical Context

2000: Dan Linstedt develops Data Vault at the US Department of Defense
2013: Data Vault 2.0 introduces hash keys, multi-active satellites, and Business Vault
2015-present: Adoption grows in regulated industries (banking, healthcare, government)
2020s: dbt and modern ELT tools make Data Vault implementation more accessible

The Core Philosophy

"The only constant is change. Model for change, not for stability."

Data Vault embraces change by making the model additive — new sources and entities are added without modifying existing structures. Nothing is ever updated or deleted.

First Principles

The Three Core Components

Hubs: Business Entity Identity

A Hub represents a unique business concept identified by its business key. Nothing more.

Rules:

Contains only the business key, hash key, load date, and record source
One Hub per unique business concept (Customer, Product, Order, Account)
Never updated — only inserted
Business key is the natural key from the source system

typescript

interface Hub {
  hash_key: string;          // MD5/SHA-256 of business key (PK)
  business_key: string;      // Natural key from source
  load_date: Date;           // When first loaded
  record_source: string;     // Source system identifier
}

// Example: Customer Hub
interface HubCustomer extends Hub {
  // hash_key = hash('CUST-12345')
  // business_key = 'CUST-12345'
  // load_date = 2026-03-18T10:00:00Z
  // record_source = 'CRM_SYSTEM'
}

Satellites: Descriptive Context

A Satellite holds the descriptive attributes that change over time. It's like an SCD Type 2 implementation built into the model.

Rules:

Contains the parent Hub/Link hash key, attributes, hash diff, load date, and record source
One satellite per source system per rate of change
Insert-only — new row on every change
Hash diff enables efficient change detection

typescript

interface Satellite {
  hub_hash_key: string;      // FK to parent Hub
  load_date: Date;           // When this version was loaded (part of composite PK)
  hash_diff: string;         // Hash of all descriptive attributes
  record_source: string;     // Source system
  load_end_date: Date | null; // When superseded (null = current)
}

// Example: Customer attributes from CRM
interface SatCustomerCRM extends Satellite {
  customer_name: string;
  email: string;
  phone: string;
  segment: string;
}

// Example: Customer attributes from billing system
interface SatCustomerBilling extends Satellite {
  billing_address: string;
  payment_method: string;
  credit_limit: number;
}

Links: Relationships

A Link captures the relationship between two or more Hubs. It represents a business event or association.

Rules:

Contains hash keys of all participating Hubs
Has its own hash key (hash of all participating business keys)
Insert-only
Represents N:M relationships by default

typescript

interface Link {
  hash_key: string;            // Hash of all participating business keys (PK)
  load_date: Date;
  record_source: string;
}

// Example: Customer-Order relationship
interface LinkCustomerOrder extends Link {
  hub_customer_hash_key: string;   // FK to Hub_Customer
  hub_order_hash_key: string;      // FK to Hub_Order
}

// Example: Order-Product relationship (with quantity in link satellite)
interface LinkOrderProduct extends Link {
  hub_order_hash_key: string;
  hub_product_hash_key: string;
}

Hash Keys

Data Vault 2.0 introduced hash keys as the primary mechanism for key management. This is one of the most debated aspects of Data Vault.

Why Hash Keys?

Deterministic: Same input always produces the same hash — enables parallel, idempotent loading
Source-independent: No need for a centralized sequence generator
Equality joins: Hash-to-hash joins are fast (fixed-width comparison)
Cross-system integration: Same business key from different sources produces the same hash

Hash Key Generation

typescript

import { createHash } from 'crypto';

class DataVaultHasher {
  private readonly separator = '||';
  private readonly nullReplacement = '^^';

  /**
   * Generate a hash key from one or more business key components.
   * All inputs are upper-cased and trimmed for consistency.
   */
  hashKey(...components: Array<string | number | null>): string {
    const normalized = components.map((c) => {
      if (c === null || c === undefined) return this.nullReplacement;
      return String(c).trim().toUpperCase();
    });

    const concatenated = normalized.join(this.separator);

    return createHash('md5').update(concatenated).digest('hex');
  }

  /**
   * Generate a hash diff from satellite attributes.
   * Used for change detection — if hash diff hasn't changed,
   * don't insert a new satellite record.
   */
  hashDiff(attributes: Record<string, unknown>): string {
    const sorted = Object.keys(attributes).sort();
    const values = sorted.map((key) => {
      const val = attributes[key];
      if (val === null || val === undefined) return this.nullReplacement;
      return String(val).trim().toUpperCase();
    });

    const concatenated = values.join(this.separator);
    return createHash('md5').update(concatenated).digest('hex');
  }
}

const hasher = new DataVaultHasher();

// Hub hash key
const customerHashKey = hasher.hashKey('CUST-12345');
// Always produces the same hash for the same customer

// Link hash key (composite)
const orderProductHashKey = hasher.hashKey('ORD-789', 'PROD-456');

// Satellite hash diff (for change detection)
const hashDiff = hasher.hashDiff({
  name: 'John Doe',
  email: 'john@example.com',
  segment: 'Premium',
});

WARNING

Hash collision risk: MD5 has a collision probability of approximately $\frac{1}{2^{64}}$ for randomly distributed inputs. For a warehouse with 1 billion entities, the probability of at least one collision is:

P (collision) \approx 1 - e^{- \frac{n^{2}}{2 \times 2^{128}}} \approx 1.47 \times 10^{- 21}

This is negligibly small. However, if you're uncomfortable with MD5, use SHA-256 at the cost of larger key sizes (32 bytes vs 16 bytes).

Loading Patterns

Hub Loading

typescript

interface HubLoadResult {
  inserted: number;
  skipped: number; // Already existed
}

async function loadHub(
  sourceRecords: Array<{ businessKey: string; recordSource: string }>,
  hasher: DataVaultHasher,
  db: Database,
): Promise<HubLoadResult> {
  let inserted = 0;
  let skipped = 0;

  for (const record of sourceRecords) {
    const hashKey = hasher.hashKey(record.businessKey);

    // INSERT only if not exists (idempotent)
    const result = await db.query(
      `INSERT INTO hub_customer (hash_key, business_key, load_date, record_source)
       SELECT $1, $2, CURRENT_TIMESTAMP, $3
       WHERE NOT EXISTS (
         SELECT 1 FROM hub_customer WHERE hash_key = $1
       )`,
      [hashKey, record.businessKey, record.recordSource],
    );

    if (result.rowCount > 0) {
      inserted++;
    } else {
      skipped++;
    }
  }

  return { inserted, skipped };
}

interface Database {
  query(sql: string, params: unknown[]): Promise<{ rowCount: number }>;
}

Satellite Loading with Change Detection

typescript

interface SatelliteLoadResult {
  inserted: number;
  unchanged: number;
}

async function loadSatellite(
  sourceRecords: Array<{
    businessKey: string;
    attributes: Record<string, unknown>;
    recordSource: string;
  }>,
  hasher: DataVaultHasher,
  db: Database,
  satelliteTable: string,
): Promise<SatelliteLoadResult> {
  let inserted = 0;
  let unchanged = 0;

  for (const record of sourceRecords) {
    const hubHashKey = hasher.hashKey(record.businessKey);
    const newHashDiff = hasher.hashDiff(record.attributes);

    // Check if the latest satellite record has the same hash diff
    const existing = await db.query(
      `SELECT hash_diff FROM ${satelliteTable}
       WHERE hub_hash_key = $1
       AND load_end_date IS NULL
       ORDER BY load_date DESC
       LIMIT 1`,
      [hubHashKey],
    );

    if (existing.rowCount > 0 && (existing as any).rows[0].hash_diff === newHashDiff) {
      unchanged++;
      continue; // No change — skip
    }

    // End-date the previous current record
    if (existing.rowCount > 0) {
      await db.query(
        `UPDATE ${satelliteTable}
         SET load_end_date = CURRENT_TIMESTAMP
         WHERE hub_hash_key = $1 AND load_end_date IS NULL`,
        [hubHashKey],
      );
    }

    // Insert new satellite record
    const columns = Object.keys(record.attributes);
    const values = Object.values(record.attributes);
    const placeholders = columns.map((_, i) => `$${i + 5}`);

    await db.query(
      `INSERT INTO ${satelliteTable}
       (hub_hash_key, load_date, load_end_date, hash_diff, record_source, ${columns.join(', ')})
       VALUES ($1, CURRENT_TIMESTAMP, NULL, $2, $3, ${placeholders.join(', ')})`,
      [hubHashKey, newHashDiff, record.recordSource, ...values],
    );

    inserted++;
  }

  return { inserted, unchanged };
}

Link Loading

typescript

async function loadLink(
  sourceRecords: Array<{
    hubKeys: Array<{ hubName: string; businessKey: string }>;
    recordSource: string;
  }>,
  hasher: DataVaultHasher,
  db: Database,
  linkTable: string,
): Promise<{ inserted: number; skipped: number }> {
  let inserted = 0;
  let skipped = 0;

  for (const record of sourceRecords) {
    const businessKeys = record.hubKeys.map((h) => h.businessKey);
    const linkHashKey = hasher.hashKey(...businessKeys);

    const hubHashKeys = record.hubKeys.map((h) => ({
      column: `hub_${h.hubName}_hash_key`,
      value: hasher.hashKey(h.businessKey),
    }));

    const hubColumns = hubHashKeys.map((h) => h.column).join(', ');
    const hubPlaceholders = hubHashKeys.map((_, i) => `$${i + 4}`).join(', ');
    const hubValues = hubHashKeys.map((h) => h.value);

    const result = await db.query(
      `INSERT INTO ${linkTable}
       (hash_key, load_date, record_source, ${hubColumns})
       SELECT $1, CURRENT_TIMESTAMP, $2, ${hubPlaceholders}
       WHERE NOT EXISTS (
         SELECT 1 FROM ${linkTable} WHERE hash_key = $1
       )`,
      [linkHashKey, record.recordSource, ...hubValues],
    );

    if (result.rowCount > 0) {
      inserted++;
    } else {
      skipped++;
    }
  }

  return { inserted, skipped };
}

Business Vault

The Raw Vault stores data as-is from sources. The Business Vault applies business rules and transformations on top.

Same-As Links (Entity Resolution)

When the same real-world entity has different business keys in different systems:

typescript

interface SameAsLink {
  hash_key: string;           // Hash of both business keys
  hub_customer_master_hash_key: string;  // "Master" record
  hub_customer_duplicate_hash_key: string; // "Duplicate" record
  load_date: Date;
  record_source: string;      // 'ENTITY_RESOLUTION_ENGINE'
  confidence_score: number;   // 0.0 to 1.0
}

// CRM says "John Doe, john@email.com" = CUST-001
// Billing says "J. Doe, john@email.com" = CUST-A001
// Same-As Link: CUST-001 <-> CUST-A001, confidence=0.95

Point-in-Time (PIT) Tables

Satellites store history, but querying the current state of an entity requires joining multiple satellites and finding the latest record for each. PIT tables pre-compute this:

typescript

interface PointInTimeRow {
  hub_hash_key: string;
  snapshot_date: Date;
  // Latest load_date from each satellite as of snapshot_date
  sat_customer_crm_load_date: Date | null;
  sat_customer_billing_load_date: Date | null;
  sat_customer_web_load_date: Date | null;
}

// Query: get all current customer attributes
// Without PIT: 3 subqueries with MAX(load_date) GROUP BY hub_hash_key
// With PIT: simple join using pre-computed load dates

sql

-- PIT query (fast)
SELECT
    h.business_key,
    s1.customer_name,
    s1.email,
    s2.billing_address,
    s3.web_preferences
FROM hub_customer h
JOIN pit_customer pit ON h.hash_key = pit.hub_hash_key
    AND pit.snapshot_date = CURRENT_DATE
LEFT JOIN sat_customer_crm s1 ON h.hash_key = s1.hub_hash_key
    AND s1.load_date = pit.sat_customer_crm_load_date
LEFT JOIN sat_customer_billing s2 ON h.hash_key = s2.hub_hash_key
    AND s2.load_date = pit.sat_customer_billing_load_date
LEFT JOIN sat_customer_web s3 ON h.hash_key = s3.hub_hash_key
    AND s3.load_date = pit.sat_customer_web_load_date;

Performance Characteristics

Query Performance

Data Vault queries typically require 3-way joins minimum (Hub + Link + Satellite):

Basic query cost = | H | ⋈ | L | ⋈ | S | = O (| H | + | L | + | S |) with hash joins

For complex queries spanning multiple hubs:

Complex query cost = \prod_{i} | L_{i} | \times \sum_{j} | S_{j} |

This is more expensive than star schema queries. PIT and Bridge tables mitigate this.

Loading Performance

Data Vault loading is highly parallelizable:

{Load time}_{parallel} = max (Hub loads, Link loads, Satellite loads)

Since Hubs, Links, and Satellites can be loaded independently:

Operation	Dependencies	Parallelizable
Hub load	None	Yes (per Hub)
Link load	Hub hash keys must exist	Yes (per Link, after Hubs)
Satellite load	Hub/Link hash keys must exist	Yes (per Satellite)

Storage Overhead

Data Vault stores more metadata than star schemas:

Overhead = \sum_{hubs} | H | \times 48 B + \sum_{links} | L | \times 64 B + \sum_{sats} | S | \times 48 B

Where 48B and 64B account for hash keys, load dates, and record sources per row.

For 100M customers with 5 satellites averaging 3 historical records each:

Hub : 100 M \times 48 B = 4.8 GB

Satellites : 100 M \times 5 \times 3 \times (48 B + attribute_size)

Edge Cases & Failure Modes

Multi-Active Satellites

Some satellites have multiple active records per Hub at the same time (e.g., a customer with multiple phone numbers):

typescript

interface MultiActiveSatellite extends Satellite {
  // Additional discriminator key(s) to identify which sub-record
  phone_type: 'home' | 'work' | 'mobile';
  phone_number: string;
}

// PK: (hub_hash_key, load_date, phone_type)
// This allows multiple active phone numbers per customer

Effectivity Satellites

Track when a Link relationship is active or inactive:

typescript

interface EffectivitySatellite {
  link_hash_key: string;
  load_date: Date;
  effective_from: Date;
  effective_to: Date | null; // null = currently effective
  record_source: string;
}

// Employee-Department relationship:
// Active from 2025-01-01 to 2026-02-15 (transferred)
// Active from 2026-02-15 to null (current department)

Reference Tables

Low-volatility lookup data (country codes, currency codes) doesn't need full Hub/Satellite treatment:

typescript

interface ReferenceTable {
  // Simple reference data — no hash keys needed
  code: string;
  description: string;
  load_date: Date;
  record_source: string;
}

Mathematical Foundations

Graph Theory Perspective

A Data Vault model is a bipartite graph:

G = (H \cup L, E)

Where:

$H$ = set of Hubs (business entities)
$L$ = set of Links (relationships)
$E$ = edges connecting Links to their participating Hubs

A Link with $n$ participating Hubs has degree $n$ in the graph.

Normalization Level

Data Vault achieves a form between 3NF and 6NF:

Hub \in 3 NF (single candidate key)

Satellite \approx 6 NF (one attribute per table, relaxed)

In practice, Satellites group attributes by source system and rate of change, which is a pragmatic relaxation of 6NF.

Real-World War Stories

War Story

The Banking Merger

Two banks merged, each with different customer master systems. The traditional approach would require:

Map all customer attributes from both systems
Define a unified customer schema
Build complex deduplication logic
Migrate everything at once

With Data Vault:

Created Hub_Customer with business keys from both systems
Created satellites for each bank's customer attributes (no schema unification needed)
Created Same-As Links for matched customers
Both banks' data loaded independently, in parallel
Business Vault applied unified rules over time

Timeline: Traditional approach estimated 18 months. Data Vault approach: 3 months to initial load, ongoing refinement.

War Story

The Regulatory Audit

A healthcare company was audited for HIPAA compliance. Auditors wanted to know: "What was Patient X's address on exactly March 15, 2025, and which system was the source of that information?"

With a star schema (SCD Type 2), they could answer WHEN the address changed but not precisely WHICH SOURCE provided the data and WHEN it was loaded.

With Data Vault:

Satellite record: hub_hash_key, load_date=2025-03-10, record_source='EHR_SYSTEM', address='123 Main St'
Next record: hub_hash_key, load_date=2025-03-20, record_source='BILLING_SYSTEM', address='456 Oak Ave'

On March 15, the address was "123 Main St" from the EHR system, loaded on March 10. Audit question answered in one query.

Decision Framework

When to Use Data Vault

Scenario	Data Vault?	Why
Single source, simple analytics	No	Overkill — use star schema
5+ source systems	Yes	Integration strength
Regulated industry (audit trail)	Yes	Built-in auditability
Agile / iterative delivery	Yes	Additive model
Small team (< 3 people)	No	Complexity overhead
Real-time analytics	No	Query complexity
Historical tracking required	Yes	Insert-only design
Frequent schema changes	Yes	Source-independent structure

Data Vault vs. Star Schema vs. 3NF

Aspect	3NF	Star Schema	Data Vault
Load complexity	Simple	Medium	Medium-High
Query complexity	High	Low	High (without PIT)
Schema changes	Hard	Medium	Easy (additive)
Audit trail	None	Limited (SCD)	Full
Parallel loading	Limited	Limited	Excellent
Storage efficiency	Best	Good	Worst
Learning curve	Low	Medium	High

Advanced Topics

Automating Data Vault with dbt

typescript

// dbt macro concept for auto-generating Hub loading SQL
interface DbtDataVaultConfig {
  hub: {
    tableName: string;
    businessKey: string;
    sourceTable: string;
    sourceColumn: string;
  };
  satellites: Array<{
    tableName: string;
    hubReference: string;
    attributes: string[];
    sourceTable: string;
  }>;
  links: Array<{
    tableName: string;
    hubs: Array<{ hubName: string; businessKeyColumn: string }>;
    sourceTable: string;
  }>;
}

// This config generates the entire Data Vault DDL and loading SQL
const orderVaultConfig: DbtDataVaultConfig = {
  hub: {
    tableName: 'hub_order',
    businessKey: 'order_id',
    sourceTable: 'stg_orders',
    sourceColumn: 'order_id',
  },
  satellites: [
    {
      tableName: 'sat_order_details',
      hubReference: 'hub_order',
      attributes: ['status', 'total_amount', 'shipping_method'],
      sourceTable: 'stg_orders',
    },
  ],
  links: [
    {
      tableName: 'link_customer_order',
      hubs: [
        { hubName: 'hub_customer', businessKeyColumn: 'customer_id' },
        { hubName: 'hub_order', businessKeyColumn: 'order_id' },
      ],
      sourceTable: 'stg_orders',
    },
  ],
};

Data Vault on Modern Cloud Warehouses

Cloud warehouses change the Data Vault calculus:

Snowflake/BigQuery: Columnar storage makes the join overhead of Data Vault less painful
Semi-structured support: Satellites can store JSON payloads, reducing the need for separate satellites per attribute group
Auto-clustering: Reduces the need for manual partitioning of Hub and Satellite tables
Time travel: Provides some of the auditability that Data Vault's insert-only design was meant to provide

Research: Ensemble Modeling

An emerging approach that combines Data Vault for the integration layer with dimensional models for the presentation layer, using automated metadata-driven transformations.

Cross-References

Data Modeling Overview — Paradigm comparison
Dimensional Modeling — Star/snowflake for presentation layer
Slowly Changing Dimensions — SCD handling in Data Vault satellites
Schema Evolution — How Data Vault handles schema changes
Data Lineage — Traceability through Data Vault

Key Takeaway

Data Vault uses three entity types: Hubs (business keys), Links (relationships), and Satellites (descriptive attributes with full history) for enterprise-scale data integration.
Hash keys on business keys enable parallel, insert-only loading from multiple sources with no coordination between teams.
Data Vault is the integration layer, not the presentation layer -- always build star schemas (Business Vault or Information Mart) on top for analytics.

Exercise

Model a Multi-Source Customer Data Integration

You have customer data arriving from three source systems:

CRM (Salesforce): customer_id, name, email, segment, owner
E-commerce: user_id, email, signup_date, last_login, preferences
Support (Zendesk): contact_id, email, name, plan_tier, ticket_count

All three systems represent the same customers but with different keys. Design a Data Vault model that:

Creates a Hub for the customer business entity
Creates Satellites for each source's attributes
Handles the key resolution problem (different IDs for the same person)
Supports loading from all three sources in parallel

Solution

Hub: hub_customer

hub_customer_hash_key = MD5(UPPER(TRIM(email))) -- email is the business key across systems
load_date, record_source
email (the business key)

Satellites:

sat_customer_crm -- name, segment, owner, load_date, hash_diff, record_source='CRM'
sat_customer_ecommerce -- signup_date, last_login, preferences, load_date, hash_diff, record_source='ECOM'
sat_customer_support -- plan_tier, ticket_count, load_date, hash_diff, record_source='ZENDESK'

Key Resolution:

Use email (normalized: UPPER + TRIM) as the business key for hub hash generation.
Create a sat_customer_source_keys satellite mapping each source system's native ID to the hub: {crm_customer_id, ecom_user_id, zendesk_contact_id}.
If email is unreliable, add a Link (link_customer_identity) connecting hub records that represent the same person via fuzzy matching.

Parallel Loading:

Each source loads independently: insert into Hub (ON CONFLICT DO NOTHING), then insert new Satellite rows (only when hash_diff changes). No coordination needed because hub inserts are idempotent on the hash key.

Common Misconceptions

"Data Vault replaces star schemas." Data Vault is an integration/historical tracking layer. You still need star schemas (or materialized views) for analytics and BI consumption.
"Data Vault is only for large enterprises." Small teams with a single data source get little benefit from the Hub/Link/Satellite overhead. Data Vault shines with multiple source systems and audit requirements.
"Hash keys eliminate the need for business key analysis." You still must identify the correct business key for each hub. Hashing the wrong key produces a wrong model. Hash keys just make loading faster.
"Satellites should store all attributes from all sources." Each satellite should contain attributes from ONE source or ONE domain. Mixing sources in a satellite defeats the purpose of parallel, independent loading.

In Production

ING Bank uses Data Vault 2.0 to integrate data from 200+ source systems across their global banking operations, with full audit trail for regulatory compliance (Basel III, GDPR).
Uber adopted Data Vault principles for their data lake integration layer, using hash-based keys for parallel loading of ride, payment, and driver data from separate microservices.
USAA (financial services) uses Data Vault to maintain complete history of all customer interactions across banking, insurance, and investment systems for regulatory reporting.
AutoTrader UK uses Data Vault for their vehicle, dealer, and listing data integration, supporting 15+ source systems with automated daily loading.

Quiz

1. What are the three core entity types in Data Vault?

A) Fact, Dimension, Bridge B) Hub, Link, Satellite C) Source, Staging, Target D) Entity, Attribute, Relationship

Answer

B) Hub (unique business keys), Link (relationships between hubs), and Satellite (descriptive attributes with full change history). Together they form a flexible, auditable integration layer.

2. Why are hash keys used instead of natural keys in Data Vault?

A) Hash keys are shorter B) Hash keys enable parallel loading from multiple sources without sequence coordination, and provide consistent key width for joins C) Hash keys are more secure D) Natural keys are not supported in modern databases

Answer

B) Hash keys (MD5/SHA-256 of the business key) have consistent length, enable parallel insert-only loading from multiple sources without needing a central sequence generator, and make joins faster due to fixed-width key comparisons.

3. What is a Business Vault?

A) A physical vault for storing backup tapes B) A layer of derived tables built on top of the Raw Data Vault that applies business rules, calculations, and soft rules C) A vault that stores business documents D) The production database

Answer

B) The Business Vault adds computed attributes, business rules (e.g., "customer lifetime value"), and soft relationships that require business logic. It sits between the Raw Vault (pure source data) and the Information Mart (star schemas for analytics).

4. How does Data Vault handle schema changes from source systems?

A) It requires a full rebuild of the model B) New attributes are added as new columns in existing satellites, or as new satellites -- no existing structure changes C) Schema changes are not supported D) All data must be reloaded from scratch

Answer

B) Adding a new field from a source = add a column to the existing satellite or create a new satellite. Removing a field = stop populating it (NULLs in new rows). The insert-only pattern means historical data is never modified.

5. When would you NOT choose Data Vault?

A) When you have 200+ source systems B) When regulatory compliance requires full audit trails C) When you have a single source system and a small team with no audit requirements D) When you need parallel loading from multiple sources

Answer

C) Data Vault's Hub/Link/Satellite overhead provides little benefit when you have one source, no audit requirements, and a small team. A simple star schema or even a denormalized table is more practical in that scenario.

:::

One-Liner Summary: Data Vault separates business identity (Hubs), relationships (Links), and change history (Satellites) to enable parallel, audit-ready integration from any number of source systems.

Data Vault Modeling ​

Why Data Vault Exists ​

Historical Context ​

The Core Philosophy ​

First Principles ​

The Three Core Components ​

Hubs: Business Entity Identity ​

Satellites: Descriptive Context ​

Links: Relationships ​

Hash Keys ​

Why Hash Keys? ​

Hash Key Generation ​

Loading Patterns ​

Hub Loading ​

Satellite Loading with Change Detection ​

Link Loading ​

Business Vault ​

Same-As Links (Entity Resolution) ​

Point-in-Time (PIT) Tables ​

Performance Characteristics ​

Query Performance ​

Loading Performance ​

Storage Overhead ​

Edge Cases & Failure Modes ​

Multi-Active Satellites ​

Effectivity Satellites ​

Reference Tables ​

Mathematical Foundations ​

Graph Theory Perspective ​

Normalization Level ​

Real-World War Stories ​

Decision Framework ​

When to Use Data Vault ​

Data Vault vs. Star Schema vs. 3NF ​

Advanced Topics ​

Automating Data Vault with dbt ​

Data Vault on Modern Cloud Warehouses ​

Research: Ensemble Modeling ​

Cross-References ​

Related Pages

Data Vault Modeling

Why Data Vault Exists

Historical Context

The Core Philosophy

First Principles

The Three Core Components

Hubs: Business Entity Identity

Satellites: Descriptive Context

Links: Relationships

Hash Keys

Why Hash Keys?

Hash Key Generation

Loading Patterns

Hub Loading

Satellite Loading with Change Detection

Link Loading

Business Vault

Same-As Links (Entity Resolution)

Point-in-Time (PIT) Tables

Performance Characteristics

Query Performance

Loading Performance

Storage Overhead

Edge Cases & Failure Modes

Multi-Active Satellites

Effectivity Satellites

Reference Tables

Mathematical Foundations

Graph Theory Perspective

Normalization Level

Real-World War Stories

Decision Framework

When to Use Data Vault

Data Vault vs. Star Schema vs. 3NF

Advanced Topics

Automating Data Vault with dbt

Data Vault on Modern Cloud Warehouses

Research: Ensemble Modeling

Cross-References