Data Quality Checks

Why Data Quality Checks Exist

Bad data is worse than no data. No data causes a visible failure — dashboards show errors, models refuse to train, reports are empty. Bad data causes silent failures — dashboards show wrong numbers, models train on garbage, executives make decisions based on fiction.

Studies consistently show that data quality issues cost organizations 15-25% of revenue. The cost of fixing bad data increases exponentially the further downstream it gets:

{Cost}_{fix} (s t a g e) = C_{0} \times 10^{s t a g e}

Where stage 0 = source, stage 1 = pipeline, stage 2 = warehouse, stage 3 = report, stage 4 = business decision.

Historical Context

2000s: Manual data quality checks in ETL tools (Informatica Data Quality)
2010s: SQL-based testing (custom scripts, stored procedures)
2018: Great Expectations launched — declarative data quality testing
2019: dbt tests — data quality integrated into transformation workflows
2022: Data contracts emerge — producer-consumer quality agreements
2024-2025: ML-powered anomaly detection for data quality becomes mainstream

First Principles

Dimensions of Data Quality

Dimension	Definition	Example Check
Accuracy	Data reflects reality	Revenue matches financial system
Completeness	Required data is present	email is not null for active users
Consistency	Data agrees across systems	Order count in DW = count in source
Timeliness	Data arrives on schedule	Table updated within last 2 hours
Uniqueness	No unintended duplicates	order_id is unique in fact_orders
Validity	Data conforms to rules	age is between 0 and 150

The Testing Pyramid for Data

dbt Tests

Built-In Tests

sql

-- schema.yml in dbt
version: 2

models:
  - name: fact_orders
    columns:
      - name: order_id
        tests:
          - unique
          - not_null

      - name: customer_id
        tests:
          - not_null
          - relationships:
              to: ref('dim_customer')
              field: customer_id

      - name: order_amount
        tests:
          - not_null
          - dbt_utils.accepted_range:
              min_value: 0
              max_value: 1000000
              inclusive: true

      - name: status
        tests:
          - accepted_values:
              values: ['pending', 'shipped', 'delivered', 'cancelled', 'returned']

Custom dbt Tests

sql

-- tests/assert_total_revenue_positive.sql
-- Custom test: total daily revenue should be positive

SELECT
    order_date,
    SUM(order_amount) as daily_revenue
FROM {{ ref('fact_orders') }}
GROUP BY order_date
HAVING SUM(order_amount) <= 0

sql

-- tests/assert_no_future_orders.sql
-- No orders should have a date in the future

SELECT *
FROM {{ ref('fact_orders') }}
WHERE order_date > CURRENT_DATE

sql

-- tests/assert_row_count_within_range.sql
-- Row count should be within expected range

WITH counts AS (
    SELECT COUNT(*) as row_count
    FROM {{ ref('fact_orders') }}
    WHERE order_date = CURRENT_DATE - INTERVAL '1 day'
)
SELECT *
FROM counts
WHERE row_count < 1000     -- Minimum expected orders per day
   OR row_count > 1000000  -- Maximum expected orders per day

dbt Test Severity Levels

yaml

models:
  - name: fact_orders
    columns:
      - name: order_id
        tests:
          - unique:
              severity: error    # Fails the pipeline
          - not_null:
              severity: error    # Fails the pipeline

      - name: discount_code
        tests:
          - not_null:
              severity: warn     # Logs warning, doesn't fail
              warn_if: ">100"    # Warn if >100 nulls
              error_if: ">1000"  # Fail if >1000 nulls

Great Expectations

Core Concepts

typescript

interface Expectation {
  expectationType: string;
  kwargs: Record<string, unknown>;
  meta?: Record<string, unknown>;
}

interface ExpectationSuite {
  expectationSuiteName: string;
  expectations: Expectation[];
  meta: {
    greatExpectationsVersion: string;
  };
}

interface ValidationResult {
  success: boolean;
  expectationConfig: Expectation;
  result: {
    observedValue: unknown;
    elementCount: number;
    missingCount: number;
    missingPercent: number;
    unexpectedCount: number;
    unexpectedPercent: number;
    partialUnexpectedList: unknown[];
  };
}

// Example expectation suite for an orders table
const ordersSuite: ExpectationSuite = {
  expectationSuiteName: 'orders_quality_suite',
  expectations: [
    // Table-level expectations
    {
      expectationType: 'expect_table_row_count_to_be_between',
      kwargs: { min_value: 10000, max_value: 10000000 },
    },
    {
      expectationType: 'expect_table_column_count_to_equal',
      kwargs: { value: 12 },
    },

    // Column-level expectations
    {
      expectationType: 'expect_column_values_to_not_be_null',
      kwargs: { column: 'order_id' },
    },
    {
      expectationType: 'expect_column_values_to_be_unique',
      kwargs: { column: 'order_id' },
    },
    {
      expectationType: 'expect_column_values_to_be_between',
      kwargs: { column: 'order_amount', min_value: 0, max_value: 1000000 },
    },
    {
      expectationType: 'expect_column_values_to_be_in_set',
      kwargs: {
        column: 'status',
        value_set: ['pending', 'shipped', 'delivered', 'cancelled'],
      },
    },
    {
      expectationType: 'expect_column_mean_to_be_between',
      kwargs: { column: 'order_amount', min_value: 10, max_value: 500 },
    },

    // Distribution expectations
    {
      expectationType: 'expect_column_stdev_to_be_between',
      kwargs: { column: 'order_amount', min_value: 5, max_value: 200 },
    },
  ],
  meta: {
    greatExpectationsVersion: '0.18.0',
  },
};

Implementing Custom Expectations

typescript

interface CustomExpectation {
  name: string;
  validate(data: DataBatch, params: Record<string, unknown>): ValidationResult;
}

class ExpectColumnCorrelation implements CustomExpectation {
  name = 'expect_column_pair_correlation_to_be_between';

  validate(
    data: DataBatch,
    params: { column_a: string; column_b: string; min_value: number; max_value: number },
  ): ValidationResult {
    const valuesA = data.getColumn(params.column_a) as number[];
    const valuesB = data.getColumn(params.column_b) as number[];

    const correlation = this.pearsonCorrelation(valuesA, valuesB);

    return {
      success: correlation >= params.min_value && correlation <= params.max_value,
      expectationConfig: {
        expectationType: this.name,
        kwargs: params,
      },
      result: {
        observedValue: correlation,
        elementCount: valuesA.length,
        missingCount: 0,
        missingPercent: 0,
        unexpectedCount: correlation < params.min_value || correlation > params.max_value ? 1 : 0,
        unexpectedPercent: 0,
        partialUnexpectedList: [],
      },
    };
  }

  private pearsonCorrelation(x: number[], y: number[]): number {
    const n = x.length;
    const sumX = x.reduce((a, b) => a + b, 0);
    const sumY = y.reduce((a, b) => a + b, 0);
    const sumXY = x.reduce((acc, xi, i) => acc + xi * y[i], 0);
    const sumX2 = x.reduce((acc, xi) => acc + xi * xi, 0);
    const sumY2 = y.reduce((acc, yi) => acc + yi * yi, 0);

    const numerator = n * sumXY - sumX * sumY;
    const denominator = Math.sqrt(
      (n * sumX2 - sumX * sumX) * (n * sumY2 - sumY * sumY),
    );

    return denominator === 0 ? 0 : numerator / denominator;
  }
}

interface DataBatch {
  getColumn(name: string): unknown[];
  rowCount(): number;
}

Data Contracts

Contract Definition

typescript

interface DataContract {
  apiVersion: 'v1';
  kind: 'DataContract';
  metadata: {
    name: string;
    version: string;
    owner: string;
    team: string;
    description: string;
  };
  schema: SchemaDefinition;
  quality: QualityDefinition;
  sla: SLADefinition;
}

interface SchemaDefinition {
  type: 'avro' | 'protobuf' | 'json-schema' | 'sql-ddl';
  fields: Array<{
    name: string;
    type: string;
    description: string;
    required: boolean;
    pii: boolean;
    businessDefinition: string;
    examples: unknown[];
  }>;
}

interface QualityDefinition {
  completeness: Array<{
    field: string;
    threshold: number;  // 0.0 to 1.0
  }>;
  uniqueness: Array<{
    fields: string[];
  }>;
  freshness: {
    maxAgeMinutes: number;
    checkColumn: string;
  };
  customRules: Array<{
    name: string;
    sql: string;
    threshold: number;
  }>;
}

interface SLADefinition {
  availability: number;       // e.g., 0.999
  latency: {
    maxMinutes: number;
    p99Minutes: number;
  };
  support: {
    team: string;
    slackChannel: string;
    oncallRotation: string;
  };
}

// Example data contract
const ordersContract: DataContract = {
  apiVersion: 'v1',
  kind: 'DataContract',
  metadata: {
    name: 'orders',
    version: '2.1.0',
    owner: 'order-service-team',
    team: 'commerce',
    description: 'All customer orders from the commerce platform',
  },
  schema: {
    type: 'json-schema',
    fields: [
      {
        name: 'order_id',
        type: 'string',
        description: 'Unique order identifier (UUID v4)',
        required: true,
        pii: false,
        businessDefinition: 'System-generated unique ID for each order',
        examples: ['550e8400-e29b-41d4-a716-446655440000'],
      },
      {
        name: 'customer_email',
        type: 'string',
        description: 'Customer email address',
        required: true,
        pii: true,
        businessDefinition: 'Email used for order confirmation',
        examples: ['user@example.com'],
      },
    ],
  },
  quality: {
    completeness: [
      { field: 'order_id', threshold: 1.0 },
      { field: 'customer_email', threshold: 0.99 },
      { field: 'order_amount', threshold: 1.0 },
    ],
    uniqueness: [{ fields: ['order_id'] }],
    freshness: {
      maxAgeMinutes: 60,
      checkColumn: 'created_at',
    },
    customRules: [
      {
        name: 'positive_amount',
        sql: "SELECT COUNT(*) FROM orders WHERE order_amount <= 0",
        threshold: 0,
      },
      {
        name: 'valid_status',
        sql: "SELECT COUNT(*) FROM orders WHERE status NOT IN ('pending','shipped','delivered','cancelled')",
        threshold: 0,
      },
    ],
  },
  sla: {
    availability: 0.999,
    latency: { maxMinutes: 60, p99Minutes: 30 },
    support: {
      team: 'commerce-team',
      slackChannel: '#commerce-data',
      oncallRotation: 'commerce-oncall',
    },
  },
};

Contract Validation Engine

typescript

class DataContractValidator {
  async validate(
    contract: DataContract,
    db: Database,
    tableName: string,
  ): Promise<ContractValidationResult> {
    const results: CheckResult[] = [];

    // Schema checks
    for (const field of contract.schema.fields) {
      if (field.required) {
        const nullCount = await this.countNulls(db, tableName, field.name);
        results.push({
          check: `${field.name}_not_null`,
          passed: nullCount === 0,
          details: `${nullCount} null values found`,
        });
      }
    }

    // Completeness checks
    for (const rule of contract.quality.completeness) {
      const completeness = await this.measureCompleteness(
        db, tableName, rule.field,
      );
      results.push({
        check: `${rule.field}_completeness`,
        passed: completeness >= rule.threshold,
        details: `Completeness: ${(completeness * 100).toFixed(2)}% (threshold: ${(rule.threshold * 100).toFixed(2)}%)`,
      });
    }

    // Uniqueness checks
    for (const rule of contract.quality.uniqueness) {
      const duplicates = await this.countDuplicates(
        db, tableName, rule.fields,
      );
      results.push({
        check: `${rule.fields.join('_')}_unique`,
        passed: duplicates === 0,
        details: `${duplicates} duplicate combinations found`,
      });
    }

    // Freshness check
    const freshness = contract.quality.freshness;
    const maxAge = await this.measureFreshness(
      db, tableName, freshness.checkColumn,
    );
    results.push({
      check: 'freshness',
      passed: maxAge <= freshness.maxAgeMinutes,
      details: `Data age: ${maxAge} minutes (max: ${freshness.maxAgeMinutes})`,
    });

    // Custom rules
    for (const rule of contract.quality.customRules) {
      const count = await this.executeCustomRule(db, rule.sql);
      results.push({
        check: rule.name,
        passed: count <= rule.threshold,
        details: `Violations: ${count} (threshold: ${rule.threshold})`,
      });
    }

    return {
      contractName: contract.metadata.name,
      contractVersion: contract.metadata.version,
      timestamp: new Date(),
      allPassed: results.every((r) => r.passed),
      results,
    };
  }

  private async countNulls(
    db: Database, table: string, column: string,
  ): Promise<number> {
    const result = await db.query(
      `SELECT COUNT(*) as cnt FROM ${table} WHERE ${column} IS NULL`, [],
    );
    return (result as any).rows[0].cnt;
  }

  private async measureCompleteness(
    db: Database, table: string, column: string,
  ): Promise<number> {
    const result = await db.query(
      `SELECT
         COUNT(*) as total,
         COUNT(${column}) as non_null
       FROM ${table}`, [],
    );
    const row = (result as any).rows[0];
    return row.total > 0 ? row.non_null / row.total : 1;
  }

  private async countDuplicates(
    db: Database, table: string, columns: string[],
  ): Promise<number> {
    const cols = columns.join(', ');
    const result = await db.query(
      `SELECT COUNT(*) as cnt FROM (
         SELECT ${cols}, COUNT(*) as c FROM ${table}
         GROUP BY ${cols} HAVING COUNT(*) > 1
       ) dupes`, [],
    );
    return (result as any).rows[0].cnt;
  }

  private async measureFreshness(
    db: Database, table: string, column: string,
  ): Promise<number> {
    const result = await db.query(
      `SELECT EXTRACT(EPOCH FROM (NOW() - MAX(${column}))) / 60 as age_minutes
       FROM ${table}`, [],
    );
    return Math.round((result as any).rows[0].age_minutes);
  }

  private async executeCustomRule(db: Database, sql: string): Promise<number> {
    const result = await db.query(sql, []);
    return (result as any).rows[0]?.count ?? (result as any).rows.length;
  }
}

interface CheckResult {
  check: string;
  passed: boolean;
  details: string;
}

interface ContractValidationResult {
  contractName: string;
  contractVersion: string;
  timestamp: Date;
  allPassed: boolean;
  results: CheckResult[];
}

interface Database {
  query(sql: string, params: unknown[]): Promise<unknown>;
}

Anomaly Detection

Statistical Anomaly Detection

typescript

class StatisticalAnomalyDetector {
  private history: Map<string, number[]> = new Map();

  recordMetric(metricName: string, value: number): void {
    const values = this.history.get(metricName) ?? [];
    values.push(value);
    // Keep last 90 days
    if (values.length > 90) values.shift();
    this.history.set(metricName, values);
  }

  detectAnomaly(
    metricName: string,
    currentValue: number,
    sensitivitySigma: number = 3,
  ): AnomalyResult {
    const history = this.history.get(metricName);
    if (!history || history.length < 7) {
      return { isAnomaly: false, reason: 'Insufficient history' };
    }

    const mean = history.reduce((a, b) => a + b, 0) / history.length;
    const variance =
      history.reduce((acc, v) => acc + Math.pow(v - mean, 2), 0) /
      history.length;
    const stddev = Math.sqrt(variance);

    const zScore = stddev > 0 ? (currentValue - mean) / stddev : 0;

    return {
      isAnomaly: Math.abs(zScore) > sensitivitySigma,
      reason: Math.abs(zScore) > sensitivitySigma
        ? `z-score ${zScore.toFixed(2)} exceeds ${sensitivitySigma} sigma threshold`
        : 'Within normal range',
      details: {
        currentValue,
        mean,
        stddev,
        zScore,
        lowerBound: mean - sensitivitySigma * stddev,
        upperBound: mean + sensitivitySigma * stddev,
      },
    };
  }
}

interface AnomalyResult {
  isAnomaly: boolean;
  reason: string;
  details?: {
    currentValue: number;
    mean: number;
    stddev: number;
    zScore: number;
    lowerBound: number;
    upperBound: number;
  };
}

Row Count Monitoring

Expected row count (t) = μ + A \sin (\frac{2 π t}{P})

Where $μ$ is the mean, $A$ is the seasonal amplitude, and $P$ is the period (typically 7 days for weekly seasonality).

Performance Characteristics

Testing Overhead

Test Type	Overhead	Frequency
NOT NULL	O(n) scan	Every load
UNIQUE	O(n log n) sort	Every load
Row count	O(1) metadata	Every load
Referential integrity	O(n) join	Every load
Distribution checks	O(n) aggregate	Daily
Cross-system reconciliation	O(n) per system	Daily/Weekly

Cost-Benefit of Data Quality

{ROI}_{DQ} = \frac{Cost of bad data prevented}{Cost of quality checks}

Typical ratios: 10:1 to 100:1. The cost of a bad dashboard decision far exceeds the compute cost of running quality checks.

Edge Cases & Failure Modes

False Positives

Overly strict quality checks block valid data:

Holiday traffic patterns trigger row count anomaly alerts
Legitimate new product categories fail accepted_values tests
Seasonal revenue fluctuations breach statistical bounds

Mitigation: Use dynamic thresholds based on historical patterns, not static rules.

The Observer Effect

Quality checks that run heavy queries can impact the production database:

typescript

class ThrottledQualityChecker {
  private readonly maxConcurrentChecks: number;
  private activeChecks: number = 0;

  constructor(maxConcurrent: number = 3) {
    this.maxConcurrentChecks = maxConcurrent;
  }

  async runCheck<T>(check: () => Promise<T>): Promise<T> {
    while (this.activeChecks >= this.maxConcurrentChecks) {
      await new Promise((r) => setTimeout(r, 1000));
    }
    this.activeChecks++;
    try {
      return await check();
    } finally {
      this.activeChecks--;
    }
  }
}

Real-World War Stories

War Story

The Silent Schema Change

A source team changed a column from VARCHAR(100) to VARCHAR(50) in their API response. No notification was sent to downstream consumers. The data pipeline silently truncated long values, causing 2,000 customer names to be cut off in the CRM.

The issue was discovered 2 weeks later when a customer complained that their name appeared as "Christopher Alexande" in all communications.

Fix: Added schema validation at ingestion: column types and lengths are checked against a contract. Any schema change triggers an alert before data is loaded.

War Story

The Duplicate Order Disaster

A payment service had an idempotency bug that caused 0.3% of orders to be duplicated. The data pipeline loaded these duplicates into the warehouse. Total revenue reported was $45M instead of $44.85M — a $150K discrepancy.

The error compounded: marketing budgets were set based on the inflated revenue, leading to overspending. The sales team received incorrect commissions. Month-end reconciliation caught it, but by then the damage was done.

Fix: Added a uniqueness check on (order_id) as a BLOCKING test — the pipeline fails if duplicates are detected. Added a daily reconciliation between the warehouse and the payment system's transaction log.

Decision Framework

Quality Check Priority

Check Type	Priority	When to Implement
NOT NULL on PKs	Critical	Day 1
UNIQUE on PKs	Critical	Day 1
Row count bounds	High	Week 1
Referential integrity	High	Week 1
Freshness	High	Week 1
Value range checks	Medium	Month 1
Distribution checks	Medium	Month 1
Cross-system reconciliation	Medium	Quarter 1
ML anomaly detection	Low	Year 1

Advanced Topics

Data Quality Scoring

Assign a composite quality score to each dataset:

Quality Score = \sum_{i = 1}^{n} w_{i} \times s_{i}

Where $w_{i}$ is the weight and $s_{i}$ is the score (0 or 1) for each check.

Data Observability

Beyond testing, continuous monitoring of data health:

typescript

interface DataObservabilityMetrics {
  freshness: Date;
  volume: { rowCount: number; byteSize: number };
  schema: { columnCount: number; typeHash: string };
  distribution: Record<string, { mean: number; stddev: number; nullPct: number }>;
  lineage: { upstreamTables: string[]; downstreamTables: string[] };
}

Cross-References

Pipeline Patterns Overview — Quality checks in pipeline context
Data Lineage — Tracing quality issues to their source
Testing Data Pipelines — Broader testing strategies
Schema Evolution — Schema validation as quality check
Data Contracts — Formal quality agreements

Key Takeaway

Data quality checks should be embedded in pipelines as gates between processing stages -- bad data caught early costs orders of magnitude less to fix than bad data caught downstream.
Use a layered approach: schema validation (shape), statistical checks (distribution), business rules (logic), and freshness checks (timeliness).
Data contracts formalize quality expectations between producers and consumers, shifting responsibility for data quality to the teams that produce it.

Exercise

Design Data Quality Gates for a Medallion Architecture

You are implementing Bronze/Silver/Gold layers for a customer analytics pipeline. Design data quality checks for each transition:

Bronze -> Silver gate: What checks prevent bad data from entering the Silver layer?
Silver -> Gold gate: What checks ensure business-ready data quality?
Gold -> Dashboard gate: What checks prevent wrong numbers on executive dashboards?

For each gate, specify: the check type, threshold, and action on failure (block, warn, or quarantine).

Solution

Bronze -> Silver Gate:

Schema validation: all required columns present, correct types. Action: BLOCK (reject entire batch).
Null check: customer_id NOT NULL. Action: QUARANTINE (route null-ID records to DLQ, pass valid records).
Dedup: no duplicate customer_id per batch. Action: AUTO-FIX (keep most recent by updated_at).
Freshness: data timestamp within 24 hours of current time. Action: WARN (alert data engineering team).

Silver -> Gold Gate:

Referential integrity: all customer_id in fact table exist in dim_customer. Action: BLOCK (missing dimensions must be resolved first).
Statistical checks: row count within 2 standard deviations of 30-day rolling average. Action: WARN (alert if volume anomaly).
Business rules: lifetime_value >= 0, signup_date <= current_date. Action: QUARANTINE (route violations to review).
Completeness: key metrics (revenue, count) not null for any required segment. Action: BLOCK.

Gold -> Dashboard Gate:

Cross-check: total revenue matches finance system to within 0.1%. Action: BLOCK (do not publish to dashboard until reconciled).
Trend check: no metric changes more than 50% day-over-day without explanation. Action: WARN (flag for manual review).
Timeliness: Gold tables refreshed within SLA (by 8 AM UTC). Action: ALERT (escalate to PagerDuty if SLA breached).

Common Misconceptions

"Data quality is the data engineering team's responsibility." Data quality is a shared responsibility. Producers should validate at the source (data contracts), engineering should validate in the pipeline, and consumers should validate at the point of use.
"100% data quality is achievable." Real-world data is messy. The goal is to detect and handle issues, not prevent them entirely. Design for graceful degradation, not perfection.
"Schema validation is sufficient for data quality." Data can have the right schema but wrong values (negative ages, future dates, impossible combinations). Schema validation checks shape; you also need value-level and statistical checks.
"Data quality checks should block the pipeline on any failure." Minor quality issues (0.1% null rate) should warn, not block. Blocking thresholds should be calibrated to avoid false positives that delay critical pipelines.
"Great Expectations and dbt tests solve the same problem." dbt tests run inside the warehouse on materialized data. Great Expectations runs anywhere in the pipeline (before load, during transform, in CI/CD). They are complementary.

In Production

Airbnb uses a data quality score (0-100) for every dataset, computed from completeness, accuracy, freshness, and consistency checks. Datasets below 80 are flagged for remediation.
Uber implements data contracts between producer and consumer teams: the producing team defines and enforces quality SLAs, and any breaking change requires consumer team sign-off.
Netflix runs statistical anomaly detection on key metrics (viewing hours, signup counts) that automatically flags and quarantines data batches that deviate significantly from historical patterns.
LinkedIn built their own data quality framework that runs checks at every layer of their medallion architecture, with automated escalation paths based on severity and business impact.

Quiz

1. What is the cost multiplier for fixing data quality issues at each stage?

A) Cost is the same at every stage B) Roughly 10x per stage downstream -- fixing at source costs $1, at ingestion costs $10, at warehouse costs $100, at dashboard costs $1000 C) Cost decreases as data moves downstream D) Cost only matters at the final stage

Answer

B) Fixing bad data at the source is cheapest. Once bad data flows into the warehouse, it contaminates downstream tables. By the time it reaches dashboards and decisions, the cost includes reprocessing, retracting reports, and restoring trust.

2. What is a data contract?

A) A legal agreement about data ownership B) A formal specification between data producer and consumer defining schema, quality standards, SLAs, and change management procedures C) A database access control policy D) A data encryption standard

Answer

B) A data contract defines what the producer promises (schema, freshness, completeness) and what the consumer expects. It includes SLAs (99.9% completeness), change management (30-day notice for breaking changes), and enforcement (automated validation).

3. When should a data quality check BLOCK the pipeline vs WARN?

A) Always block -- data quality is critical B) Block when the issue affects data correctness for ALL downstream consumers; warn when the issue is minor or affects only some consumers C) Never block -- warnings are sufficient D) Block only in production, warn in development

Answer

B) Blocking thresholds should reflect business impact. A null primary key (data correctness issue) should block. A 0.1% increase in null optional fields should warn. Calibrate thresholds to avoid false positives that delay critical pipelines.

4. What is statistical anomaly detection in data quality?

A) Using machine learning to clean data B) Comparing current data metrics (row count, null rate, value distribution) against historical baselines and flagging significant deviations C) Running statistical tests on SQL queries D) Detecting duplicate records

Answer

B) Statistical checks compare current metrics against rolling averages (e.g., 30-day). If today's row count is 3 standard deviations below the average, it is flagged as an anomaly. This catches volume drops, missing sources, and upstream failures automatically.

5. How do Great Expectations and dbt tests complement each other?

A) They are redundant and you should choose one B) Great Expectations validates data anywhere in the pipeline (before load, in Python, in CI/CD); dbt tests validate materialized data inside the warehouse C) Great Expectations is for streaming; dbt tests are for batch D) Great Expectations replaces dbt tests entirely

Answer

B) Great Expectations can validate data in Python scripts, Spark jobs, or CI/CD pipelines before data enters the warehouse. dbt tests validate data after it is materialized in the warehouse (unique, not_null, relationships). Using both provides defense in depth.

:::

One-Liner Summary: Embed data quality checks as gates between pipeline layers -- catch issues at ingestion for $1 instead of at the executive dashboard for $1000.

Data Quality Checks ​

Why Data Quality Checks Exist ​

Historical Context ​

First Principles ​

Dimensions of Data Quality ​

The Testing Pyramid for Data ​

dbt Tests ​

Built-In Tests ​

Custom dbt Tests ​

dbt Test Severity Levels ​

Great Expectations ​

Core Concepts ​

Implementing Custom Expectations ​

Data Contracts ​

Contract Definition ​

Contract Validation Engine ​

Anomaly Detection ​

Statistical Anomaly Detection ​

Row Count Monitoring ​

Performance Characteristics ​

Testing Overhead ​

Cost-Benefit of Data Quality ​

Edge Cases & Failure Modes ​

False Positives ​

The Observer Effect ​

Real-World War Stories ​

Decision Framework ​

Quality Check Priority ​

Advanced Topics ​

Data Quality Scoring ​

Data Observability ​

Cross-References ​

Related Pages

Data Quality Checks

Why Data Quality Checks Exist

Historical Context

First Principles

Dimensions of Data Quality

The Testing Pyramid for Data

dbt Tests

Built-In Tests

Custom dbt Tests

dbt Test Severity Levels

Great Expectations

Core Concepts

Implementing Custom Expectations

Data Contracts

Contract Definition

Contract Validation Engine

Anomaly Detection

Statistical Anomaly Detection

Row Count Monitoring

Performance Characteristics

Testing Overhead

Cost-Benefit of Data Quality

Edge Cases & Failure Modes

False Positives

The Observer Effect

Real-World War Stories

Decision Framework

Quality Check Priority

Advanced Topics

Data Quality Scoring

Data Observability

Cross-References