Skip to content
Unverified — AI-generated content. Help verify this page

Data Quality Checks

Why Data Quality Checks Exist

Bad data is worse than no data. No data causes a visible failure — dashboards show errors, models refuse to train, reports are empty. Bad data causes silent failures — dashboards show wrong numbers, models train on garbage, executives make decisions based on fiction.

Studies consistently show that data quality issues cost organizations 15-25% of revenue. The cost of fixing bad data increases exponentially the further downstream it gets:

Costfix(stage)=C0×10stage

Where stage 0 = source, stage 1 = pipeline, stage 2 = warehouse, stage 3 = report, stage 4 = business decision.

Historical Context

  • 2000s: Manual data quality checks in ETL tools (Informatica Data Quality)
  • 2010s: SQL-based testing (custom scripts, stored procedures)
  • 2018: Great Expectations launched — declarative data quality testing
  • 2019: dbt tests — data quality integrated into transformation workflows
  • 2022: Data contracts emerge — producer-consumer quality agreements
  • 2024-2025: ML-powered anomaly detection for data quality becomes mainstream

First Principles

Dimensions of Data Quality

DimensionDefinitionExample Check
AccuracyData reflects realityRevenue matches financial system
CompletenessRequired data is presentemail is not null for active users
ConsistencyData agrees across systemsOrder count in DW = count in source
TimelinessData arrives on scheduleTable updated within last 2 hours
UniquenessNo unintended duplicatesorder_id is unique in fact_orders
ValidityData conforms to rulesage is between 0 and 150

The Testing Pyramid for Data

dbt Tests

Built-In Tests

sql
-- schema.yml in dbt
version: 2

models:
  - name: fact_orders
    columns:
      - name: order_id
        tests:
          - unique
          - not_null

      - name: customer_id
        tests:
          - not_null
          - relationships:
              to: ref('dim_customer')
              field: customer_id

      - name: order_amount
        tests:
          - not_null
          - dbt_utils.accepted_range:
              min_value: 0
              max_value: 1000000
              inclusive: true

      - name: status
        tests:
          - accepted_values:
              values: ['pending', 'shipped', 'delivered', 'cancelled', 'returned']

Custom dbt Tests

sql
-- tests/assert_total_revenue_positive.sql
-- Custom test: total daily revenue should be positive

SELECT
    order_date,
    SUM(order_amount) as daily_revenue
FROM {​{ ref('fact_orders') }}
GROUP BY order_date
HAVING SUM(order_amount) <= 0
sql
-- tests/assert_no_future_orders.sql
-- No orders should have a date in the future

SELECT *
FROM {​{ ref('fact_orders') }}
WHERE order_date > CURRENT_DATE
sql
-- tests/assert_row_count_within_range.sql
-- Row count should be within expected range

WITH counts AS (
    SELECT COUNT(*) as row_count
    FROM {​{ ref('fact_orders') }}
    WHERE order_date = CURRENT_DATE - INTERVAL '1 day'
)
SELECT *
FROM counts
WHERE row_count < 1000     -- Minimum expected orders per day
   OR row_count > 1000000  -- Maximum expected orders per day

dbt Test Severity Levels

yaml
models:
  - name: fact_orders
    columns:
      - name: order_id
        tests:
          - unique:
              severity: error    # Fails the pipeline
          - not_null:
              severity: error    # Fails the pipeline

      - name: discount_code
        tests:
          - not_null:
              severity: warn     # Logs warning, doesn't fail
              warn_if: ">100"    # Warn if >100 nulls
              error_if: ">1000"  # Fail if >1000 nulls

Great Expectations

Core Concepts

typescript
interface Expectation {
  expectationType: string;
  kwargs: Record<string, unknown>;
  meta?: Record<string, unknown>;
}

interface ExpectationSuite {
  expectationSuiteName: string;
  expectations: Expectation[];
  meta: {
    greatExpectationsVersion: string;
  };
}

interface ValidationResult {
  success: boolean;
  expectationConfig: Expectation;
  result: {
    observedValue: unknown;
    elementCount: number;
    missingCount: number;
    missingPercent: number;
    unexpectedCount: number;
    unexpectedPercent: number;
    partialUnexpectedList: unknown[];
  };
}

// Example expectation suite for an orders table
const ordersSuite: ExpectationSuite = {
  expectationSuiteName: 'orders_quality_suite',
  expectations: [
    // Table-level expectations
    {
      expectationType: 'expect_table_row_count_to_be_between',
      kwargs: { min_value: 10000, max_value: 10000000 },
    },
    {
      expectationType: 'expect_table_column_count_to_equal',
      kwargs: { value: 12 },
    },

    // Column-level expectations
    {
      expectationType: 'expect_column_values_to_not_be_null',
      kwargs: { column: 'order_id' },
    },
    {
      expectationType: 'expect_column_values_to_be_unique',
      kwargs: { column: 'order_id' },
    },
    {
      expectationType: 'expect_column_values_to_be_between',
      kwargs: { column: 'order_amount', min_value: 0, max_value: 1000000 },
    },
    {
      expectationType: 'expect_column_values_to_be_in_set',
      kwargs: {
        column: 'status',
        value_set: ['pending', 'shipped', 'delivered', 'cancelled'],
      },
    },
    {
      expectationType: 'expect_column_mean_to_be_between',
      kwargs: { column: 'order_amount', min_value: 10, max_value: 500 },
    },

    // Distribution expectations
    {
      expectationType: 'expect_column_stdev_to_be_between',
      kwargs: { column: 'order_amount', min_value: 5, max_value: 200 },
    },
  ],
  meta: {
    greatExpectationsVersion: '0.18.0',
  },
};

Implementing Custom Expectations

typescript
interface CustomExpectation {
  name: string;
  validate(data: DataBatch, params: Record<string, unknown>): ValidationResult;
}

class ExpectColumnCorrelation implements CustomExpectation {
  name = 'expect_column_pair_correlation_to_be_between';

  validate(
    data: DataBatch,
    params: { column_a: string; column_b: string; min_value: number; max_value: number },
  ): ValidationResult {
    const valuesA = data.getColumn(params.column_a) as number[];
    const valuesB = data.getColumn(params.column_b) as number[];

    const correlation = this.pearsonCorrelation(valuesA, valuesB);

    return {
      success: correlation >= params.min_value && correlation <= params.max_value,
      expectationConfig: {
        expectationType: this.name,
        kwargs: params,
      },
      result: {
        observedValue: correlation,
        elementCount: valuesA.length,
        missingCount: 0,
        missingPercent: 0,
        unexpectedCount: correlation < params.min_value || correlation > params.max_value ? 1 : 0,
        unexpectedPercent: 0,
        partialUnexpectedList: [],
      },
    };
  }

  private pearsonCorrelation(x: number[], y: number[]): number {
    const n = x.length;
    const sumX = x.reduce((a, b) => a + b, 0);
    const sumY = y.reduce((a, b) => a + b, 0);
    const sumXY = x.reduce((acc, xi, i) => acc + xi * y[i], 0);
    const sumX2 = x.reduce((acc, xi) => acc + xi * xi, 0);
    const sumY2 = y.reduce((acc, yi) => acc + yi * yi, 0);

    const numerator = n * sumXY - sumX * sumY;
    const denominator = Math.sqrt(
      (n * sumX2 - sumX * sumX) * (n * sumY2 - sumY * sumY),
    );

    return denominator === 0 ? 0 : numerator / denominator;
  }
}

interface DataBatch {
  getColumn(name: string): unknown[];
  rowCount(): number;
}

Data Contracts

Contract Definition

typescript
interface DataContract {
  apiVersion: 'v1';
  kind: 'DataContract';
  metadata: {
    name: string;
    version: string;
    owner: string;
    team: string;
    description: string;
  };
  schema: SchemaDefinition;
  quality: QualityDefinition;
  sla: SLADefinition;
}

interface SchemaDefinition {
  type: 'avro' | 'protobuf' | 'json-schema' | 'sql-ddl';
  fields: Array<{
    name: string;
    type: string;
    description: string;
    required: boolean;
    pii: boolean;
    businessDefinition: string;
    examples: unknown[];
  }>;
}

interface QualityDefinition {
  completeness: Array<{
    field: string;
    threshold: number;  // 0.0 to 1.0
  }>;
  uniqueness: Array<{
    fields: string[];
  }>;
  freshness: {
    maxAgeMinutes: number;
    checkColumn: string;
  };
  customRules: Array<{
    name: string;
    sql: string;
    threshold: number;
  }>;
}

interface SLADefinition {
  availability: number;       // e.g., 0.999
  latency: {
    maxMinutes: number;
    p99Minutes: number;
  };
  support: {
    team: string;
    slackChannel: string;
    oncallRotation: string;
  };
}

// Example data contract
const ordersContract: DataContract = {
  apiVersion: 'v1',
  kind: 'DataContract',
  metadata: {
    name: 'orders',
    version: '2.1.0',
    owner: 'order-service-team',
    team: 'commerce',
    description: 'All customer orders from the commerce platform',
  },
  schema: {
    type: 'json-schema',
    fields: [
      {
        name: 'order_id',
        type: 'string',
        description: 'Unique order identifier (UUID v4)',
        required: true,
        pii: false,
        businessDefinition: 'System-generated unique ID for each order',
        examples: ['550e8400-e29b-41d4-a716-446655440000'],
      },
      {
        name: 'customer_email',
        type: 'string',
        description: 'Customer email address',
        required: true,
        pii: true,
        businessDefinition: 'Email used for order confirmation',
        examples: ['user@example.com'],
      },
    ],
  },
  quality: {
    completeness: [
      { field: 'order_id', threshold: 1.0 },
      { field: 'customer_email', threshold: 0.99 },
      { field: 'order_amount', threshold: 1.0 },
    ],
    uniqueness: [{ fields: ['order_id'] }],
    freshness: {
      maxAgeMinutes: 60,
      checkColumn: 'created_at',
    },
    customRules: [
      {
        name: 'positive_amount',
        sql: "SELECT COUNT(*) FROM orders WHERE order_amount <= 0",
        threshold: 0,
      },
      {
        name: 'valid_status',
        sql: "SELECT COUNT(*) FROM orders WHERE status NOT IN ('pending','shipped','delivered','cancelled')",
        threshold: 0,
      },
    ],
  },
  sla: {
    availability: 0.999,
    latency: { maxMinutes: 60, p99Minutes: 30 },
    support: {
      team: 'commerce-team',
      slackChannel: '#commerce-data',
      oncallRotation: 'commerce-oncall',
    },
  },
};

Contract Validation Engine

typescript
class DataContractValidator {
  async validate(
    contract: DataContract,
    db: Database,
    tableName: string,
  ): Promise<ContractValidationResult> {
    const results: CheckResult[] = [];

    // Schema checks
    for (const field of contract.schema.fields) {
      if (field.required) {
        const nullCount = await this.countNulls(db, tableName, field.name);
        results.push({
          check: `${field.name}_not_null`,
          passed: nullCount === 0,
          details: `${nullCount} null values found`,
        });
      }
    }

    // Completeness checks
    for (const rule of contract.quality.completeness) {
      const completeness = await this.measureCompleteness(
        db, tableName, rule.field,
      );
      results.push({
        check: `${rule.field}_completeness`,
        passed: completeness >= rule.threshold,
        details: `Completeness: ${(completeness * 100).toFixed(2)}% (threshold: ${(rule.threshold * 100).toFixed(2)}%)`,
      });
    }

    // Uniqueness checks
    for (const rule of contract.quality.uniqueness) {
      const duplicates = await this.countDuplicates(
        db, tableName, rule.fields,
      );
      results.push({
        check: `${rule.fields.join('_')}_unique`,
        passed: duplicates === 0,
        details: `${duplicates} duplicate combinations found`,
      });
    }

    // Freshness check
    const freshness = contract.quality.freshness;
    const maxAge = await this.measureFreshness(
      db, tableName, freshness.checkColumn,
    );
    results.push({
      check: 'freshness',
      passed: maxAge <= freshness.maxAgeMinutes,
      details: `Data age: ${maxAge} minutes (max: ${freshness.maxAgeMinutes})`,
    });

    // Custom rules
    for (const rule of contract.quality.customRules) {
      const count = await this.executeCustomRule(db, rule.sql);
      results.push({
        check: rule.name,
        passed: count <= rule.threshold,
        details: `Violations: ${count} (threshold: ${rule.threshold})`,
      });
    }

    return {
      contractName: contract.metadata.name,
      contractVersion: contract.metadata.version,
      timestamp: new Date(),
      allPassed: results.every((r) => r.passed),
      results,
    };
  }

  private async countNulls(
    db: Database, table: string, column: string,
  ): Promise<number> {
    const result = await db.query(
      `SELECT COUNT(*) as cnt FROM ${table} WHERE ${column} IS NULL`, [],
    );
    return (result as any).rows[0].cnt;
  }

  private async measureCompleteness(
    db: Database, table: string, column: string,
  ): Promise<number> {
    const result = await db.query(
      `SELECT
         COUNT(*) as total,
         COUNT(${column}) as non_null
       FROM ${table}`, [],
    );
    const row = (result as any).rows[0];
    return row.total > 0 ? row.non_null / row.total : 1;
  }

  private async countDuplicates(
    db: Database, table: string, columns: string[],
  ): Promise<number> {
    const cols = columns.join(', ');
    const result = await db.query(
      `SELECT COUNT(*) as cnt FROM (
         SELECT ${cols}, COUNT(*) as c FROM ${table}
         GROUP BY ${cols} HAVING COUNT(*) > 1
       ) dupes`, [],
    );
    return (result as any).rows[0].cnt;
  }

  private async measureFreshness(
    db: Database, table: string, column: string,
  ): Promise<number> {
    const result = await db.query(
      `SELECT EXTRACT(EPOCH FROM (NOW() - MAX(${column}))) / 60 as age_minutes
       FROM ${table}`, [],
    );
    return Math.round((result as any).rows[0].age_minutes);
  }

  private async executeCustomRule(db: Database, sql: string): Promise<number> {
    const result = await db.query(sql, []);
    return (result as any).rows[0]?.count ?? (result as any).rows.length;
  }
}

interface CheckResult {
  check: string;
  passed: boolean;
  details: string;
}

interface ContractValidationResult {
  contractName: string;
  contractVersion: string;
  timestamp: Date;
  allPassed: boolean;
  results: CheckResult[];
}

interface Database {
  query(sql: string, params: unknown[]): Promise<unknown>;
}

Anomaly Detection

Statistical Anomaly Detection

typescript
class StatisticalAnomalyDetector {
  private history: Map<string, number[]> = new Map();

  recordMetric(metricName: string, value: number): void {
    const values = this.history.get(metricName) ?? [];
    values.push(value);
    // Keep last 90 days
    if (values.length > 90) values.shift();
    this.history.set(metricName, values);
  }

  detectAnomaly(
    metricName: string,
    currentValue: number,
    sensitivitySigma: number = 3,
  ): AnomalyResult {
    const history = this.history.get(metricName);
    if (!history || history.length < 7) {
      return { isAnomaly: false, reason: 'Insufficient history' };
    }

    const mean = history.reduce((a, b) => a + b, 0) / history.length;
    const variance =
      history.reduce((acc, v) => acc + Math.pow(v - mean, 2), 0) /
      history.length;
    const stddev = Math.sqrt(variance);

    const zScore = stddev > 0 ? (currentValue - mean) / stddev : 0;

    return {
      isAnomaly: Math.abs(zScore) > sensitivitySigma,
      reason: Math.abs(zScore) > sensitivitySigma
        ? `z-score ${zScore.toFixed(2)} exceeds ${sensitivitySigma} sigma threshold`
        : 'Within normal range',
      details: {
        currentValue,
        mean,
        stddev,
        zScore,
        lowerBound: mean - sensitivitySigma * stddev,
        upperBound: mean + sensitivitySigma * stddev,
      },
    };
  }
}

interface AnomalyResult {
  isAnomaly: boolean;
  reason: string;
  details?: {
    currentValue: number;
    mean: number;
    stddev: number;
    zScore: number;
    lowerBound: number;
    upperBound: number;
  };
}

Row Count Monitoring

Expected row count(t)=μ+Asin(2πtP)

Where μ is the mean, A is the seasonal amplitude, and P is the period (typically 7 days for weekly seasonality).

Performance Characteristics

Testing Overhead

Test TypeOverheadFrequency
NOT NULLO(n) scanEvery load
UNIQUEO(n log n) sortEvery load
Row countO(1) metadataEvery load
Referential integrityO(n) joinEvery load
Distribution checksO(n) aggregateDaily
Cross-system reconciliationO(n) per systemDaily/Weekly

Cost-Benefit of Data Quality

ROIDQ=Cost of bad data preventedCost of quality checks

Typical ratios: 10:1 to 100:1. The cost of a bad dashboard decision far exceeds the compute cost of running quality checks.

Edge Cases & Failure Modes

False Positives

Overly strict quality checks block valid data:

  • Holiday traffic patterns trigger row count anomaly alerts
  • Legitimate new product categories fail accepted_values tests
  • Seasonal revenue fluctuations breach statistical bounds

Mitigation: Use dynamic thresholds based on historical patterns, not static rules.

The Observer Effect

Quality checks that run heavy queries can impact the production database:

typescript
class ThrottledQualityChecker {
  private readonly maxConcurrentChecks: number;
  private activeChecks: number = 0;

  constructor(maxConcurrent: number = 3) {
    this.maxConcurrentChecks = maxConcurrent;
  }

  async runCheck<T>(check: () => Promise<T>): Promise<T> {
    while (this.activeChecks >= this.maxConcurrentChecks) {
      await new Promise((r) => setTimeout(r, 1000));
    }
    this.activeChecks++;
    try {
      return await check();
    } finally {
      this.activeChecks--;
    }
  }
}

Real-World War Stories

War Story

The Silent Schema Change

A source team changed a column from VARCHAR(100) to VARCHAR(50) in their API response. No notification was sent to downstream consumers. The data pipeline silently truncated long values, causing 2,000 customer names to be cut off in the CRM.

The issue was discovered 2 weeks later when a customer complained that their name appeared as "Christopher Alexande" in all communications.

Fix: Added schema validation at ingestion: column types and lengths are checked against a contract. Any schema change triggers an alert before data is loaded.

War Story

The Duplicate Order Disaster

A payment service had an idempotency bug that caused 0.3% of orders to be duplicated. The data pipeline loaded these duplicates into the warehouse. Total revenue reported was $45M instead of $44.85M — a $150K discrepancy.

The error compounded: marketing budgets were set based on the inflated revenue, leading to overspending. The sales team received incorrect commissions. Month-end reconciliation caught it, but by then the damage was done.

Fix: Added a uniqueness check on (order_id) as a BLOCKING test — the pipeline fails if duplicates are detected. Added a daily reconciliation between the warehouse and the payment system's transaction log.

Decision Framework

Quality Check Priority

Check TypePriorityWhen to Implement
NOT NULL on PKsCriticalDay 1
UNIQUE on PKsCriticalDay 1
Row count boundsHighWeek 1
Referential integrityHighWeek 1
FreshnessHighWeek 1
Value range checksMediumMonth 1
Distribution checksMediumMonth 1
Cross-system reconciliationMediumQuarter 1
ML anomaly detectionLowYear 1

Advanced Topics

Data Quality Scoring

Assign a composite quality score to each dataset:

Quality Score=i=1nwi×si

Where wi is the weight and si is the score (0 or 1) for each check.

Data Observability

Beyond testing, continuous monitoring of data health:

typescript
interface DataObservabilityMetrics {
  freshness: Date;
  volume: { rowCount: number; byteSize: number };
  schema: { columnCount: number; typeHash: string };
  distribution: Record<string, { mean: number; stddev: number; nullPct: number }>;
  lineage: { upstreamTables: string[]; downstreamTables: string[] };
}

Cross-References


Key Takeaway

  • Data quality checks should be embedded in pipelines as gates between processing stages -- bad data caught early costs orders of magnitude less to fix than bad data caught downstream.
  • Use a layered approach: schema validation (shape), statistical checks (distribution), business rules (logic), and freshness checks (timeliness).
  • Data contracts formalize quality expectations between producers and consumers, shifting responsibility for data quality to the teams that produce it.
Exercise

Design Data Quality Gates for a Medallion Architecture

You are implementing Bronze/Silver/Gold layers for a customer analytics pipeline. Design data quality checks for each transition:

  1. Bronze -> Silver gate: What checks prevent bad data from entering the Silver layer?
  2. Silver -> Gold gate: What checks ensure business-ready data quality?
  3. Gold -> Dashboard gate: What checks prevent wrong numbers on executive dashboards?

For each gate, specify: the check type, threshold, and action on failure (block, warn, or quarantine).

Solution

Bronze -> Silver Gate:

  • Schema validation: all required columns present, correct types. Action: BLOCK (reject entire batch).
  • Null check: customer_id NOT NULL. Action: QUARANTINE (route null-ID records to DLQ, pass valid records).
  • Dedup: no duplicate customer_id per batch. Action: AUTO-FIX (keep most recent by updated_at).
  • Freshness: data timestamp within 24 hours of current time. Action: WARN (alert data engineering team).

Silver -> Gold Gate:

  • Referential integrity: all customer_id in fact table exist in dim_customer. Action: BLOCK (missing dimensions must be resolved first).
  • Statistical checks: row count within 2 standard deviations of 30-day rolling average. Action: WARN (alert if volume anomaly).
  • Business rules: lifetime_value >= 0, signup_date <= current_date. Action: QUARANTINE (route violations to review).
  • Completeness: key metrics (revenue, count) not null for any required segment. Action: BLOCK.

Gold -> Dashboard Gate:

  • Cross-check: total revenue matches finance system to within 0.1%. Action: BLOCK (do not publish to dashboard until reconciled).
  • Trend check: no metric changes more than 50% day-over-day without explanation. Action: WARN (flag for manual review).
  • Timeliness: Gold tables refreshed within SLA (by 8 AM UTC). Action: ALERT (escalate to PagerDuty if SLA breached).

Common Misconceptions

  • "Data quality is the data engineering team's responsibility." Data quality is a shared responsibility. Producers should validate at the source (data contracts), engineering should validate in the pipeline, and consumers should validate at the point of use.
  • "100% data quality is achievable." Real-world data is messy. The goal is to detect and handle issues, not prevent them entirely. Design for graceful degradation, not perfection.
  • "Schema validation is sufficient for data quality." Data can have the right schema but wrong values (negative ages, future dates, impossible combinations). Schema validation checks shape; you also need value-level and statistical checks.
  • "Data quality checks should block the pipeline on any failure." Minor quality issues (0.1% null rate) should warn, not block. Blocking thresholds should be calibrated to avoid false positives that delay critical pipelines.
  • "Great Expectations and dbt tests solve the same problem." dbt tests run inside the warehouse on materialized data. Great Expectations runs anywhere in the pipeline (before load, during transform, in CI/CD). They are complementary.

In Production

  • Airbnb uses a data quality score (0-100) for every dataset, computed from completeness, accuracy, freshness, and consistency checks. Datasets below 80 are flagged for remediation.
  • Uber implements data contracts between producer and consumer teams: the producing team defines and enforces quality SLAs, and any breaking change requires consumer team sign-off.
  • Netflix runs statistical anomaly detection on key metrics (viewing hours, signup counts) that automatically flags and quarantines data batches that deviate significantly from historical patterns.
  • LinkedIn built their own data quality framework that runs checks at every layer of their medallion architecture, with automated escalation paths based on severity and business impact.
Quiz

1. What is the cost multiplier for fixing data quality issues at each stage?

A) Cost is the same at every stage B) Roughly 10x per stage downstream -- fixing at source costs $1, at ingestion costs $10, at warehouse costs $100, at dashboard costs $1000 C) Cost decreases as data moves downstream D) Cost only matters at the final stage

Answer

B) Fixing bad data at the source is cheapest. Once bad data flows into the warehouse, it contaminates downstream tables. By the time it reaches dashboards and decisions, the cost includes reprocessing, retracting reports, and restoring trust.

2. What is a data contract?

A) A legal agreement about data ownership B) A formal specification between data producer and consumer defining schema, quality standards, SLAs, and change management procedures C) A database access control policy D) A data encryption standard

Answer

B) A data contract defines what the producer promises (schema, freshness, completeness) and what the consumer expects. It includes SLAs (99.9% completeness), change management (30-day notice for breaking changes), and enforcement (automated validation).

3. When should a data quality check BLOCK the pipeline vs WARN?

A) Always block -- data quality is critical B) Block when the issue affects data correctness for ALL downstream consumers; warn when the issue is minor or affects only some consumers C) Never block -- warnings are sufficient D) Block only in production, warn in development

Answer

B) Blocking thresholds should reflect business impact. A null primary key (data correctness issue) should block. A 0.1% increase in null optional fields should warn. Calibrate thresholds to avoid false positives that delay critical pipelines.

4. What is statistical anomaly detection in data quality?

A) Using machine learning to clean data B) Comparing current data metrics (row count, null rate, value distribution) against historical baselines and flagging significant deviations C) Running statistical tests on SQL queries D) Detecting duplicate records

Answer

B) Statistical checks compare current metrics against rolling averages (e.g., 30-day). If today's row count is 3 standard deviations below the average, it is flagged as an anomaly. This catches volume drops, missing sources, and upstream failures automatically.

5. How do Great Expectations and dbt tests complement each other?

A) They are redundant and you should choose one B) Great Expectations validates data anywhere in the pipeline (before load, in Python, in CI/CD); dbt tests validate materialized data inside the warehouse C) Great Expectations is for streaming; dbt tests are for batch D) Great Expectations replaces dbt tests entirely

Answer

B) Great Expectations can validate data in Python scripts, Spark jobs, or CI/CD pipelines before data enters the warehouse. dbt tests validate data after it is materialized in the warehouse (unique, not_null, relationships). Using both provides defense in depth.

:::


One-Liner Summary: Embed data quality checks as gates between pipeline layers -- catch issues at ingestion for $1 instead of at the executive dashboard for $1000.

"What I cannot create, I do not understand." — Richard Feynman