Medallion Architecture

What Is Medallion Architecture?

Medallion architecture organizes data in a lakehouse into three layers of increasing quality and business value: Bronze (raw), Silver (cleaned), and Gold (business-ready). Each layer applies progressively stricter data quality transformations, creating a clear lineage from raw source data to trustworthy analytics.

The metaphor is intentional: bronze is unrefined, silver is processed, gold is the finished product. Data flows in one direction — from Bronze to Silver to Gold — and each transition enforces data quality gates that prevent bad data from propagating downstream.

Why Medallion Architecture Exists

Without a layered approach, data pipelines become tangled. Raw data feeds directly into dashboards. Analysts write complex queries that handle data cleaning, deduplication, and business logic all at once. When something breaks, nobody knows if the issue is in the source data, the cleaning logic, or the business rules.

Medallion architecture provides:

Debuggability — Bronze retains the raw data exactly as it arrived, so you can always go back and understand what happened
Reprocessability — if Silver or Gold logic changes, you reprocess from Bronze without re-ingesting from sources
Separation of concerns — ingestion, cleaning, and business logic are separate, independently testable pipelines
Progressive quality — each layer guarantees a higher level of data quality
Multiple consumers — ML teams read from Silver (they want clean but flexible data), BI teams read from Gold (they want pre-aggregated, fast data)

Bronze Layer: Raw Ingestion

Purpose

Bronze is the immutable landing zone for all source data. Data arrives in its original format and structure, with minimal transformation. The only guarantees at the Bronze level are: the data was received, it is stored durably, and ingestion metadata (timestamp, source, batch ID) is attached.

Design Principles

Principle	Rationale
Append-only	Never update or delete Bronze data — it is the system of record
Schema-on-read	Accept any schema; enforce nothing (use `STRING` or `VARIANT` for unknown structures)
Add metadata	Every record gets `_ingested_at`, `_source_system`, `_batch_id`
Partition by ingestion time	Not by business time (you don't trust it yet)
Retain everything	Even "bad" records — they are evidence for debugging

Implementation

python

from pyspark.sql import functions as F
from datetime import datetime

# Bronze ingestion: CDC events from Kafka
raw_df = (
    spark.readStream
    .format("kafka")
    .option("kafka.bootstrap.servers", "broker:9092")
    .option("subscribe", "orders.cdc")
    .option("startingOffsets", "earliest")
    .load()
)

# Minimal transformation: parse Kafka envelope, add metadata
bronze_df = (
    raw_df
    .select(
        F.col("key").cast("string").alias("_kafka_key"),
        F.col("value").cast("string").alias("raw_payload"),
        F.col("topic").alias("_source_topic"),
        F.col("partition").alias("_kafka_partition"),
        F.col("offset").alias("_kafka_offset"),
        F.col("timestamp").alias("_kafka_timestamp"),
        F.lit(datetime.utcnow().isoformat()).alias("_ingested_at"),
        F.lit("orders-cdc-pipeline").alias("_pipeline_name"),
    )
)

# Write to Bronze Delta table, partitioned by ingestion date
(
    bronze_df.writeStream
    .format("delta")
    .outputMode("append")
    .option("checkpointLocation", "s3://lake/checkpoints/orders_bronze")
    .partitionBy("_ingested_at_date")
    .toTable("bronze.orders_raw")
)

Store the Raw Payload as a String

It is tempting to parse JSON into typed columns at the Bronze layer. Do not. Store the raw payload as a STRING or BINARY column. Schema inference at ingestion time is fragile — if the source schema changes, your Bronze ingestion breaks. Parse it in Silver where you control the schema evolution.

Bronze Anti-Patterns

Anti-Pattern	Why It Is Wrong	What to Do Instead
Parsing/transforming at Bronze	Couples ingestion to source schema	Store raw payload, parse in Silver
Deduplicating at Bronze	You lose evidence of duplicates	Deduplicate in Silver
Filtering "bad" records	You lose debugging information	Keep everything, flag bad records in Silver
Partitioning by business date	Source timestamps may be wrong	Partition by ingestion time
Overwriting Bronze data	Destroys audit trail	Append only

Silver Layer: Cleaned and Conformed

Purpose

Silver is where raw data becomes trustworthy. Transformations in the Silver layer focus on structural quality: parsing, deduplication, type casting, null handling, and schema conformance. After the Silver layer, every record has a consistent schema and passes basic quality checks.

Design Principles

Principle	Rationale
Schema-on-write	Enforce a typed schema — reject or quarantine records that don't conform
Deduplicate	Use record keys + event timestamps to keep only the latest version of each record
Conform	Standardize naming conventions, time zones, and data types across sources
Track lineage	Preserve `_bronze_id` or `_source_file` so every Silver record traces back to Bronze
Quality gates	Run checks after each batch; quarantine rows that fail

Quality Gate 1: Bronze to Silver

Implementation

python

from pyspark.sql import functions as F
from pyspark.sql.types import StructType, StructField, StringType, LongType, DoubleType, TimestampType
from delta.tables import DeltaTable

# Define expected Silver schema
silver_schema = StructType([
    StructField("order_id", LongType(), False),
    StructField("customer_id", LongType(), False),
    StructField("product_id", LongType(), False),
    StructField("quantity", LongType(), False),
    StructField("amount", DoubleType(), False),
    StructField("currency", StringType(), False),
    StructField("order_status", StringType(), False),
    StructField("ordered_at", TimestampType(), False),
    StructField("updated_at", TimestampType(), False),
])

# Read incremental Bronze data (only new records since last run)
bronze_incremental = (
    spark.readStream
    .format("delta")
    .option("readChangeFeed", "true")
    .option("startingVersion", last_processed_version)
    .table("bronze.orders_raw")
)

# Parse raw payload
parsed_df = (
    bronze_incremental
    .select(
        F.from_json(F.col("raw_payload"), silver_schema).alias("data"),
        F.col("_kafka_offset").alias("_bronze_offset"),
        F.col("_ingested_at").alias("_bronze_ingested_at"),
    )
    .select("data.*", "_bronze_offset", "_bronze_ingested_at")
)

# Validation: flag bad records
validated_df = parsed_df.withColumn(
    "_is_valid",
    (F.col("order_id").isNotNull()) &
    (F.col("amount") > 0) &
    (F.col("currency").isin("USD", "EUR", "GBP"))
)

# Route: valid records to Silver, invalid to quarantine
valid_df = validated_df.filter(F.col("_is_valid"))
quarantine_df = validated_df.filter(~F.col("_is_valid"))

# Upsert into Silver (merge on primary key)
def upsert_to_silver(batch_df, batch_id):
    silver_table = DeltaTable.forName(spark, "silver.orders")
    (
        silver_table.alias("target")
        .merge(
            batch_df.alias("source"),
            "target.order_id = source.order_id"
        )
        .whenMatchedUpdate(
            condition="source.updated_at > target.updated_at",
            set={"*": "source.*"}
        )
        .whenNotMatchedInsertAll()
        .execute()
    )

valid_df.writeStream \
    .foreachBatch(upsert_to_silver) \
    .option("checkpointLocation", "s3://lake/checkpoints/orders_silver") \
    .start()

Quarantine Table

Records that fail validation are not discarded. They are written to a quarantine table with the failure reason, enabling data engineers to investigate and fix upstream issues.

python

quarantine_with_reason = quarantine_df.withColumn(
    "_failure_reason",
    F.when(F.col("order_id").isNull(), "missing_order_id")
     .when(F.col("amount") <= 0, "invalid_amount")
     .when(~F.col("currency").isin("USD", "EUR", "GBP"), "unsupported_currency")
     .otherwise("unknown")
)

quarantine_with_reason.writeStream \
    .format("delta") \
    .outputMode("append") \
    .option("checkpointLocation", "s3://lake/checkpoints/orders_quarantine") \
    .toTable("quarantine.orders")

The Quarantine Table Is Not Optional

In production, 1-5% of records fail validation on any given day. Without a quarantine table, these records silently disappear. You only discover the issue weeks later when a finance report does not reconcile. Always implement quarantine.

Gold Layer: Business-Ready

Purpose

Gold tables are curated, business-logic-enriched datasets optimized for specific consumption patterns. They are pre-aggregated, pre-joined, and denormalized for fast queries. A Gold table answers a specific business question directly — no further joins or transformations needed.

Design Principles

Principle	Rationale
Business-aligned	Each Gold table serves a specific business domain or dashboard
Pre-aggregated	Metrics are pre-computed (daily revenue, monthly active users)
Denormalized	Join dimension tables into facts — consumers should not need multi-table joins
SLA-governed	Gold tables have freshness SLAs (e.g., "updated within 1 hour of source change")
Access-controlled	Fine-grained permissions — not everyone sees all Gold tables

Quality Gate 2: Silver to Gold

python

# Gold table: daily order summary by product category
from pyspark.sql import functions as F

# Read from Silver
orders = spark.table("silver.orders")
products = spark.table("silver.products")
customers = spark.table("silver.customers")

# Business logic: join, aggregate, enrich
daily_order_summary = (
    orders
    .join(products, "product_id")
    .join(customers, "customer_id")
    .groupBy(
        F.date_trunc("day", "ordered_at").alias("order_date"),
        products.category.alias("product_category"),
        customers.region.alias("customer_region"),
    )
    .agg(
        F.count("order_id").alias("total_orders"),
        F.sum("amount").alias("total_revenue"),
        F.avg("amount").alias("avg_order_value"),
        F.countDistinct("customer_id").alias("unique_customers"),
    )
    .withColumn("_computed_at", F.current_timestamp())
)

# Overwrite partition (idempotent)
(
    daily_order_summary.write
    .format("delta")
    .mode("overwrite")
    .option("replaceWhere", f"order_date = '{target_date}'")
    .saveAsTable("gold.daily_order_summary")
)

Gold Table Patterns

Pattern	Description	Example
Aggregated metrics	Pre-computed measures at a fixed granularity	Daily revenue by region
Wide denormalized	Fact + all dimensions joined into a single table	Order with customer, product, and shipping details
Feature store	ML features pre-computed and versioned	User purchase frequency, avg session duration
Snapshot	Point-in-time state of an entity	Customer profile as of end of day
SCD (Slowly Changing Dimension)	Historical tracking of dimension changes	Customer address history with valid_from/valid_to

Data Quality Framework

Automated Checks at Each Layer

Great Expectations Integration

python

import great_expectations as gx

context = gx.get_context()

# Define expectation suite for Silver orders
suite = context.add_expectation_suite("silver_orders_quality")

suite.add_expectation(
    gx.expectations.ExpectColumnValuesToNotBeNull(column="order_id")
)
suite.add_expectation(
    gx.expectations.ExpectColumnValuesToBeUnique(column="order_id")
)
suite.add_expectation(
    gx.expectations.ExpectColumnValuesToBeBetween(
        column="amount", min_value=0.01, max_value=1000000
    )
)
suite.add_expectation(
    gx.expectations.ExpectColumnValuesToBeInSet(
        column="currency", value_set=["USD", "EUR", "GBP"]
    )
)

# Run validation
results = context.run_validation(
    batch=silver_batch,
    expectation_suite=suite
)

if not results.success:
    # Alert and halt pipeline
    raise DataQualityError(f"Silver validation failed: {results.statistics}")

Naming Conventions

Consistent naming is critical when you have hundreds of tables. A common convention:

{layer}.{domain}_{entity}_{qualifier}

Examples:
  bronze.orders_raw
  bronze.payments_raw
  silver.orders                 (cleaned, deduplicated)
  silver.customers              (conformed from multiple sources)
  gold.daily_order_summary      (aggregated metric)
  gold.customer_360             (wide denormalized)
  quarantine.orders             (failed validation)