Data Lakehouse

The Problem That Created Lakehouses

For two decades, the data world lived with a painful split. You had two systems, each good at one thing and terrible at the other.

Data warehouses (Teradata, Redshift, Snowflake, BigQuery) gave you fast SQL queries, schema enforcement, ACID transactions, and governance. But they were expensive, locked to proprietary formats, and terrible at handling unstructured data (logs, images, JSON blobs, ML feature vectors).

Data lakes (HDFS, S3, GCS, ADLS) gave you cheap, scalable storage for any data format. But they had no transactions, no schema enforcement, no consistency guarantees, and querying them was slow and painful. Data lakes became "data swamps" — petabytes of files nobody trusted.

Most organizations ran both. They landed raw data in the lake, transformed it, then loaded it into the warehouse. This two-system architecture created a cascade of problems:

Source Systems → Data Lake (raw) → ETL → Data Warehouse (curated)
                     ↓                          ↓
               ML / Data Science          BI / Analytics

The problems with this split:

Data duplication — the same data exists in two places, doubling storage costs
Staleness — by the time data moves from lake to warehouse, it is already stale
Governance nightmare — two systems means two sets of access controls, two audit logs
No single source of truth — analysts query the warehouse, ML engineers query the lake, and they get different answers
ETL tax — every byte of data must be transformed and moved between systems, creating fragile pipelines

The data lakehouse exists to eliminate this split.

What Is a Data Lakehouse?

A data lakehouse combines the cheap, scalable storage of a data lake with the management features of a data warehouse — ACID transactions, schema enforcement, governance, and fast SQL queries — all on a single copy of the data stored in open formats.

The key insight: you do not need a separate warehouse if you add warehouse-like capabilities directly to the lake. Open table formats (Delta Lake, Iceberg, Hudi) provide the transaction layer, metadata management, and query optimization that lakes previously lacked.

Architecture of a Lakehouse

A lakehouse has four layers:

Layer	Purpose	Technologies
Storage	Cheap, durable, scalable object storage	S3, GCS, ADLS, MinIO
Table Format	ACID transactions, schema enforcement, time travel	Delta Lake, Apache Iceberg, Apache Hudi
Query Engine	Fast SQL and programmatic access	Spark, Trino, DuckDB, Datafusion
Catalog	Metadata management, discovery, governance	Unity Catalog, Hive Metastore, Nessie, Polaris

Storage Layer

Object storage (S3, GCS, ADLS) is the foundation. It costs roughly $0.02/GB/month, scales to exabytes, and stores any file format. The lakehouse treats object storage as its disk, storing data in columnar formats like Parquet or ORC.

Table Format Layer

This is the innovation that makes lakehouses possible. An open table format adds a metadata layer on top of Parquet files that provides:

ACID transactions — concurrent reads and writes don't corrupt data
Schema enforcement — writes that violate the schema are rejected
Time travel — query data as it existed at any point in the past
Partition evolution — change partitioning without rewriting data
File-level statistics — skip irrelevant files during queries (data skipping)

The Table Format Is Just Metadata

A common misconception is that Delta Lake, Iceberg, or Hudi are new file formats. They are not. Your data is still stored as Parquet files. The table format adds a metadata layer (JSON files, Avro manifests, or a transaction log) that tracks which Parquet files belong to the table, their schema, statistics, and transaction history.

See Open Table Formats for a deep comparison of Delta Lake, Iceberg, and Hudi.

Query Engine Layer

Multiple engines can read the same lakehouse table concurrently because the table format is open. Spark processes batch ETL, Trino runs interactive SQL, DuckDB handles local analytics, and ML frameworks read directly from the same Parquet files.

See Query Engines for a comparison of Spark, Trino, DuckDB, and Datafusion.

Catalog Layer

The catalog is the registry of all tables — their schemas, locations, access policies, and lineage. Without a catalog, the lakehouse is just a pile of files. Modern catalogs like Unity Catalog and Project Nessie add Git-like branching, fine-grained access control, and cross-engine interoperability.

Data Lake vs Data Warehouse vs Data Lakehouse

Dimension	Data Lake	Data Warehouse	Data Lakehouse
Storage cost	Low ($0.02/GB/mo)	High ($12-25/TB/mo)	Low ($0.02/GB/mo)
Data types	Structured, semi-structured, unstructured	Structured only	All types
ACID transactions	No	Yes	Yes
Schema enforcement	Schema-on-read (none)	Schema-on-write (strict)	Both supported
Query performance	Poor without optimization	Excellent	Good to excellent
Time travel	No	Limited	Yes (full history)
Governance	Weak	Strong	Strong
ML support	Native (direct file access)	Poor (data must be exported)	Native
Open format	Yes (Parquet, ORC)	No (proprietary)	Yes (open table formats)
Concurrent writes	Unsafe	Safe	Safe
Real-time ingestion	Append only, no consistency	Near real-time	Streaming + ACID
Vendor lock-in	Low	High	Low

Data Lakes Are Not Dead

The lakehouse does not replace the raw data lake — it sits on top of it. You still land raw data into object storage. The lakehouse adds structure to that raw data through table formats. Think of it as a "managed layer" over your existing lake, not a replacement for it.

Why Lakehouses Emerged

Several technological shifts converged around 2019-2021:

Object storage became the universal substrate — every cloud provider offered cheap, durable object storage with S3-compatible APIs
Parquet became the de facto columnar format — high compression, columnar pruning, and universal reader support
Open table formats matured — Delta Lake (Databricks), Iceberg (Netflix/Apple), and Hudi (Uber) proved ACID on object storage was possible and performant
Separation of compute and storage — decoupled pricing and scaling, making multi-engine access economical
ML demanded direct data access — ML frameworks (PyTorch, TensorFlow) needed to read training data directly from storage, not through a SQL warehouse

The tipping point was the realization that the "warehouse features" people wanted (transactions, schema enforcement, time travel, governance) were properties of the metadata layer, not the storage engine. By moving those features into an open table format, any engine could provide warehouse-like guarantees on lake-priced storage.

Open Table Formats Overview

Three open table formats dominate the lakehouse ecosystem:

Feature	Delta Lake	Apache Iceberg	Apache Hudi
Origin	Databricks (2019)	Netflix (2017)	Uber (2016)
Transaction log	JSON-based log in `_delta_log/`	Avro manifest files with snapshot metadata	Timeline-based in `.hoodie/`
Primary ecosystem	Databricks, Spark	Multi-engine (Spark, Trino, Flink, Dremio)	Spark, Flink
Partition evolution	Requires rewrite	In-place (hidden partitioning)	Limited
Catalog	Unity Catalog	Nessie, Polaris, Hive	Hive
Adoption trend	Strong in Databricks shops	Fastest growing overall	Strong in streaming use cases

Which Format Should You Choose?

If you are a Databricks shop, Delta Lake is the path of least resistance. If you want maximum engine flexibility and are building a multi-engine architecture, Iceberg is the safest bet in 2026. If your primary use case is streaming upserts and CDC, Hudi has the most mature support. See Open Table Formats for the full deep-dive comparison.

Lakehouse Reference Architecture

A production lakehouse typically follows the Medallion Architecture (Bronze/Silver/Gold layers), layered on top of the storage and format layers:

When a Lakehouse Is the Right Choice

Use a lakehouse when:

You have both structured and unstructured data
Multiple teams (analytics, ML, engineering) need to access the same data
You want to avoid vendor lock-in with open formats
You need time travel and audit capabilities
Cost control on storage is important
You are building a multi-engine architecture (Spark + Trino + DuckDB)

Use a traditional warehouse when:

Your data is exclusively structured and SQL is the only access pattern
You need sub-second query latency for dashboards (managed warehouses like Snowflake/BigQuery are still faster for interactive BI)
You have a small data team that wants managed infrastructure
Total data volume is under 10 TB (the complexity overhead of a lakehouse may not be justified)

Use a raw data lake when:

You are only storing data for archival purposes
You do not need query capabilities (data is consumed by downstream pipelines)
Cost is the only consideration

Section Contents

Page	What You Will Learn
Open Table Formats	Delta Lake, Iceberg, Hudi — ACID transactions, time travel, schema evolution, performance features
Medallion Architecture	Bronze/Silver/Gold layers, data quality gates, implementation patterns
Query Engines	Spark, Trino, DuckDB, Datafusion — federated query, pushdown, performance comparison

Data Lakehouse ​

The Problem That Created Lakehouses ​

What Is a Data Lakehouse? ​

Architecture of a Lakehouse ​

Storage Layer ​

Table Format Layer ​

Query Engine Layer ​

Catalog Layer ​

Data Lake vs Data Warehouse vs Data Lakehouse ​

Why Lakehouses Emerged ​

Open Table Formats Overview ​

Lakehouse Reference Architecture ​

When a Lakehouse Is the Right Choice ​

Section Contents ​

Further Reading ​

Related Pages