Skip to content

Real-Time Analytics Stack

Traditional analytics relies on batch ETL — events land in a data lake overnight, get transformed, and appear in dashboards the next morning. Real-time analytics eliminates this delay. Events flow from your application through a streaming layer, into a purpose-built OLAP engine, and are queryable within seconds. The dashboard shows what is happening now, not what happened yesterday.

This page covers the architecture of a real-time analytics stack, OLAP engine design principles, stream-to-OLAP ingestion patterns, and materialized views. For head-to-head engine comparisons, see the companion ClickHouse vs Druid vs Pinot page.


Why Real-Time Analytics?

Use CaseBatch LatencyReal-Time LatencyBusiness Impact of Delay
Fraud detection dashboardsHoursSecondsMissed fraud = financial loss
Ad campaign performanceNext dayMinutesWasted ad spend
Infrastructure monitoring5-15 minSecondsMissed outages
E-commerce conversion funnelsHoursSecondsCannot react to drop in conversions
Gaming leaderboardsMinutesSub-secondBad player experience
IoT fleet monitoringHoursSecondsEquipment failure
Social media trending topics30 minSecondsMissed trending content

TIP

Real-time analytics does not replace batch analytics — it complements it. Batch pipelines handle complex joins, historical reprocessing, and data quality. Real-time handles operational dashboards, alerting, and time-sensitive decisions. Most mature data platforms run both.


Architecture Overview

The Three Ingestion Patterns

PatternFlowWhen to Use
Direct ingestionKafka → OLAP engineEngine has native Kafka connector (ClickHouse, Pinot, Druid)
Stream processing firstKafka → Flink → OLAPNeed enrichment, joins, or aggregation before storage
Dual writeApp writes to both OLTP and OLAPSimple setups, low volume (not recommended at scale)

DANGER

Never dual-write from your application to both OLTP and OLAP. This creates consistency issues — if one write succeeds and the other fails, your analytics diverge from your source of truth. Use CDC (Change Data Capture) from the OLTP database or write to Kafka and fan out from there.


OLAP Engine Design Principles

Real-time OLAP engines are fundamentally different from row-oriented OLTP databases. Understanding these design choices explains why ClickHouse can scan billions of rows in seconds.

Column-Oriented Storage

PropertyRow StoreColumn Store
Best forPoint lookups, CRUDAnalytical queries (aggregation)
Scan efficiencyReads all columns even if query needs oneReads only queried columns
CompressionLow (mixed types per row)High (same type per column, 10-20x)
Insert patternSingle-row insertsBatch inserts (append-only)
Update/DeleteFast (in-place)Slow or unsupported (immutable segments)
Query patternWHERE id = 123SELECT count(*), avg(amount) WHERE date > X

Key OLAP Design Choices


Stream-to-OLAP Pipeline

Direct Kafka Ingestion

Most OLAP engines can consume directly from Kafka:

sql
-- ClickHouse: Kafka engine table
CREATE TABLE events_queue (
    event_id String,
    user_id UInt64,
    event_type LowCardinality(String),
    properties String,          -- JSON
    timestamp DateTime64(3)
) ENGINE = Kafka
SETTINGS
    kafka_broker_list = 'kafka:9092',
    kafka_topic_list = 'events',
    kafka_group_name = 'clickhouse_consumer',
    kafka_format = 'JSONEachRow',
    kafka_num_consumers = 4;

-- Materialized view to transform and insert into final table
CREATE MATERIALIZED VIEW events_mv TO events AS
SELECT
    event_id,
    user_id,
    event_type,
    JSONExtractString(properties, 'page') AS page,
    JSONExtractFloat(properties, 'duration') AS duration_ms,
    timestamp,
    toDate(timestamp) AS event_date
FROM events_queue;

-- Final analytics table
CREATE TABLE events (
    event_id String,
    user_id UInt64,
    event_type LowCardinality(String),
    page LowCardinality(String),
    duration_ms Float32,
    timestamp DateTime64(3),
    event_date Date
) ENGINE = MergeTree()
PARTITION BY event_date
ORDER BY (event_type, user_id, timestamp)
TTL event_date + INTERVAL 90 DAY;

When you need to join streaming events with reference data before loading:

python
# Flink SQL: enrich events with user data before loading into OLAP
"""
CREATE TABLE events_raw (
    event_id STRING,
    user_id BIGINT,
    event_type STRING,
    properties STRING,
    event_time TIMESTAMP(3),
    WATERMARK FOR event_time AS event_time - INTERVAL '5' SECOND
) WITH (
    'connector' = 'kafka',
    'topic' = 'events',
    'properties.bootstrap.servers' = 'kafka:9092',
    'format' = 'json'
);

CREATE TABLE users_dim (
    user_id BIGINT,
    country STRING,
    plan_type STRING,
    signup_date DATE
) WITH (
    'connector' = 'jdbc',
    'url' = 'jdbc:postgresql://pg:5432/app',
    'table-name' = 'users',
    'lookup.cache.ttl' = '1 hour'
);

-- Enriched output to ClickHouse
CREATE TABLE events_enriched (
    event_id STRING,
    user_id BIGINT,
    event_type STRING,
    country STRING,
    plan_type STRING,
    event_time TIMESTAMP(3)
) WITH (
    'connector' = 'clickhouse',
    'url' = 'clickhouse://ch:8123',
    'table-name' = 'events_enriched'
);

INSERT INTO events_enriched
SELECT
    e.event_id,
    e.user_id,
    e.event_type,
    u.country,
    u.plan_type,
    e.event_time
FROM events_raw e
JOIN users_dim FOR SYSTEM_TIME AS OF e.event_time AS u
    ON e.user_id = u.user_id;
"""

Materialized Views for Real-Time Aggregation

Materialized views pre-compute aggregations at ingestion time, making queries instant instead of scanning raw data.

The Aggregation Pipeline

ClickHouse Materialized View Example

sql
-- Raw events table
CREATE TABLE page_views (
    timestamp DateTime64(3),
    user_id UInt64,
    page LowCardinality(String),
    country LowCardinality(String),
    device LowCardinality(String),
    duration_ms UInt32
) ENGINE = MergeTree()
PARTITION BY toDate(timestamp)
ORDER BY (page, country, timestamp)
TTL toDate(timestamp) + INTERVAL 7 DAY;

-- Pre-aggregated per-minute rollup
CREATE MATERIALIZED VIEW page_views_per_minute
ENGINE = SummingMergeTree()
PARTITION BY toDate(minute)
ORDER BY (page, country, device, minute)
AS SELECT
    toStartOfMinute(timestamp) AS minute,
    page,
    country,
    device,
    count() AS view_count,
    uniqHLL12(user_id) AS unique_users,
    sum(duration_ms) AS total_duration_ms,
    avg(duration_ms) AS avg_duration_ms
FROM page_views
GROUP BY minute, page, country, device;

-- Query the rollup instead of raw data
-- This scans thousands of rows instead of billions
SELECT
    page,
    sum(view_count) AS views,
    uniqMerge(unique_users) AS uniques,
    sum(total_duration_ms) / sum(view_count) AS avg_duration
FROM page_views_per_minute
WHERE minute >= now() - INTERVAL 1 HOUR
GROUP BY page
ORDER BY views DESC
LIMIT 20;

TIP

The key insight with materialized views is that you trade storage for query speed. A 5-minute rollup reduces data volume by 300x (from per-event to per-5-minute-bucket), making dashboard queries instant. Store both raw and rollup tables — raw for drill-down, rollups for dashboards.


Query Patterns for Real-Time Dashboards

Common Query Types

Query TypeExampleOptimization
Time-series aggregationPage views per minute, last 1hMaterialized view, time-sorted partitions
Top-NTop 10 pages by viewsPre-aggregated rollup + ORDER BY LIMIT
Distinct countUnique users in last 24hHyperLogLog (approximate, O(1) merge)
PercentilesP95 latency per endpointt-digest or quantile sketches
Funnel analysisSignup → Activation → PurchaseWindow functions on user event sequences
RetentionDay-1, Day-7, Day-30 retentionBitmap intersection across date cohorts

Approximate Algorithms

Exact aggregations over billions of rows are slow. OLAP engines use probabilistic data structures for speed:

AlgorithmUse CaseErrorSpeed vs Exact
HyperLogLogCOUNT DISTINCT~2%1000x faster
t-digestPercentiles (P50, P95, P99)~1%100x faster
Count-Min SketchFrequency estimationOverestimates by epsilon10x faster
Theta SketchSet operations (union, intersection)~2%100x faster
Bloom FilterMembership testingFalse positives onlyConstant time
sql
-- ClickHouse: HyperLogLog for approximate unique users
SELECT
    toStartOfHour(timestamp) AS hour,
    uniqHLL12(user_id) AS approx_unique_users
FROM events
WHERE timestamp >= now() - INTERVAL 24 HOUR
GROUP BY hour
ORDER BY hour;

-- Exact count for comparison (much slower)
SELECT
    toStartOfHour(timestamp) AS hour,
    uniqExact(user_id) AS exact_unique_users
FROM events
WHERE timestamp >= now() - INTERVAL 24 HOUR
GROUP BY hour
ORDER BY hour;

Lambda vs Kappa Architecture

Two competing approaches for combining batch and real-time:

Lambda Architecture

Kappa Architecture

AspectLambdaKappa
ComplexityHigh (two codepaths)Low (one codepath)
CorrectnessBatch layer is source of truthStream reprocessing is source of truth
ReprocessingEasy (rerun batch job)Requires Kafka retention + replay
CostHigher (dual infrastructure)Lower
MaturityBattle-testedRequires mature streaming infrastructure
Use whenComplex joins, ML training needs batchEvent-driven, append-only analytics

TIP

Start with Kappa if your analytics are primarily event-driven and append-only. Graduate to Lambda only if you need complex historical reprocessing, ML feature computation, or your streaming pipeline cannot guarantee correctness (e.g., late-arriving data beyond Kafka retention).


Operational Concerns

Data Freshness Monitoring

python
class FreshnessMonitor:
    """Alert when data falls behind real-time."""

    def check_freshness(self, table_name):
        # Query the max timestamp in the OLAP table
        result = self.olap.query(f"""
            SELECT max(timestamp) AS latest
            FROM {table_name}
        """)

        lag_seconds = (datetime.utcnow() - result.latest).total_seconds()

        if lag_seconds > 300:  # 5 minutes
            self.alert(
                severity="warning",
                message=f"{table_name} is {lag_seconds:.0f}s behind real-time"
            )

        if lag_seconds > 900:  # 15 minutes
            self.alert(
                severity="critical",
                message=f"{table_name} is {lag_seconds:.0f}s behind — check ingestion pipeline"
            )

        return lag_seconds

Capacity Planning

DimensionHow to SizeRule of Thumb
Ingestion rateEvents/second x avg event sizeClickHouse: 1M events/s per node
StorageDaily volume x retention x compressionColumn store compresses 5-20x
Query concurrencyDashboard users x refresh rate10-50 concurrent queries typical
MemoryActive partitions + query buffers64-256 GB RAM per node
CPUQuery complexity x concurrency16-64 cores per node

What to Learn Next

TopicLink
ClickHouse vs Druid vs Pinot comparisonOLAP Engine Comparison
Stream processing fundamentalsStream Processing
ETL and ELT pipeline patternsETL Patterns
Kafka and message queuesMessage Queues
Observability and monitoringObservability

Key Takeaways

  1. Column-oriented + compression + vectorized execution — this is why OLAP engines scan billions of rows in seconds
  2. Materialized views are your best friend — pre-aggregate at ingestion time to make dashboard queries instant
  3. Approximate algorithms are essential — HyperLogLog for uniques, t-digest for percentiles, at 1-2% error for 100-1000x speedup
  4. Stream directly into OLAP when possible — most engines have native Kafka connectors; add Flink only when you need enrichment or joins
  5. Start with Kappa, upgrade to Lambda — one codepath is easier to maintain until you genuinely need batch reprocessing
  6. Monitor data freshness — a real-time dashboard showing stale data is worse than no dashboard (it creates false confidence)
  7. Never dual-write — use CDC or Kafka fan-out, not application-level writes to both OLTP and OLAP

"What I cannot create, I do not understand." — Richard Feynman