Skip to content

Analytics Pipeline

What This Pipeline Does

This section documents a complete, production-grade analytics pipeline capable of ingesting billions of events per day, storing them efficiently, and serving complex analytical queries in milliseconds.

The pipeline covers:

  • Event collection — HTTP endpoints, SDK design, server-side event emission
  • Streaming transport — Kafka for durability, backpressure, and replayability
  • Columnar storage — ClickHouse for sub-second analytical queries
  • Query layer — SQL functions, materialized views, and GraphQL API

Why Not Just Use PostgreSQL?

PostgreSQL is excellent for transactional data. It fails for analytics at scale because:

CharacteristicOLTP (PostgreSQL)OLAP (ClickHouse)
WorkloadMany small reads/writesFew large reads
StorageRow-orientedColumn-oriented
Query typePoint lookups, joinsAggregations, scans
Index strategyB-tree on individual rowsColumn-level compression
1B row query100s of seconds< 1 second
Compression1x (row store)10–50x (column store)

For a billion-event analytics table, ClickHouse stores it in ~10 GB. PostgreSQL would need ~500 GB for the same data with no indexes, or ~2TB with indexes.

Architecture Diagram

Section Contents

PageTopics
ArchitectureFull platform design, component decisions, data flow
Event SchemaEvent taxonomy, TypeScript types, schema registry
IngestionHTTP collection endpoint, client SDK, server-side events
ClickHouse StorageTables, ReplicatedMergeTree, materialized views, TTL
Query EngineFunnel SQL, retention analysis, caching, GraphQL API

Scale Targets

This architecture is designed for:

MetricTarget
Ingestion rate100K events/second peak
Event size1 KB average
Retention2 years raw, 5 years aggregated
Query latencyP50 < 100ms, P99 < 2s
Dashboard freshness< 5 minutes
API uptime99.9%

These targets reflect a mid-size SaaS company (1–10M monthly active users). Adjust based on your scale.

Technology Choices

ComponentChoiceAlternativesWhy
Event transportApache KafkaKinesis, PulsarHighest throughput, battle-tested
OLAP storeClickHouseBigQuery, Snowflake, DruidSelf-hosted, fastest queries, best cost
Stream processingKafka StreamsFlink, Spark StreamingLow ops overhead
Query layerTypeScript + SQLdbt, MetabaseFull control over API
Object storageS3GCS, Azure BlobIndustry standard

TIP

If you're just starting and don't have billions of events yet, consider starting with PostgreSQL + TimescaleDB extension. Migrate to ClickHouse when queries start taking > 5 seconds. The event schema and ingestion pipeline remain the same either way.

"What I cannot create, I do not understand." — Richard Feynman