Stream Processing

Stream processing is continuous computation over unbounded data: data that keeps arriving and has no defined end. Unlike batch processing, where the full dataset is known upfront, stream processing must handle data as it arrives, making decisions about completeness, ordering, and correctness in real time.

Why Stream Processing

The business case for stream processing grows as the cost of data latency increases:

Use Case	Latency Tolerance	Why Streaming
Fraud detection	< 1 second	Block fraudulent transactions before they complete
Real-time dashboards	< 30 seconds	Operational visibility for live systems
Recommendation engines	< 5 seconds	Personalize based on current session behavior
IoT monitoring	< 10 seconds	Detect equipment anomalies before failure
Event-driven microservices	< 1 second	React to state changes across services
Real-time bidding (AdTech)	< 100 ms	Bid on ad impressions in auction windows

Core Concepts

Event Time vs Processing Time

The distinction between event time and processing time is the most fundamental concept in stream processing. Every event has two timestamps:

Event time: When the event actually occurred (embedded in the data)
Processing time: When the event is processed by the system (wall clock)

Event created at source:     2026-03-17T14:00:00Z  (event time)
Event arrives at Kafka:      2026-03-17T14:00:05Z  (5s network delay)
Event processed by Flink:    2026-03-17T14:00:12Z  (processing time)
                                                    12s total latency

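Why this matters: If you window by processing time, delayed events are counted in the window in which they happen to be processed, not the window in which they occurred. Windowing by event time gives correct results, but it requires watermarks to decide when a window's input is complete.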
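The effect can be checked with the timestamps from the timeline above. The following is a minimal sketch in plain Python, not Flink code; the 10-second tumbling window size and the helper names `to_epoch_ms` and `window_start` are assumptions made for illustration.

```python
from datetime import datetime, timezone

WINDOW_SIZE_MS = 10_000  # assumed 10-second tumbling windows


def to_epoch_ms(iso_timestamp: str) -> int:
    """Convert a UTC timestamp such as 2026-03-17T14:00:00Z to epoch milliseconds."""
    parsed = datetime.strptime(iso_timestamp, "%Y-%m-%dT%H:%M:%SZ")
    return int(parsed.replace(tzinfo=timezone.utc).timestamp() * 1000)


def window_start(timestamp_ms: int, size_ms: int = WINDOW_SIZE_MS) -> int:
    """Return the start of the tumbling window that contains timestamp_ms."""
    return timestamp_ms - (timestamp_ms % size_ms)


event_time_ms = to_epoch_ms("2026-03-17T14:00:00Z")       # when the event occurred
processing_time_ms = to_epoch_ms("2026-03-17T14:00:12Z")  # when it was processed

event_time_window = window_start(event_time_ms)
processing_time_window = window_start(processing_time_ms)

# The same event lands in different windows depending on which clock is used.
print(event_time_window == processing_time_window)  # False
print(processing_time_window - event_time_window)   # 10000
```

With these assumptions, the event belongs to the window starting at 14:00:00 by event time but is counted in the window starting at 14:00:10 by processing time.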

python

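# PyFlink: setting event time semantics
from pyflink.common import Duration
from pyflink.common.watermark_strategy import TimestampAssigner, WatermarkStrategy
from pyflink.datastream import StreamExecutionEnvironment

env = StreamExecutionEnvironment.get_execution_environment()


# Extract the event-time timestamp (epoch milliseconds) from each record
class EventTimestampAssigner(TimestampAssigner):
    def extract_timestamp(self, value, record_timestamp):
        return value['event_timestamp_ms']  # Use event time


# Allow 10 seconds of out-of-orderness before the watermark advances
watermark_strategy = (
    WatermarkStrategy
    .for_bounded_out_of_orderness(Duration.of_seconds(10))
    .with_timestamp_assigner(EventTimestampAssigner())
)

# Attach to a DataStream: stream.assign_timestamps_and_watermarks(watermark_strategy)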
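The watermark rule used above (highest event timestamp seen so far, minus 10 seconds of allowed out-of-orderness) can be traced by hand. The sketch below is plain Python rather than Flink; the class name `BoundedOutOfOrdernessTracker` and the sample timestamps are hypothetical, and a real engine's bookkeeping may differ in detail.

```python
from typing import List, Optional


class BoundedOutOfOrdernessTracker:
    """Tracks the highest event timestamp seen and derives a watermark from it."""

    def __init__(self, max_out_of_orderness_ms: int) -> None:
        self.max_out_of_orderness_ms = max_out_of_orderness_ms
        self.current_max_timestamp: Optional[int] = None

    def watermark(self) -> Optional[int]:
        """Return the current watermark, or None before any event is seen."""
        if self.current_max_timestamp is None:
            return None
        return self.current_max_timestamp - self.max_out_of_orderness_ms

    def observe(self, event_timestamp_ms: int) -> bool:
        """Record an event; return True if it arrived behind the watermark."""
        current = self.watermark()
        is_late = current is not None and event_timestamp_ms < current
        if self.current_max_timestamp is None or event_timestamp_ms > self.current_max_timestamp:
            self.current_max_timestamp = event_timestamp_ms
        return is_late


tracker = BoundedOutOfOrdernessTracker(max_out_of_orderness_ms=10_000)
late_flags: List[bool] = [
    tracker.observe(ts) for ts in [1_000, 15_000, 8_000, 4_000, 20_000]
]

print(late_flags)           # [False, False, False, True, False]
print(tracker.watermark())  # 10000
```

The event at 8,000 ms is out of order but still within the 10-second bound, so it is accepted. The event at 4,000 ms arrives after the watermark has reached 5,000 ms and is flagged as late.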

The Streaming Data Model

                    Unbounded Input
                    ──────────────▶
Events:  [e1] [e2] [e3] [e4] [e5] [e6] [e7] ...

                    ┌────────────────────┐
                    │  Stream Processor   │
                    │                     │
                    │  - Filter           │
                    │  - Map              │
                    │  - Window           │
                    │  - Aggregate        │
                    │  - Join             │
                    │  - State mgmt       │
                    └────────────────────┘

                    Unbounded Output
                    ──────────────▶
Results: [r1]    [r2]    [r3]    [r4]    ...
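The model in the diagram can be approximated with Python generators, which also consume input lazily and emit results as they go. This is a minimal sketch under stated assumptions: the source, the field names, and the functions `event_source`, `keep_nonzero`, `scale`, and `tumbling_sum` are all hypothetical, and events are assumed to arrive in timestamp order, so no watermarks are needed.

```python
from itertools import count, islice
from typing import Dict, Iterable, Iterator, Optional, Tuple


def event_source() -> Iterator[Dict[str, int]]:
    """Hypothetical unbounded source: one reading per second, forever."""
    for i in count():
        yield {"event_timestamp_ms": i * 1_000, "value": i % 4}


def keep_nonzero(events: Iterable[Dict[str, int]]) -> Iterator[Dict[str, int]]:
    """Filter: drop readings whose value is zero."""
    return (e for e in events if e["value"] != 0)


def scale(events: Iterable[Dict[str, int]]) -> Iterator[Tuple[int, int]]:
    """Map: convert each event to a (timestamp, scaled value) pair."""
    return ((e["event_timestamp_ms"], e["value"] * 10) for e in events)


def tumbling_sum(pairs: Iterable[Tuple[int, int]], size_ms: int) -> Iterator[Tuple[int, int]]:
    """Window + aggregate: emit (window_start, sum) when a window closes.

    Assumes in-order input: a window closes when the first event of a
    later window arrives.
    """
    current_window: Optional[int] = None
    total = 0  # running sum is the operator's state
    for timestamp_ms, value in pairs:
        start = timestamp_ms - (timestamp_ms % size_ms)
        if current_window is None:
            current_window = start
        elif start != current_window:
            yield current_window, total
            current_window, total = start, 0
        total += value


results = tumbling_sum(scale(keep_nonzero(event_source())), size_ms=5_000)

# The input never ends, so take only the first three results.
first_three = list(islice(results, 3))
print(first_three)  # [(0, 60), (5000, 70), (10000, 80)]
```

Each stage pulls one event at a time, so the pipeline never needs the full input in memory. Only the running sum and the current window start are held as state.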

Technology Landscape

Technology	Strengths	Weaknesses	Best For
Apache Flink	True streaming, advanced windowing, exactly-once	Operationally complex, JVM-heavy	Complex event processing, stateful streaming
Kafka Streams	Lightweight, embedded in app, Kafka-native	Limited to Kafka ecosystem	Microservice event processing
Apache Spark Structured Streaming	Unified batch+stream, Python support	Micro-batch by default (not true streaming), higher latency	Teams already using Spark
Apache Beam	Portability across runners	Abstraction overhead	Multi-cloud, runner flexibility
Amazon Kinesis	Managed, AWS-native	AWS lock-in, limited features	Simple AWS streaming pipelines
Flink SQL	SQL interface for streaming	Limited for complex logic	Stream analytics for SQL-proficient teams

Section Contents

This section covers stream processing in depth:

Windowing — Tumbling, sliding, session, and global windows
Watermarks — Tracking event-time progress and completeness
Exactly-Once Processing — Achieving end-to-end exactly-once semantics
State Management — Keyed state, state backends, checkpointing
Backpressure — Handling producers faster than consumers