2026 · 27 min read

Real-Time Data Architecture: From Batch to Streaming at Scale

A comprehensive guide to event-driven systems, stream processing, change data capture, and modern analytics for data-intensive applications

Executive Summary

Data latency is the new competitive battlefield. Organisations that can process, analyse, and act on data in real time deliver superior customer experiences, detect fraud faster, optimise operations dynamically, and make decisions before competitors have even seen the signal.

This whitepaper provides data engineers and architects with a comprehensive guide to real-time data architecture. It covers streaming platforms, event-driven patterns, change data capture, stream processing, data mesh, and real-time analytics - with production-tested patterns and technology selection frameworks.

Key findings:

Organisations with real-time analytics report 3× faster decision-making and 40% improvement in operational KPIs
Change Data Capture (CDC) reduces data integration latency from hours to sub-second without application changes
Event-driven architectures with stream processing handle 10–100× more throughput than equivalent REST-based systems
Data mesh approaches reduce data team bottlenecks by 60% while improving data quality through domain ownership

Who this is for: VP Data Engineering, Data Architects, Principal Engineers, and Analytics Leaders building or modernising data platforms.

Streaming Platforms

Kafka, Pulsar, and Kinesis: The Selection Framework

| Dimension | Apache Kafka | Apache Pulsar | AWS Kinesis | Winner | |-----------|-------------|---------------|-------------|--------| | Throughput | 2M+ msg/sec per broker | 2M+ msg/sec per broker | 1M+ msg/sec per shard | Kafka/Pulsar | | Latency (p99) | 10–50ms | 5–20ms | 100–200ms | Pulsar | | Multi-tenancy | Good | Excellent (built-in) | Good | Pulsar | | Geo-replication | MirrorMaker | Built-in | Cross-region streams | Pulsar | | Ecosystem | Mature (Confluent) | Growing | AWS-native | Kafka | | Operational complexity | Medium | Medium-High | Low | Kinesis | | Cost (self-managed) | Infrastructure only | Infrastructure only | Per-shard pricing | Kafka/Pulsar |

Decision matrix:

AWS-only, low operational overhead: Kinesis
Maximum throughput, mature ecosystem, on-prem or multi-cloud: Kafka (Confluent Cloud or self-managed)
Multi-tenancy, geo-replication, unified queuing and streaming: Pulsar

Kafka Operational Patterns

Sizing:

3 brokers minimum for production (5 for critical workloads)
Plan for 2× current throughput for 12-month growth
SSD storage with 1:1 replication factor for hot data
Separate clusters for ingestion, processing, and analytics if > 500K msg/sec

Partitioning strategy:

Partition by entity ID (user_id, order_id) for ordering guarantees
Avoid default partitioning (round-robin) when ordering matters
Plan for partition count growth; rebalancing is expensive

Data retention:

Hot data: 7 days on Kafka, queried via ksqlDB or consumer applications
Warm data: 30–90 days in object storage (S3 with Kafka Tiered Storage)
Cold data: Indefinite in data lake (Parquet on S3 with Athena/Trino)

Event-Driven Architecture

Event Schema Design

Event schemas are contracts. Breaking them breaks consumers.

Schema evolution rules:

Use Avro or Protobuf with a schema registry (Confluent Schema Registry)
Fields may be added (with defaults); fields may be deprecated but never removed
Version schemas explicitly; enforce compatibility checks in CI/CD
Include event metadata: event_id, event_type, timestamp, source, correlation_id, schema_version

Example event structure:

{
  "event_id": "uuid",
  "event_type": "order.placed",
  "timestamp": "2026-05-28T14:32:00Z",
  "source": "order-service",
  "correlation_id": "uuid",
  "schema_version": "2.1",
  "payload": { ... }
}

The Saga Pattern

For distributed transactions across services, the saga pattern coordinates a sequence of local transactions via events.

Choreography saga: Services react to each other's events. Simple, but hard to trace.

Order service → OrderPlaced event
Payment service → PaymentProcessed event
Inventory service → InventoryReserved event
Shipping service → ShipmentScheduled event

Orchestration saga: A central coordinator (Saga Orchestrator) manages the flow. Better visibility, but introduces a single point of complexity.

Compensating transactions: Each step must have a rollback action. If payment succeeds but inventory fails, the orchestrator triggers a refund.

CQRS and the Outbox Pattern

CQRS (Command Query Responsibility Segregation): Separate read and write models. Writes go to the transactional database; reads are served from optimised projections.

The Outbox Pattern: To atomically update a database and publish an event:

Write event to an "outbox" table in the same transaction as the business update
A separate CDC process reads the outbox table and publishes to Kafka
Mark events as processed to prevent duplicates

This avoids the dual-write problem (database commit succeeds, but Kafka publish fails) without distributed transactions.

Change Data Capture

CDC Mechanisms

| Mechanism | Trigger | Latency | Impact on Source | Tool | |-----------|---------|---------|-----------------|------| | Log-based | Database write-ahead log | < 1 second | Negligible | Debezium, Maxwell, native | | Polling | Scheduled queries | Minutes | Moderate (query load) | Custom scripts, Fivetran | | Trigger-based | Database triggers | Seconds | High (transaction overhead) | Custom triggers |

Log-based CDC is the production standard. It reads the database's write-ahead log (WAL) directly, capturing every insert, update, and delete with zero application changes and minimal performance impact.

Debezium in Production

Architecture:

MySQL/PostgreSQL → WAL → Debezium Connector → Kafka → Consumers
                                                    ↓
                                              Schema Registry

Configuration best practices:

Use snapshot.mode=when_needed for initial load; schema_only_recovery for restart
Set max.queue.size and max.batch.size to match Kafka producer throughput
Enable heartbeat for idle tables (prevents connector stall on low-activity databases)
Monitor lag metrics: debezium.source.database.history.kafka.recovery.poll.interval.ms

Operational challenges:

Schema changes (ALTER TABLE) require connector restart with schema evolution handling
Large transactions can cause memory pressure; tune max.queue.size.in.bytes
Tombstone events for DELETEs must be handled by consumers (compaction vs. soft delete)

Stream Processing

Flink vs. Spark Streaming vs. ksqlDB

| Engine | Latency | Throughput | State Management | Use Case | |--------|---------|-----------|-----------------|----------| | Apache Flink | < 100ms | Very high | Rich (stateful operators, TTL) | Complex event processing, fraud detection, real-time ML | | Spark Structured Streaming | Seconds | High | Basic (mapGroupsWithState) | ETL, analytics, windowed aggregations | | ksqlDB | < 1 second | Medium | Light (tables, windows) | Simple aggregations, filtering, joins | | Kafka Streams | < 100ms | High | Medium (state stores) | Application-level stream processing |

Selection guidance:

Complex stateful processing (sessionisation, pattern matching): Flink
Large-scale ETL with some real-time needs: Spark Structured Streaming
Simple SQL-like operations on Kafka streams: ksqlDB
Embedded in application, moderate scale: Kafka Streams

Materialised Views on Streams

A materialised view is a continuously updated query result. In streaming:

Source: Kafka topic (events)
Processor: Flink job or ksqlDB query
Sink: Key-value store (Redis, DynamoDB) or database (PostgreSQL, ClickHouse)
Query API: Low-latency reads from the materialised view

Example: Real-time user activity dashboard

Events: page_view, click, purchase → Kafka
Flink job: aggregate by user, window = 1 minute → user_activity table in Redis
Dashboard API: read from Redis with < 10ms latency

Data Mesh

The Four Principles

1. Domain-Oriented Ownership Data is owned by the domain teams that produce it, not a central data team. The order service owns order data; the payment service owns payment data.

2. Data as a Product Each domain treats its data as a product with:

Clear documentation (schema, semantics, SLAs)
Discoverability (data catalog, metadata registry)
Quality guarantees (freshness, completeness, accuracy)
Accessible interfaces (API, event stream, query endpoint)

3. Self-Serve Data Platform A platform team provides infrastructure for domains to publish, discover, and consume data - without building custom pipelines each time.

4. Federated Governance Global standards (taxonomy, privacy, security) are defined centrally; implementation is domain-owned. Automated policy enforcement (data classification, access control, PII masking) replaces manual review.

Implementation Patterns

| Component | Technology | Purpose | |-----------|-----------|---------| | Data Catalog | DataHub, Collibra, Amundsen | Discoverability, lineage, ownership | | Event Backbone | Kafka, Pulsar | Domain event publishing | | Query Federation | Trino, Presto, Dremio | Cross-domain querying without ETL | | Data Quality | Great Expectations, Soda | Automated quality checks | | Access Control | Apache Ranger, Immuta | Fine-grained, policy-based access |

Real-Time Analytics

ClickHouse, Druid, and Pinot

| Engine | Insertion | Query Latency | Best For | Operational Complexity | |--------|-----------|---------------|----------|----------------------| | ClickHouse | Very fast | < 100ms | Wide range of analytics | Low | | Apache Druid | Fast | < 500ms | Time-series, rollups | Medium | | Apache Pinot | Very fast | < 100ms | User-facing analytics (LinkedIn) | Medium |

ClickHouse is the emerging default for real-time analytics due to its performance, SQL compatibility, and low operational overhead.

Lambda vs. Kappa Architecture

Lambda (dual path):

Speed layer: Stream processing (Flink) → real-time view (Redis, ClickHouse)
Batch layer: Scheduled ETL (Spark) → batch view (data lake, warehouse)
Serving layer: Merge real-time and batch views at query time

Pros: Correctness via batch recomputation. Cons: Complexity of maintaining two code paths.

Kappa (single path):

Everything is a stream. Batch is a special case of streaming (bounded stream).
Single processing layer (Flink) writes to serving layer (ClickHouse, Druid)

Pros: Simpler codebase, unified processing. Cons: Harder to correct historical errors; requires exactly-once processing.

Recommendation: Start with Kappa. Add Lambda-style batch recomputation only if you need absolute historical correctness and can tolerate the complexity.

Reference Architecture

Production Event Streaming Platform

Sources:
┌─────────────┐  ┌─────────────┐  ┌─────────────┐
│ Web Apps    │  │ Mobile Apps │  │ IoT Devices │
└──────┬──────┘  └──────┬──────┘  └──────┬──────┘
       │                │                │
       └────────────────┴────────────────┘
                          │
                    ┌─────┴─────┐
                    │  Kafka    │
                    │  Cluster  │
                    └─────┬─────┘
                          │
       ┌──────────────────┼──────────────────┐
       │                  │                  │
┌──────┴──────┐  ┌──────┴──────┐  ┌──────┴──────┐
│  Flink      │  │  ksqlDB     │  │  Consumers  │
│  (fraud,    │  │  (metrics,  │  │  (apps,     │
│   ML)       │  │   analytics)│  │   warehouses)│
└──────┬──────┘  └──────┬──────┘  └──────┬──────┘
       │                  │                  │
┌──────┴──────┐  ┌──────┴──────┐  ┌──────┴──────┐
│ ClickHouse  │  │  Redis      │  │  Snowflake  │
│ (analytics) │  │  (real-time)│  │  (warehouse)│
└─────────────┘  └─────────────┘  └─────────────┘

Observability:

Kafka: Lag metrics, consumer group health, broker throughput (JMX → Prometheus → Grafana)
Flink: Job checkpoint duration, backpressure, state size
ClickHouse: Query latency, merge performance, replication lag
End-to-end: Event latency from source to sink (OpenTelemetry spans)

Operational runbooks:

Kafka partition rebalancing procedure
Flink job failure recovery (savepoint restore)
ClickHouse replica sync troubleshooting
CDC connector stall detection and restart

Conclusion

Real-time data architecture is not just about technology selection - it is about rethinking how data flows through your organisation. The shift from batch to streaming changes not just infrastructure but also team structure, application design, and business decision-making.

The organisations that succeed start with a clear use case (fraud detection, real-time personalisation, operational monitoring), build the minimum viable streaming pipeline, and evolve the platform as demand grows. They invest in observability from day one, because debugging distributed stream processing is significantly harder than debugging batch jobs.

Devmonix Technologies designs and operates real-time data platforms for enterprises across fintech, logistics, and digital media. Our data engineering team brings deep expertise in Kafka, Flink, ClickHouse, and data mesh implementations. Whether you are modernising a batch-based data warehouse or building a greenfield event streaming platform, we provide the architecture, implementation, and operational partnership to make it production-grade.

Next step: Request a complimentary Data Architecture Assessment. We will evaluate your current data platform, identify real-time use cases with highest business impact, and deliver a phased modernisation roadmap.

Strategic Report · 2026

Download the Full Report

An in-depth technical guide for data engineers and architects covering streaming platforms, event-driven architecture, CDC, data mesh, and real-time analytics with practical implementation patterns and technology selection frameworks.

Download PDF

What's Inside

1
Executive Summary - why real-time data has become a competitive differentiator and what it takes to build it
2
Streaming Platforms - Kafka, Pulsar, and Kinesis: selection, sizing, and operational patterns
3
Event-Driven Architecture - event schemas, sagas, CQRS, and the outbox pattern
4
Change Data Capture - Debezium, Maxwell, and native CDC: implementation and gotchas
5
Stream Processing - Flink, Spark Streaming, ksqlDB, and materialised views
6
Data Mesh - domain-oriented ownership, federated governance, and self-serve data platforms
7
Real-Time Analytics - ClickHouse, Druid, Pinot, and the Lambda vs. Kappa architecture debate
8
Reference Architecture - a production-ready event streaming platform with full observability

Related Reports

Platform Engineering

Platform Engineering: Building Internal Developer Platforms That Scale

25 min read Security

Zero Trust Security Architecture for Modern Applications

26 min read Cloud & FinOps

Cloud Cost Engineering: The FinOps Playbook for Scale

24 min read

Start a conversation

Tell us about your project and we'll architect a solution that fits your team, timeline, and goals.

Strategic Report · 2026

Download the Full Report

Download PDF

What's Inside

1
Executive Summary - why real-time data has become a competitive differentiator and what it takes to build it
2
Streaming Platforms - Kafka, Pulsar, and Kinesis: selection, sizing, and operational patterns
3
Event-Driven Architecture - event schemas, sagas, CQRS, and the outbox pattern
4
Change Data Capture - Debezium, Maxwell, and native CDC: implementation and gotchas
5
Stream Processing - Flink, Spark Streaming, ksqlDB, and materialised views
6
Data Mesh - domain-oriented ownership, federated governance, and self-serve data platforms
7
Real-Time Analytics - ClickHouse, Druid, Pinot, and the Lambda vs. Kappa architecture debate
8
Reference Architecture - a production-ready event streaming platform with full observability

Related Reports

Platform Engineering

Platform Engineering: Building Internal Developer Platforms That Scale

25 min read Security

Zero Trust Security Architecture for Modern Applications

26 min read Cloud & FinOps

Cloud Cost Engineering: The FinOps Playbook for Scale

24 min read

Start a conversation

Tell us about your project and we'll architect a solution that fits your team, timeline, and goals.

✓Response within 24 hours
✓No-commitment discovery call
✓Fixed-price or T&M engagements
✓95% client satisfaction rate

Start Your Transformation Today.

Let's explore how Devmonix Technologies can drive success for your business.

{ "event_id": "uuid", "event_type": "order.placed", "timestamp": "2026-05-28T14:32:00Z", "source": "order-service", "correlation_id": "uuid", "schema_version": "2.1", "payload": { ... } }

Sources: ┌─────────────┐ ┌─────────────┐ ┌─────────────┐ │ Web Apps │ │ Mobile Apps │ │ IoT Devices │ └──────┬──────┘ └──────┬──────┘ └──────┬──────┘ │ │ │ └────────────────┴────────────────┘ │ ┌─────┴─────┐ │ Kafka │ │ Cluster │ └─────┬─────┘ │ ┌──────────────────┼──────────────────┐ │ │ │ ┌──────┴──────┐ ┌──────┴──────┐ ┌──────┴──────┐ │ Flink │ │ ksqlDB │ │ Consumers │ │ (fraud, │ │ (metrics, │ │ (apps, │ │ ML) │ │ analytics)│ │ warehouses)│ └──────┬──────┘ └──────┬──────┘ └──────┬──────┘ │ │ │ ┌──────┴──────┐ ┌──────┴──────┐ ┌──────┴──────┐ │ ClickHouse │ │ Redis │ │ Snowflake │ │ (analytics) │ │ (real-time)│ │ (warehouse)│ └─────────────┘ └─────────────┘ └─────────────┘