Real-Time Data Architecture: From Batch to Streaming at Scale
A comprehensive guide to event-driven systems, stream processing, change data capture, and modern analytics for data-intensive applications
Executive Summary
Data latency is the new competitive battlefield. Organisations that can process, analyse, and act on data in real time deliver superior customer experiences, detect fraud faster, optimise operations dynamically, and make decisions before competitors have even seen the signal.
This whitepaper provides data engineers and architects with a comprehensive guide to real-time data architecture. It covers streaming platforms, event-driven patterns, change data capture, stream processing, data mesh, and real-time analytics - with production-tested patterns and technology selection frameworks.
Key findings:
- Organisations with real-time analytics report 3× faster decision-making and 40% improvement in operational KPIs
- Change Data Capture (CDC) reduces data integration latency from hours to sub-second without application changes
- Event-driven architectures with stream processing handle 10–100× more throughput than equivalent REST-based systems
- Data mesh approaches reduce data team bottlenecks by 60% while improving data quality through domain ownership
Who this is for: VP Data Engineering, Data Architects, Principal Engineers, and Analytics Leaders building or modernising data platforms.
Streaming Platforms
Kafka, Pulsar, and Kinesis: The Selection Framework
| Dimension | Apache Kafka | Apache Pulsar | AWS Kinesis | Winner | |-----------|-------------|---------------|-------------|--------| | Throughput | 2M+ msg/sec per broker | 2M+ msg/sec per broker | 1M+ msg/sec per shard | Kafka/Pulsar | | Latency (p99) | 10–50ms | 5–20ms | 100–200ms | Pulsar | | Multi-tenancy | Good | Excellent (built-in) | Good | Pulsar | | Geo-replication | MirrorMaker | Built-in | Cross-region streams | Pulsar | | Ecosystem | Mature (Confluent) | Growing | AWS-native | Kafka | | Operational complexity | Medium | Medium-High | Low | Kinesis | | Cost (self-managed) | Infrastructure only | Infrastructure only | Per-shard pricing | Kafka/Pulsar |
Decision matrix:
- AWS-only, low operational overhead: Kinesis
- Maximum throughput, mature ecosystem, on-prem or multi-cloud: Kafka (Confluent Cloud or self-managed)
- Multi-tenancy, geo-replication, unified queuing and streaming: Pulsar
Kafka Operational Patterns
Sizing:
- 3 brokers minimum for production (5 for critical workloads)
- Plan for 2× current throughput for 12-month growth
- SSD storage with 1:1 replication factor for hot data
- Separate clusters for ingestion, processing, and analytics if > 500K msg/sec
Partitioning strategy:
- Partition by entity ID (user_id, order_id) for ordering guarantees
- Avoid default partitioning (round-robin) when ordering matters
- Plan for partition count growth; rebalancing is expensive
Data retention:
- Hot data: 7 days on Kafka, queried via ksqlDB or consumer applications
- Warm data: 30–90 days in object storage (S3 with Kafka Tiered Storage)
- Cold data: Indefinite in data lake (Parquet on S3 with Athena/Trino)
Event-Driven Architecture
Event Schema Design
Event schemas are contracts. Breaking them breaks consumers.
Schema evolution rules:
- Use Avro or Protobuf with a schema registry (Confluent Schema Registry)
- Fields may be added (with defaults); fields may be deprecated but never removed
- Version schemas explicitly; enforce compatibility checks in CI/CD
- Include event metadata: event_id, event_type, timestamp, source, correlation_id, schema_version
Example event structure:
{
"event_id": "uuid",
"event_type": "order.placed",
"timestamp": "2026-05-28T14:32:00Z",
"source": "order-service",
"correlation_id": "uuid",
"schema_version": "2.1",
"payload": { ... }
}
The Saga Pattern
For distributed transactions across services, the saga pattern coordinates a sequence of local transactions via events.
Choreography saga: Services react to each other's events. Simple, but hard to trace.
- Order service → OrderPlaced event
- Payment service → PaymentProcessed event
- Inventory service → InventoryReserved event
- Shipping service → ShipmentScheduled event
Orchestration saga: A central coordinator (Saga Orchestrator) manages the flow. Better visibility, but introduces a single point of complexity.
Compensating transactions: Each step must have a rollback action. If payment succeeds but inventory fails, the orchestrator triggers a refund.
CQRS and the Outbox Pattern
CQRS (Command Query Responsibility Segregation): Separate read and write models. Writes go to the transactional database; reads are served from optimised projections.
The Outbox Pattern: To atomically update a database and publish an event:
- Write event to an "outbox" table in the same transaction as the business update
- A separate CDC process reads the outbox table and publishes to Kafka
- Mark events as processed to prevent duplicates
This avoids the dual-write problem (database commit succeeds, but Kafka publish fails) without distributed transactions.
Change Data Capture
CDC Mechanisms
| Mechanism | Trigger | Latency | Impact on Source | Tool | |-----------|---------|---------|-----------------|------| | Log-based | Database write-ahead log | < 1 second | Negligible | Debezium, Maxwell, native | | Polling | Scheduled queries | Minutes | Moderate (query load) | Custom scripts, Fivetran | | Trigger-based | Database triggers | Seconds | High (transaction overhead) | Custom triggers |
Log-based CDC is the production standard. It reads the database's write-ahead log (WAL) directly, capturing every insert, update, and delete with zero application changes and minimal performance impact.
Debezium in Production
Architecture:
MySQL/PostgreSQL → WAL → Debezium Connector → Kafka → Consumers
↓
Schema Registry
Configuration best practices:
- Use
snapshot.mode=when_neededfor initial load;schema_only_recoveryfor restart - Set
max.queue.sizeandmax.batch.sizeto match Kafka producer throughput - Enable heartbeat for idle tables (prevents connector stall on low-activity databases)
- Monitor lag metrics:
debezium.source.database.history.kafka.recovery.poll.interval.ms
Operational challenges:
- Schema changes (ALTER TABLE) require connector restart with schema evolution handling
- Large transactions can cause memory pressure; tune
max.queue.size.in.bytes - Tombstone events for DELETEs must be handled by consumers (compaction vs. soft delete)
Stream Processing
Flink vs. Spark Streaming vs. ksqlDB
| Engine | Latency | Throughput | State Management | Use Case | |--------|---------|-----------|-----------------|----------| | Apache Flink | < 100ms | Very high | Rich (stateful operators, TTL) | Complex event processing, fraud detection, real-time ML | | Spark Structured Streaming | Seconds | High | Basic (mapGroupsWithState) | ETL, analytics, windowed aggregations | | ksqlDB | < 1 second | Medium | Light (tables, windows) | Simple aggregations, filtering, joins | | Kafka Streams | < 100ms | High | Medium (state stores) | Application-level stream processing |
Selection guidance:
- Complex stateful processing (sessionisation, pattern matching): Flink
- Large-scale ETL with some real-time needs: Spark Structured Streaming
- Simple SQL-like operations on Kafka streams: ksqlDB
- Embedded in application, moderate scale: Kafka Streams
Materialised Views on Streams
A materialised view is a continuously updated query result. In streaming:
- Source: Kafka topic (events)
- Processor: Flink job or ksqlDB query
- Sink: Key-value store (Redis, DynamoDB) or database (PostgreSQL, ClickHouse)
- Query API: Low-latency reads from the materialised view
Example: Real-time user activity dashboard
- Events:
page_view,click,purchase→ Kafka - Flink job: aggregate by user, window = 1 minute →
user_activitytable in Redis - Dashboard API: read from Redis with < 10ms latency
Data Mesh
The Four Principles
1. Domain-Oriented Ownership Data is owned by the domain teams that produce it, not a central data team. The order service owns order data; the payment service owns payment data.
2. Data as a Product Each domain treats its data as a product with:
- Clear documentation (schema, semantics, SLAs)
- Discoverability (data catalog, metadata registry)
- Quality guarantees (freshness, completeness, accuracy)
- Accessible interfaces (API, event stream, query endpoint)
3. Self-Serve Data Platform A platform team provides infrastructure for domains to publish, discover, and consume data - without building custom pipelines each time.
4. Federated Governance Global standards (taxonomy, privacy, security) are defined centrally; implementation is domain-owned. Automated policy enforcement (data classification, access control, PII masking) replaces manual review.
Implementation Patterns
| Component | Technology | Purpose | |-----------|-----------|---------| | Data Catalog | DataHub, Collibra, Amundsen | Discoverability, lineage, ownership | | Event Backbone | Kafka, Pulsar | Domain event publishing | | Query Federation | Trino, Presto, Dremio | Cross-domain querying without ETL | | Data Quality | Great Expectations, Soda | Automated quality checks | | Access Control | Apache Ranger, Immuta | Fine-grained, policy-based access |
Real-Time Analytics
ClickHouse, Druid, and Pinot
| Engine | Insertion | Query Latency | Best For | Operational Complexity | |--------|-----------|---------------|----------|----------------------| | ClickHouse | Very fast | < 100ms | Wide range of analytics | Low | | Apache Druid | Fast | < 500ms | Time-series, rollups | Medium | | Apache Pinot | Very fast | < 100ms | User-facing analytics (LinkedIn) | Medium |
ClickHouse is the emerging default for real-time analytics due to its performance, SQL compatibility, and low operational overhead.
Lambda vs. Kappa Architecture
Lambda (dual path):
- Speed layer: Stream processing (Flink) → real-time view (Redis, ClickHouse)
- Batch layer: Scheduled ETL (Spark) → batch view (data lake, warehouse)
- Serving layer: Merge real-time and batch views at query time
Pros: Correctness via batch recomputation. Cons: Complexity of maintaining two code paths.
Kappa (single path):
- Everything is a stream. Batch is a special case of streaming (bounded stream).
- Single processing layer (Flink) writes to serving layer (ClickHouse, Druid)
Pros: Simpler codebase, unified processing. Cons: Harder to correct historical errors; requires exactly-once processing.
Recommendation: Start with Kappa. Add Lambda-style batch recomputation only if you need absolute historical correctness and can tolerate the complexity.
Reference Architecture
Production Event Streaming Platform
Sources:
┌─────────────┐ ┌─────────────┐ ┌─────────────┐
│ Web Apps │ │ Mobile Apps │ │ IoT Devices │
└──────┬──────┘ └──────┬──────┘ └──────┬──────┘
│ │ │
└────────────────┴────────────────┘
│
┌─────┴─────┐
│ Kafka │
│ Cluster │
└─────┬─────┘
│
┌──────────────────┼──────────────────┐
│ │ │
┌──────┴──────┐ ┌──────┴──────┐ ┌──────┴──────┐
│ Flink │ │ ksqlDB │ │ Consumers │
│ (fraud, │ │ (metrics, │ │ (apps, │
│ ML) │ │ analytics)│ │ warehouses)│
└──────┬──────┘ └──────┬──────┘ └──────┬──────┘
│ │ │
┌──────┴──────┐ ┌──────┴──────┐ ┌──────┴──────┐
│ ClickHouse │ │ Redis │ │ Snowflake │
│ (analytics) │ │ (real-time)│ │ (warehouse)│
└─────────────┘ └─────────────┘ └─────────────┘
Observability:
- Kafka: Lag metrics, consumer group health, broker throughput (JMX → Prometheus → Grafana)
- Flink: Job checkpoint duration, backpressure, state size
- ClickHouse: Query latency, merge performance, replication lag
- End-to-end: Event latency from source to sink (OpenTelemetry spans)
Operational runbooks:
- Kafka partition rebalancing procedure
- Flink job failure recovery (savepoint restore)
- ClickHouse replica sync troubleshooting
- CDC connector stall detection and restart
Conclusion
Real-time data architecture is not just about technology selection - it is about rethinking how data flows through your organisation. The shift from batch to streaming changes not just infrastructure but also team structure, application design, and business decision-making.
The organisations that succeed start with a clear use case (fraud detection, real-time personalisation, operational monitoring), build the minimum viable streaming pipeline, and evolve the platform as demand grows. They invest in observability from day one, because debugging distributed stream processing is significantly harder than debugging batch jobs.
Devmonix Technologies designs and operates real-time data platforms for enterprises across fintech, logistics, and digital media. Our data engineering team brings deep expertise in Kafka, Flink, ClickHouse, and data mesh implementations. Whether you are modernising a batch-based data warehouse or building a greenfield event streaming platform, we provide the architecture, implementation, and operational partnership to make it production-grade.
Next step: Request a complimentary Data Architecture Assessment. We will evaluate your current data platform, identify real-time use cases with highest business impact, and deliver a phased modernisation roadmap.
Strategic Report · 2026
Download the Full Report
An in-depth technical guide for data engineers and architects covering streaming platforms, event-driven architecture, CDC, data mesh, and real-time analytics with practical implementation patterns and technology selection frameworks.
What's Inside
- 1
Executive Summary - why real-time data has become a competitive differentiator and what it takes to build it
- 2
Streaming Platforms - Kafka, Pulsar, and Kinesis: selection, sizing, and operational patterns
- 3
Event-Driven Architecture - event schemas, sagas, CQRS, and the outbox pattern
- 4
Change Data Capture - Debezium, Maxwell, and native CDC: implementation and gotchas
- 5
Stream Processing - Flink, Spark Streaming, ksqlDB, and materialised views
- 6
Data Mesh - domain-oriented ownership, federated governance, and self-serve data platforms
- 7
Real-Time Analytics - ClickHouse, Druid, Pinot, and the Lambda vs. Kappa architecture debate
- 8
Reference Architecture - a production-ready event streaming platform with full observability
Related Reports
Start a conversation
Tell us about your project and we'll architect a solution that fits your team, timeline, and goals.
Start Your Transformation Today.
Let's explore how Devmonix Technologies can drive success for your business.