This page is part of the ForgeSDLC knowledge base — an AI-assisted, human-directed methodology for taking product work from concept to production. For the core operating model and vocabulary, see Forge SDLC overview and What is ForgeSDLC?.

Lambda, Kappa & unified data architectures

Purpose: Project-agnostic guide to batch–stream hybrid and unified streaming architectures: how they evolved, how they differ, and how to choose among them — including lakehouse as a common unified-storage pattern.

Audience: Teams using blueprints/disciplines/data/bigdata/. Pair with BIGDATA.md §1 (principles) for the broader decision framework.

1. Overview: from batch-only to unified streaming

Early big-data systems were batch-only: periodic jobs recomputed views from full or incremental dumps. That model is simple and can be very accurate for historical analytics, but latency is bounded by the batch interval.

Lambda architecture added a speed layer so low-latency views could coexist with a batch layer that recomputes authoritative state. Kappa architecture questioned whether two paths were necessary: if the immutable event log is the system of record, a single stream-processing stack can often replace batch by replaying the log.

Unified approaches (modern stream processors with batch APIs, or lakehouse table formats with ACID + time travel) aim to one engine and one storage abstraction for both reprocessing and serving, reducing operational duplication while preserving flexibility.

2. Lambda architecture (deep dive)

Lambda splits work across three conceptual layers:

Layer	Role	Typical traits
Batch layer	Precompute complete, accurate views from raw/master data	High latency (hours–days acceptable); full recomputation or large incremental windows
Speed layer	Compensate for batch lag with incremental or approximate updates	Low latency (seconds–minutes); may diverge until batch catches up
Serving layer	Merge batch and real-time outputs for queries	Key-value or low-latency serving; clients see a unified read API

When it shines: You need provably correct historical recomputation (e.g., late-arriving facts, complex corrections) and sub-batch latency for some use cases.

3. Kappa architecture (deep dive)

Kappa removes the separate batch path: stream processing consumes an append-only log; reprocessing means resetting offsets or replaying topics through the same topology (often with upgraded code).

When it shines: The domain can be modeled as events, idempotent sinks exist, and replay is an acceptable substitute for heavyweight batch recompute.

4. Comparison matrix: Lambda vs Kappa vs unified

Dimension	Lambda	Kappa	Unified (e.g., Flink batch+stream, Databricks/lakehouse)
Complexity	High (two code paths + merge)	Medium (one path; replay discipline)	Medium–high (one platform; still many knobs)
Latency	Speed layer can be very low	Low if engine supports it	Low to moderate depending on product
Reprocessing	Batch layer is natural	Replay / reset offsets	Replay, time travel, or batch modes on same stack
Consistency	Batch “wins” after merge	Depends on sink semantics + exactly-once	Table formats + transactions improve cross-batch consistency
Operational overhead	Operate batch + stream + serving merge	Operate log + stream + state	Fewer moving parts than classic Lambda; vendor/managed variance
Cost	Often higher (duplicate compute/storage patterns)	Log retention + state store costs	Consolidation can reduce waste; premium managed tiers vary
Team skills	Batch + streaming + serving	Streaming-first + operational log hygiene	Platform-specific expertise (Spark/Flink/Delta, etc.)

5. Decision flowchart (architecture choice)

6. Technology mapping (illustrative)

Pattern	Example stacks (not exhaustive)
Lambda	Hadoop batch (MapReduce / Spark) + Apache Storm (historical) + serving DB; Spark batch + Flink/Kafka Streams speed layer
Kappa	Apache Kafka + Apache Flink (or Kafka Streams) + compacted topics / materialized views
Unified / lakehouse	Databricks (Delta Lake), Apache Iceberg or Hudi on object storage + Spark/Flink/Trino; managed lakehouse offerings

Tools evolve: treat this table as family resemblance, not a rigid taxonomy.

7. Lakehouse: lake + warehouse concerns

A lakehouse keeps cheap object storage as the main store while adding warehouse-like features: transactions, schema enforcement, time travel, and incremental processing against open table formats.

Format	Notable strengths	Typical integration notes
Delta Lake	ACID, time travel, wide Spark/Databricks ecosystem	Strong in Spark-first shops; UniForm / multi-format bridges vary by vendor
Apache Iceberg	Partition evolution, hidden partitioning, strong catalog story	Popular for open, multi-engine (Spark, Flink, Trino) neutrality
Apache Hudi	Incremental processing, record-level upserts	Common for CDC-heavy, near-real-time lake patterns

Lakehouse does not by itself replace organizational patterns (e.g., data mesh); it is primarily a storage and execution unification play.

8. Data quality by architecture

Concern	Lambda	Kappa	Unified / lakehouse
Where validation runs	Batch jobs (authoritative); speed layer (subset / heuristic)	Stream validators; replay jobs for regression	Bronze/silver/gold or equivalent zones; stream + batch on same tables
Schema enforcement	Strong in batch; speed layer may use looser contracts	Registry + compatibility; stream–table contracts	Table constraints, constraints in engines, catalog policies
Dead letter queues	Less central; batch quarantine tables	Critical — poison messages and bad keys	Same as stream + batch: DLQ topics and rejected row tables

9. Migration patterns

From	Toward	Practical pattern
Batch-only	+ streaming	Add a log at source; implement speed views; keep batch as source of truth until merge semantics are trusted
Lambda	Unified / Kappa	Strangler: move merge logic toward single store; shorten batch cycle; prove replay replaces batch for subsets
Legacy DW	Lakehouse	Land raw in object storage; incremental ELT; virtualize or migrate marts; retire duplicate copies deliberately

10. Anti-patterns

Anti-pattern	Why it hurts
Lambda without a clear merge story	Inconsistent reads; “which number is true?” becomes permanent
Kappa ignoring late data	Silent drift; dashboards disagree with finance after corrections
Schema-on-read chaos in the lake	Unbounded consumer breakage; untestable pipelines
Unified platform, dual reality	One vendor stack but teams still run shadow batch in spreadsheets

11. External references

Reference	Why read it
Nathan Marz, Big Data: Principles and best practices of scalable realtime data systems (Manning)	Original Lambda framing and motivation
Jay Kreps, “Questioning the Lambda Architecture” (O’Reilly Radar, 2014)	Kappa-style critique and stream-log centrality
Databricks lakehouse papers and product docs	Unified storage + engine positioning (evaluate vendor claims against your workloads)

Keep project-specific data architecture decisions in docs/adr/ and pipeline documentation in docs/development/, not in this file.

Navigate