This page is part of the ForgeSDLC knowledge base — an AI-assisted, human-directed methodology for taking product work from concept to production. For the core operating model and vocabulary, see Forge SDLC overview and What is ForgeSDLC?.

Data processing engines & platforms

Purpose: Project-agnostic guide to selecting and reasoning about batch, stream, and interactive processing engines — Spark, Flink, Kafka ecosystem, Beam, and common cloud managed services.

Audience: Teams using blueprints/disciplines/data/bigdata/. Align engine choice with architecture patterns in architectures/README.md and principles in BIGDATA.md.

1. Overview

Processing technology should follow workload shape: data volume, latency target, state requirements, semantics (at-least-once vs exactly-once), windowing, and operational constraints (managed vs self-run). No single engine wins all dimensions; the goal is fit and team leverage.

2. Processing paradigm comparison

Paradigm	Examples	Typical latency	Notes
Batch	MapReduce, Spark batch	Minutes to days	Full scans, large aggregations, cheap fault tolerance via replay
Micro-batch	Spark Structured Streaming (default)	Seconds to minutes	Simplicity; “almost streaming”
Stream	Flink, Kafka Streams	Sub-second to seconds	Event-time, state, complex CEP patterns
Interactive SQL	Trino/Presto, BigQuery, Redshift	Seconds (varies)	Ad hoc analytics over federated or warehouse data

3. Apache Spark (deep dive)

Architecture: Driver holds the SparkContext / session, schedules stages and tasks; executors run tasks on cluster workers. A cluster manager (standalone, YARN, Kubernetes) allocates resources.

APIs (conceptual lineage): RDD (low-level resilience), DataFrame (Catalyst-optimized columnar plans), Dataset (typed, language-dependent), Spark SQL (SQL over DataFrames). Most teams standardize on DataFrame/SQL.

Deployment: Standalone for dev/small clusters; YARN in legacy Hadoop estates; Kubernetes for cloud-native; Databricks for managed Spark with opinionated lakehouse integration.

Fit: Large batch ETL/ELT, ML feature engineering at scale, lakehouse SQL — avoid for tiny data where overhead dominates.

4. Apache Flink (deep dive)

Architecture: JobManager coordinates execution, checkpoints, recovery; TaskManagers run operators in parallel subtasks with managed state backends.

APIs: DataStream (primary for event-time streaming), Table API, Flink SQL — unified in planner for many pipelines.

Exactly-once (typical pattern): Checkpointing persists operator state + source offsets in a barrier-aligned snapshot; on failure, replay from last consistent checkpoint with idempotent sinks or transactional writes.

Watermarks bound event-time progress; event time vs processing time determines correctness under delay.

Fit: Low-latency aggregation, session windows, complex event-time logic, unified batch+stream on one engine.

5. Apache Kafka & Kafka Streams

Kafka is a durable, partitioned log: producers append; consumers read with offsets; retention enables replay.

Kafka Streams embeds stream processing in JVM applications: state stores (RocksDB), repartition topics, interactive queries in some setups.

ksqlDB (ecosystem) exposes SQL over streams and tables for faster path analytics — product positioning evolves; validate against current Confluent/docs.

Fit: Event-driven microservices, stream processing co-located with app deployment, log-centric architectures.

6. Comparison matrix: Spark vs Flink vs Kafka Streams vs Beam

Dimension	Spark	Flink	Kafka Streams	Beam
Latency	Micro-batch default; continuous modes vary	Strong stream latency story	Low for per-app topologies	Depends on runner (Flink, Dataflow, Spark)
Throughput	Very high batch	High stream/batch	High per cluster of apps	Runner-dependent
State management	Structured Streaming state	First-class, rich	Embedded stores	Abstracted; runner implements
Exactly-once	Sink-dependent; transactional sinks	Strong checkpoint narrative	EOS with idempotent/txn sinks	Runner-dependent
Windowing	Rich	Rich event-time	Rich for JVM apps	Model portable
Ecosystem	Massive (SQL, ML, lake)	Strong stream + growing SQL	Kafka-centric	Portable pipelines
Managed offerings	EMR, Dataproc, Databricks, Glue	Managed Flink (various vendors)	Confluent Cloud patterns	GCP Dataflow, others via runners

7. Cloud-managed services (illustrative)

Provider	Examples	Engine / model notes
Databricks	Spark-first lakehouse	Unity Catalog, Delta; premium support for unified analytics
AWS	EMR, Glue, Kinesis, MSK	Mix of Spark, Flink (where offered), streaming ingestion
GCP	Dataflow (Beam), Dataproc, BigQuery	Serverless SQL warehouse + Beam runner maturity
Azure	Synapse, HDInsight, Stream Analytics	SQL + Spark + stream analytics options

Pricing models differ: per DPU/cu-hour, per TB scanned, per slot — normalize to workload replay before committing.

8. Decision flowchart (engine shortlist)

9. Data serialization formats

Format	Schema evolution	Compression	Query performance	Human readability
Avro	Strong with compatibility rules	Good	Good in Hadoop/Spark stacks	Binary
Parquet	Column stats + nested types; often paired with Iceberg/Delta	Excellent	Excellent for analytics	Binary
ORC	Similar columnar niche	Excellent	Strong in Hive/legacy	Binary
Protobuf	Strong for services	Compact	Not a lake-first format	Binary
JSON	Weak without contracts	Poor	Poor at scale	Excellent

Practical pattern: Protobuf/Avro on the wire; Parquet (or ORC) in the lake.

10. Anti-patterns

Anti-pattern	Why it hurts
Wrong engine for workload	Paying cluster cost for interactive latency on batch-sized joins
Premature Spark for small data	Driver/scheduler overhead dominates; use SQL DB or Polars/DuckDB-class tools
Ignoring backpressure	OOM, lag explosions, cascading failures in streaming
Beam without runner discipline	“Portable” on paper but only one runner tested in production

11. External references

Resource	URL
Apache Spark	https://spark.apache.org/
Apache Flink	https://flink.apache.org/
Apache Kafka	https://kafka.apache.org/
Apache Beam	https://beam.apache.org/

Keep project-specific data architecture decisions in docs/adr/ and pipeline documentation in docs/development/, not in this file.

Navigate