
This page is part of the ForgeSDLC knowledge base — an AI-assisted, human-directed methodology for taking product work from concept to production. For the core operating model and vocabulary, see the ForgeSDLC overview and What is ForgeSDLC?.

Big data & data engineering body of knowledge

This document maps the core concerns of data engineering — data architecture, pipelines, governance, quality, and lifecycle management — to the blueprint ecosystem.

How data engineering relates to PDLC and SDLC: Data engineering is a cross-cutting discipline that provides data infrastructure for both lifecycles. See BIGDATA-SDLC-PDLC-BRIDGE.md for the full mapping.

Architectures: Data architecture patterns (Lambda, Kappa, data mesh, etc.) are in architectures/.

Technologies: Processing framework and tooling guidance is in technologies/.


1. Data engineering principles

| Principle | Description |
| --- | --- |
| Data as a product | Treat data sets as products with defined consumers, SLOs, documentation, and ownership |
| Automation first | Automate data pipelines, quality checks, and provisioning; minimize manual data handling |
| Schema on read vs. write | Choose a schema enforcement strategy per use case: strict schemas for operational data, flexible schemas for exploratory work |
| Idempotency | Pipelines should produce the same result when run repeatedly with the same input |
| Lineage and observability | Track data from source to consumption; understand transformations; detect anomalies |
| Least privilege | Restrict data access to what each consumer needs; classify data by sensitivity |
| Cost awareness | Storage and processing costs scale with volume; optimize for cost-efficiency without sacrificing quality |
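The idempotency principle can be made concrete with the overwrite-a-partition idiom: a pipeline replaces the whole output partition for its input window instead of appending to it, so a rerun converges to the same state. This is a minimal sketch — the in-memory `store` and the helper name are illustrative, not part of any specific framework:

```python
from typing import Iterable

def load_partition(store: dict, partition_key: str, rows: Iterable[dict]) -> None:
    """Idempotent load: replace the whole partition rather than appending,
    so reruns with the same input yield the same final state."""
    store[partition_key] = list(rows)  # overwrite, never append

# Rerunning the load for the same day leaves the store unchanged.
warehouse: dict = {}
day_rows = [{"id": 1, "amount": 10.0}, {"id": 2, "amount": 5.5}]
load_partition(warehouse, "2024-05-01", day_rows)
load_partition(warehouse, "2024-05-01", day_rows)  # rerun: no duplicate rows
```

The same idiom applies to real warehouses via `INSERT OVERWRITE` on a partition or a delete-then-insert within one transaction.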

2. Data governance

DMBOK knowledge areas (relevant subset)

| Knowledge area | Core question | Key outputs |
| --- | --- | --- |
| Data governance | Who owns data, and how are data decisions made? | Data governance charter, stewardship roles, decision rights |
| Data quality | Is the data fit for its intended use? | Quality rules, profiling results, quality dashboards |
| Metadata management | What does the data mean, and where does it come from? | Data catalog, glossary, lineage graphs |
| Data security | Who can access what data, and how is it protected? | Access policies, encryption standards, audit logs |
| Data architecture | How is data organized, integrated, and made available? | Data models, integration patterns, reference architecture |
| Data integration | How does data flow between systems? | ETL/ELT pipelines, API contracts, CDC streams |
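As one concrete take on the lineage graphs listed under metadata management: lineage can be modeled as a mapping from each dataset to its direct parents, from which the full set of upstream sources is a transitive closure. The dataset names below are hypothetical:

```python
def upstream_sources(lineage: dict, dataset: str) -> set:
    """Transitive upstream closure over a lineage graph
    (dataset -> set of direct parent datasets)."""
    seen: set = set()
    stack = [dataset]
    while stack:
        for parent in lineage.get(stack.pop(), ()):
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen

# Hypothetical lineage: a report built from a cleaned table, itself
# derived from two raw sources.
lineage = {
    "revenue_report": {"orders_clean"},
    "orders_clean": {"orders_raw", "customers_raw"},
}
print(sorted(upstream_sources(lineage, "revenue_report")))
# ['customers_raw', 'orders_clean', 'orders_raw']
```

Production catalogs (e.g. OpenLineage-compatible tools) capture the same graph automatically from pipeline runs; the traversal logic is the same.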

Data quality dimensions

| Dimension | Definition | Measurement example |
| --- | --- | --- |
| Accuracy | Data correctly represents the real-world entity or event | % of records matching an authoritative source |
| Completeness | Required data elements are present and not null | % of non-null values for required fields |
| Consistency | Same data is represented the same way across systems | Cross-system reconciliation match rate |
| Timeliness | Data is available when needed; freshness meets SLA | Data age at consumption time vs. SLA |
| Validity | Data conforms to defined business rules and formats | % of records passing validation rules |
| Uniqueness | No unintended duplicate records | Duplicate detection rate |
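Two of these measurements — completeness and uniqueness — can be computed directly over a batch of records. This is a minimal sketch; the field names are illustrative, and real deployments would use a profiling tool rather than hand-rolled loops:

```python
def completeness(records: list, field: str) -> float:
    """% of records where `field` is present and non-null."""
    if not records:
        return 100.0
    ok = sum(1 for r in records if r.get(field) is not None)
    return 100.0 * ok / len(records)

def uniqueness(records: list, key: str) -> float:
    """% of distinct values for `key` — 100.0 means no duplicates."""
    if not records:
        return 100.0
    keys = [r.get(key) for r in records]
    return 100.0 * len(set(keys)) / len(keys)

rows = [
    {"id": 1, "email": "a@example.com"},
    {"id": 2, "email": None},
    {"id": 2, "email": "b@example.com"},  # duplicate id
]
print(completeness(rows, "email"))  # 2 of 3 non-null -> ~66.7
print(uniqueness(rows, "id"))       # 2 distinct of 3 -> ~66.7
```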

3. Data pipeline patterns

| Pattern | Description | Best fit |
| --- | --- | --- |
| Batch ETL | Extract, transform, load on a schedule (hourly, daily) | Reporting and data warehousing where low latency is not required |
| Batch ELT | Extract, load raw, then transform in place | Cloud data warehouses (BigQuery, Snowflake, Redshift) where compute is elastic |
| Stream processing | Process events in real time or near-real time as they arrive | Real-time analytics, alerting, event-driven architectures |
| Change data capture (CDC) | Capture row-level changes from source databases | Database replication, keeping derived stores in sync |
| Micro-batch | Small, frequent batch jobs (seconds to minutes) | Near-real-time needs without full streaming complexity |
| Data virtualization | Query data in place without moving it | Federation across multiple sources without ETL overhead |
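The CDC pattern reduces, on the consumer side, to applying row-level change events to a derived store. This toy sketch assumes an event shape with an `op` field and a primary key — real CDC tools such as Debezium define their own event envelopes, so treat the structure here as an assumption:

```python
def apply_cdc(target: dict, events: list) -> None:
    """Apply row-level change events to a derived store (CDC consumer sketch).
    Each event carries an op ('upsert' or 'delete'), a primary key, and a row.
    The event shape is hypothetical, not a real tool's format."""
    for ev in events:
        if ev["op"] == "delete":
            target.pop(ev["key"], None)
        else:  # 'upsert' covers both insert and update
            target[ev["key"]] = ev["row"]

replica: dict = {}
apply_cdc(replica, [
    {"op": "upsert", "key": 1, "row": {"name": "alice"}},
    {"op": "upsert", "key": 1, "row": {"name": "alice b."}},  # update wins
    {"op": "upsert", "key": 2, "row": {"name": "bob"}},
    {"op": "delete", "key": 2, "row": None},
])
print(replica)  # {1: {'name': 'alice b.'}}
```

Note that applying the same ordered event log twice yields the same state — CDC apply is itself idempotent in the sense of principle 1.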

4. DataOps

DataOps applies DevOps principles to data:

| DevOps practice | DataOps equivalent |
| --- | --- |
| CI/CD | Automated pipeline deployment; schema migration CI; data transformation testing |
| Infrastructure as Code | Data infrastructure provisioning (databases, warehouses, streaming clusters) via IaC |
| Automated testing | Data quality checks in pipeline; schema validation; reconciliation tests |
| Monitoring/alerting | Data freshness monitoring; pipeline health; quality metric dashboards |
| Incident management | Data incident process — stale data, quality degradation, pipeline failure |
| Version control | Schema versioning; transformation code in git; dbt models versioned |
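The freshness-monitoring row above can be automated as a small check that a CI step or scheduled monitor runs against each dataset's last load time. The SLA value and timestamps here are illustrative assumptions:

```python
from datetime import datetime, timedelta, timezone

def check_freshness(latest_load: datetime, sla: timedelta, now=None) -> bool:
    """Return True when data age is within the SLA; False means the
    dataset is stale and the check (or CI step) should fail."""
    now = now or datetime.now(timezone.utc)
    return (now - latest_load) <= sla

# Illustrative: a 4-hour freshness SLA evaluated at a fixed "now".
now = datetime(2024, 5, 1, 12, 0, tzinfo=timezone.utc)
ok = check_freshness(datetime(2024, 5, 1, 9, 0, tzinfo=timezone.utc),
                     sla=timedelta(hours=4), now=now)      # 3h old -> passes
stale = check_freshness(datetime(2024, 4, 30, 12, 0, tzinfo=timezone.utc),
                        sla=timedelta(hours=4), now=now)   # 24h old -> fails
```

In practice the same predicate feeds an alerting system rather than a boolean return, but the SLA comparison is the core of the check.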

5. Competencies

| Competency | Description |
| --- | --- |
| Data modeling | Designing logical and physical data models — relational, dimensional, graph, document |
| Distributed systems | Understanding partitioning, replication, consistency, and fault tolerance |
| SQL and query optimization | Writing efficient queries; understanding query plans; index design |
| Pipeline engineering | Building reliable, idempotent, observable data pipelines |
| Cloud platforms | Working with cloud data services (storage, compute, streaming, analytics) |
| Data governance | Implementing classification, access control, lineage, and quality |

6. External references

| Topic | URL | Why it is linked |
| --- | --- | --- |
| DAMA DMBOK | https://www.dama.org/cpages/body-of-knowledge | Canonical data management body of knowledge |
| Data Mesh (Zhamak Dehghani) | https://www.datamesh-architecture.com/ | Domain-oriented, product-thinking approach to analytical data |
| Fundamentals of Data Engineering (Reis, Housley) | https://www.oreilly.com/library/view/fundamentals-of-data/9781098108298/ | Modern data engineering lifecycle and practices |
| The Data Warehouse Toolkit (Kimball) | https://www.kimballgroup.com/data-warehouse-business-intelligence-resources/kimball-techniques/ | Dimensional modeling — the standard reference |
| Designing Data-Intensive Applications (Kleppmann) | https://dataintensive.net/ | Distributed data systems — storage, replication, partitioning, batch/stream processing |

Keep project-specific data documentation in docs/product/data/ and data architecture decisions in docs/adr/, not in this file.