This page is part of the ForgeSDLC knowledge base — an AI-assisted, human-directed methodology for taking product work from concept to production. For the core operating model and vocabulary, see Forge SDLC overview and What is ForgeSDLC?.
Big data & data engineering body of knowledge
This document maps the core concerns of data engineering — data architecture, pipelines, governance, quality, and lifecycle management — to the blueprint ecosystem.
How data engineering relates to PDLC and SDLC: Data engineering is a cross-cutting discipline that provides data infrastructure for both lifecycles. See BIGDATA-SDLC-PDLC-BRIDGE.md for the full mapping.
Architectures: Data architecture patterns (Lambda, Kappa, data mesh, etc.) are in architectures/.
Technologies: Processing framework and tooling guidance is in technologies/.
1. Data engineering principles
| Principle | Description |
|---|---|
| Data as a product | Treat data sets as products with defined consumers, SLOs, documentation, and ownership |
| Automation first | Automate data pipelines, quality checks, and provisioning; minimize manual data handling |
| Schema on read vs write | Choose schema enforcement strategy based on use case — strict schemas for operational data, flexible for exploratory |
| Idempotency | Data pipelines should produce the same result when run multiple times with the same input |
| Lineage and observability | Track data from source to consumption; understand transformations; detect anomalies |
| Least privilege | Restrict data access to what each consumer needs; classify data by sensitivity |
| Cost awareness | Data storage and processing costs scale with volume; optimize for cost-efficiency without sacrificing quality |
2. Data governance
DMBOK knowledge areas (relevant subset)
| Knowledge area | Core question | Key outputs |
|---|---|---|
| Data governance | Who owns data, and how are data decisions made? | Data governance charter, stewardship roles, decision rights |
| Data quality | Is the data fit for its intended use? | Quality rules, profiling results, quality dashboards |
| Metadata management | What does the data mean, and where does it come from? | Data catalog, glossary, lineage graphs |
| Data security | Who can access what data, and how is it protected? | Access policies, encryption standards, audit logs |
| Data architecture | How is data organized, integrated, and made available? | Data models, integration patterns, reference architecture |
| Data integration | How does data flow between systems? | ETL/ELT pipelines, API contracts, CDC streams |
Data quality dimensions
| Dimension | Definition | Measurement example |
|---|---|---|
| Accuracy | Data correctly represents the real-world entity or event | % of records matching authoritative source |
| Completeness | Required data elements are present and not null | % of non-null values for required fields |
| Consistency | Same data is represented the same way across systems | Cross-system reconciliation match rate |
| Timeliness | Data is available when needed; freshness meets SLA | Data age at consumption time vs SLA |
| Validity | Data conforms to defined business rules and formats | % of records passing validation rules |
| Uniqueness | No unintended duplicate records | Duplicate detection rate |
3. Data pipeline patterns
| Pattern | Description | Best fit |
|---|---|---|
| Batch ETL | Extract, transform, load on a schedule (hourly, daily) | Reporting, data warehousing, low-latency not required |
| Batch ELT | Extract, load raw, then transform in-place | Cloud data warehouses (BigQuery, Snowflake, Redshift) where compute is elastic |
| Stream processing | Process events in real-time or near-real-time as they arrive | Real-time analytics, alerting, event-driven architectures |
| Change data capture (CDC) | Capture row-level changes from source databases | Database replication, keeping derived stores in sync |
| Micro-batch | Small, frequent batch jobs (seconds to minutes) | Near-real-time needs without full streaming complexity |
| Data virtualization | Query data in place without moving it | Federation across multiple sources without ETL overhead |
4. DataOps
DataOps applies DevOps principles to data:
| DevOps practice | DataOps equivalent |
|---|---|
| CI/CD | Automated pipeline deployment; schema migration CI; data transformation testing |
| Infrastructure as Code | Data infrastructure provisioning (databases, warehouses, streaming clusters) via IaC |
| Automated testing | Data quality checks in pipeline; schema validation; reconciliation tests |
| Monitoring/alerting | Data freshness monitoring; pipeline health; quality metric dashboards |
| Incident management | Data incident process — stale data, quality degradation, pipeline failure |
| Version control | Schema versioning; transformation code in git; dbt models versioned |
5. Competencies
| Competency | Description |
|---|---|
| Data modeling | Designing logical and physical data models — relational, dimensional, graph, document |
| Distributed systems | Understanding partitioning, replication, consistency, and fault tolerance |
| SQL and query optimization | Writing efficient queries; understanding query plans; index design |
| Pipeline engineering | Building reliable, idempotent, observable data pipelines |
| Cloud platforms | Working with cloud data services (storage, compute, streaming, analytics) |
| Data governance | Implementing classification, access control, lineage, and quality |
6. External references
| Topic | URL | Why it is linked |
|---|---|---|
| DAMA DMBOK | https://www.dama.org/cpages/body-of-knowledge | Canonical data management body of knowledge |
| Data Mesh (Zhamak Dehghani) | https://www.datamesh-architecture.com/ | Domain-oriented, product-thinking approach to analytical data |
| Fundamentals of Data Engineering (Reis, Housley) | https://www.oreilly.com/library/view/fundamentals-of-data/9781098108298/ | Modern data engineering lifecycle and practices |
| The Data Warehouse Toolkit (Kimball) | https://www.kimballgroup.com/data-warehouse-business-intelligence-resources/kimball-techniques/ | Dimensional modeling — the standard reference |
| Designing Data-Intensive Applications (Kleppmann) | https://dataintensive.net/ | Distributed data systems — storage, replication, partitioning, batch/stream processing |
Keep project-specific data documentation in docs/product/data/ and data architecture decisions in docs/adr/, not in this file.