This page is part of the ForgeSDLC knowledge base — an AI-assisted, human-directed methodology for taking product work from concept to production. For the core operating model and vocabulary, see Forge SDLC overview and What is ForgeSDLC?.
SRE and observability (blueprint)
Site Reliability Engineering (SRE) and observability are complementary practices that ensure production systems are reliable, understandable, and continuously improving. SRE provides the operational framework (SLOs, error budgets, toil reduction); observability provides the technical capability to understand system behavior.
This guide covers SLOs/SLIs/error budgets, the three pillars of observability, alerting philosophy, chaos engineering, on-call practices, and incident learning.
1. SRE foundations
SLOs, SLIs, and error budgets
| Concept | Definition | Example |
| --- | --- | --- |
| SLI (Service Level Indicator) | A quantitative measure of service behavior | Request latency P99, error rate, availability |
| SLO (Service Level Objective) | Target value for an SLI over a time window | P99 latency < 200 ms over 30 days; availability >= 99.9% per quarter |
| SLA (Service Level Agreement) | Contractual commitment (SLO + consequences) | 99.9% uptime; credits issued below threshold |
| Error budget | Allowed unreliability = 1 - SLO | 99.9% SLO → 0.1% error budget → ~43 min downtime/month |
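The error budget arithmetic in the table can be sketched in a few lines. This is a minimal illustration, not a prescribed ForgeSDLC tool: the function names `error_budget_minutes` and `budget_remaining` are hypothetical, and the 30-day default window is an assumption matching the "per month" example.

```python
def error_budget_minutes(slo: float, window_days: int = 30) -> float:
    """Allowed downtime in minutes for an availability SLO over a window.

    slo is a fraction (0.999 for 99.9%); window_days is an assumed default.
    """
    if not 0.0 < slo <= 1.0:
        raise ValueError("slo must be a fraction in (0, 1]")
    window_minutes = window_days * 24 * 60
    return (1.0 - slo) * window_minutes


def budget_remaining(slo: float, good_events: int, total_events: int) -> float:
    """Fraction of the error budget still unspent for an event-based SLI.

    Returns 1.0 when nothing is spent, 0.0 when the budget is exactly used,
    and a negative value when the SLO has been violated.
    """
    if total_events <= 0:
        return 1.0  # no traffic yet, so nothing has been spent
    allowed_bad = (1.0 - slo) * total_events
    if allowed_bad == 0:
        # A 100% SLO leaves no budget: any failure exhausts it.
        return 1.0 if good_events == total_events else float("-inf")
    actual_bad = total_events - good_events
    return 1.0 - actual_bad / allowed_bad
```

For a 99.9% SLO, `error_budget_minutes(0.999)` gives about 43.2 minutes over 30 days, matching the table. With 1,000,000 requests and 500 failures, `budget_remaining(0.999, 999_500, 1_000_000)` reports roughly half the budget left.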
Error budget policy
| Budget status | Action |
| --- | --- |
| Healthy (> 50% remaining) | Normal development velocity; feature work prioritized |
| Caution (25–50% remaining) | Increase review rigor; prioritize reliability improvements alongside features |
| Critical (< 25% remaining) | Freeze non-critical changes; dedicate engineering to reliability work |
| Exhausted (0%) | Stop all feature work until budget recovers; post-incident analysis for each new incident |
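The policy thresholds above can be encoded as a small lookup, for example to gate a deployment pipeline. This is a sketch under assumptions: the function name `budget_status` and the status strings are hypothetical, and a budget that has gone negative is treated as exhausted.

```python
def budget_status(remaining: float) -> str:
    """Map the remaining error budget (as a fraction) to a policy status.

    Thresholds follow the policy table: > 50% healthy, 25-50% caution,
    < 25% critical, 0% or below exhausted.
    """
    if remaining <= 0.0:
        return "exhausted"
    if remaining < 0.25:
        return "critical"
    if remaining <= 0.50:
        return "caution"
    return "healthy"


# Hypothetical mapping from status to the action listed in the policy table.
POLICY_ACTIONS = {
    "healthy": "Normal development velocity; feature work prioritized",
    "caution": "Increase review rigor; prioritize reliability improvements",
    "critical": "Freeze non-critical changes; focus on reliability work",
    "exhausted": "Stop all feature work until budget recovers",
}
```

A caller would look up `POLICY_ACTIONS[budget_status(0.4)]` to get the "caution" action. The boundary values 25% and 50% fall into "caution" here because the table states that band inclusively.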
Toil
| Characteristic | Description |
| --- | --- |
| Manual | Requires human intervention, not automated |
| Repetitive | Happens over and over; not a one-time task |
| Automatable | Could be handled by software |
| Reactive | Triggered by alerts or requests, not proactive |
| Without enduring value | Does not improve the system permanently; must be done again |
Target: Keep toil below 50% of SRE team capacity; invest the remainder in automation and engineering.
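One way to track the toil target is to compare logged toil hours against total team hours. The following is a minimal sketch: the function names and the idea of logging hours per period are assumptions for illustration, not part of the methodology.

```python
def toil_fraction(toil_hours: float, total_hours: float) -> float:
    """Share of team capacity spent on toil in a reporting period."""
    if total_hours <= 0:
        raise ValueError("total_hours must be positive")
    if not 0 <= toil_hours <= total_hours:
        raise ValueError("toil_hours must be between 0 and total_hours")
    return toil_hours / total_hours


def within_toil_target(toil_hours: float, total_hours: float,
                       cap: float = 0.50) -> bool:
    """True when toil stays strictly below the cap (50% by default)."""
    return toil_fraction(toil_hours, total_hours) < cap
```

For example, a team that logs 70 toil hours out of 160 total is at about 44% and within the target, while 80 out of 160 reaches the 50% cap and is not.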
2. Observability — the three pillars
Logs
Concern
Guidance
Structured logging
JSON or key-value format; include trace ID, span ID, request ID, user ID (anonymized)