Verification Gates Scale With Autonomy
Best for: CTOs, CISOs, quality leaders, and platform owners who already accept that verification is the new bottleneck and need a concrete model for how proof requirements grow with agent scope.
Use outside Forge: High. The opening applies to any AI-assisted delivery program; Forge’s L1–L3 ladder is a worked example in a later section.
Summary
When agents produce more candidate change faster, the scarce resource is confidence—not generation. One under-discussed implication: the evidence bar must scale with the autonomy you claim. A run that may change one function needs a different proof package than a run that may touch logic, docs, UI, and E2E behavior—or a multi-repo feature.
Treating every unattended run as “run tests and merge” is how teams get fast breakage with a confident demo.
The verification mistake at scale
Many organizations adopted AI coding assistants by measuring usage: licenses, daily active users, commits attributed to AI. Fewer measured whether verification capacity scaled with scope of unattended change.
The result matches what industry surveys report: generation rises; trust and review do not keep pace. Verification becomes the drag—not because tests are obsolete, but because the unit of delivery got bigger without the gate getting bigger.
A simple scaling model
Think in four framework-agnostic layers of proof:
| Scope of unattended change | Minimum proof posture |
|---|---|
| Single contract-bound edit | Tests pass; risks reviewed; human approves merge |
| Multi-file change-set | Above + proof that multiple distinct files changed when the acceptance criteria (AC) require it, not one file dressed as many |
| Use-case slice | Above + cross-layer evidence and end-to-end verification recorded |
| Feature / product increment | Above + ADR and release gates; multi-repo CI where applicable |
Higher scope adds gates. It does not relax lower-level requirements.
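The cumulative rule can be sketched in a few lines of Python. This is an illustration only, not Forge's implementation: the `Scope` enum, the gate names, and both helper functions are hypothetical, and the conditional gates from the table ("when AC requires it", "where applicable") are simplified to unconditional ones.

```python
from enum import IntEnum


class Scope(IntEnum):
    """Scopes of unattended change, ordered from narrowest to widest."""
    SINGLE_EDIT = 1
    CHANGE_SET = 2
    USE_CASE_SLICE = 3
    FEATURE = 4


# Gates each scope adds on top of the scopes below it (hypothetical names).
ADDED_GATES = {
    Scope.SINGLE_EDIT: ["tests_pass", "risks_reviewed", "human_merge_approval"],
    Scope.CHANGE_SET: ["multiple_distinct_files_changed"],
    Scope.USE_CASE_SLICE: ["cross_layer_evidence", "e2e_recorded"],
    Scope.FEATURE: ["adr", "release_gates"],
}


def required_gates(scope: Scope) -> list[str]:
    """Return every gate required at `scope`, including all lower-scope gates."""
    gates: list[str] = []
    for level in Scope:  # enum members iterate in definition order
        if level <= scope:
            gates.extend(ADDED_GATES[level])
    return gates


def missing_gates(scope: Scope, evidence: set[str]) -> list[str]:
    """List the required gates that the supplied evidence does not cover."""
    return [g for g in required_gates(scope) if g not in evidence]
```

Because `required_gates` accumulates from the bottom up, a wider scope can only add to the list. There is no path by which claiming a higher scope drops a lower-level gate.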
Forge’s worked ladder (L1–L3 demonstrated)
Forge documents this as the L0–L8 execution ladder. The first three demonstrated levels map cleanly onto the scaling model:
L1 — Function. One method or contract-bound change. Core assay evidence: tests_pass, acceptance_criteria_met, risks_reviewed. Human gate: approve branch/merge. Real runs include zero-token deterministic fixes and single-file link repairs (bounded examples).
L2 — Change-set. Multi-file fix without rearchitecture. Assay adds ≥2 distinct changed files in the proof union when the run passes—a single-file patch cannot masquerade as L2.
L3 — Use-case slice. End-to-end flow in one app. Assay adds ≥2 layers, both .py and non-.py files, and E2E recorded in tests_run. Example 7 (lenses-production-l3-ci) is a green demonstrated run—not a table-only aspiration.
L4+ adds ADR, go/no-go, and strategic checkpoints. Those levels are defined in policy but remain a vision in implementation; they are not claimed as production-ready unattended delivery today.
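As a rough sketch, the L1–L3 assay conditions described above could be checked like this. The `RunEvidence` record, its field layout, the `"e2e"` marker string, and the `assay_failures` function are all assumptions made for illustration; only the evidence names `tests_pass`, `acceptance_criteria_met`, `risks_reviewed`, and `tests_run` come from the text, and Forge's actual assay may be structured differently.

```python
from dataclasses import dataclass, field
from pathlib import PurePosixPath


@dataclass
class RunEvidence:
    """Hypothetical evidence record for one unattended run."""
    tests_pass: bool
    acceptance_criteria_met: bool
    risks_reviewed: bool
    changed_files: list[str] = field(default_factory=list)
    layers: set[str] = field(default_factory=set)       # e.g. {"logic", "docs"}
    tests_run: list[str] = field(default_factory=list)  # e.g. ["unit", "e2e"]


def assay_failures(level: int, ev: RunEvidence) -> list[str]:
    """Return the unmet conditions for a claimed level 1-3; empty means pass."""
    if level not in (1, 2, 3):
        raise ValueError("this sketch only covers L1-L3")
    failures = []
    # L1 core evidence applies at every level.
    for name in ("tests_pass", "acceptance_criteria_met", "risks_reviewed"):
        if not getattr(ev, name):
            failures.append(name)
    distinct = set(ev.changed_files)  # duplicates do not count twice
    if level >= 2 and len(distinct) < 2:
        failures.append("needs >=2 distinct changed files")
    if level >= 3:
        if len(ev.layers) < 2:
            failures.append("needs >=2 layers")
        suffixes = {PurePosixPath(f).suffix for f in distinct}
        if ".py" not in suffixes or suffixes <= {".py"}:
            failures.append("needs both .py and non-.py files")
        if "e2e" not in ev.tests_run:  # assumed marker for a recorded E2E run
            failures.append("needs E2E recorded in tests_run")
    return failures
```

A single-file link repair passes at level 1 but fails at level 2 with the distinct-files condition, which is the "cannot masquerade as L2" behavior in miniature.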
Why this matters for security and compliance leaders
Security reviews often ask whether AI-generated code is “verified.” The better question is whether verification is sized to the autonomy claim:
- An L1 run that only fixes a broken link in `README.md` should not be audited as if it rewrote authentication.
- An L3 run that touches logic, docs, and UI should be audited for E2E and cross-layer proof, not only unit tests on one file.
Mis-sized gates create two failure modes: over-trust (small proof, large claim) and under-throughput (enterprise proof bar on every typo fix).
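The two failure modes reduce to a comparison between the level a run claims and the level of proof bar applied to it. The function below is a minimal sketch under that assumption; `gate_sizing` and its parameter names are hypothetical, and levels are treated as plain integers.

```python
def gate_sizing(claimed_level: int, gate_level: int) -> str:
    """Compare the autonomy level a run claims with the proof bar applied to it."""
    if gate_level < claimed_level:
        return "over-trust"        # small proof, large claim
    if gate_level > claimed_level:
        return "under-throughput"  # heavier proof bar than the claim needs
    return "right-sized"
```

For example, `gate_sizing(3, 1)` flags an L3 claim backed only by L1 proof, while `gate_sizing(1, 4)` flags a typo fix held to a feature-level bar.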
What we do not claim
- No substitute for domain-specific compliance — the ladder is engineering governance, not certification.
- No proof that L4–L8 runs are automated today — vision gates are documented; enforcement is not.
- No guarantee that green CI means market-ready — Assay Gate and release decisions remain human-owned in Forge prescriptive guidance.
Go deeper
- The New Bottleneck Is Verification, Not Coding — executive context
- Autonomy levels (policy)
- Platform autonomy hub — readiness matrix
- Bounded execution examples — L1–L3 evidence
Related: Autonomy Is Not a Switch · Governance Is Becoming a Performance Function