July 2, 2026

Verification Gates Scale With Autonomy

Best for: CTOs, CISOs, quality leaders, and platform owners who already accept that verification is the new bottleneck and need a concrete model for how proof requirements grow with agent scope.

Use outside Forge: High. The opening applies to any AI-assisted delivery program; Forge’s L1–L3 ladder is a worked example in the closing section.

Summary

When agents produce more candidate changes faster, the scarce resource is confidence, not generation. One under-discussed implication: the evidence bar must scale with the autonomy you claim. A run that may change one function needs a different proof package than a run that may touch logic, docs, UI, and E2E behavior, or one that delivers a multi-repo feature.

Treating every unattended run as “run tests and merge” is how teams get fast breakage with a confident demo.

The verification mistake at scale

Many organizations adopted AI coding assistants by measuring usage: licenses, daily active users, commits attributed to AI. Fewer measured whether verification capacity scaled with scope of unattended change.

The result matches what industry surveys report: generation rises; trust and review do not keep pace. Verification becomes the drag—not because tests are obsolete, but because the unit of delivery got bigger without the gate getting bigger.

A simple scaling model

Think in four cumulative tiers of proof, independent of any framework:

Scope of unattended change	Minimum proof posture
Single contract-bound edit	Tests pass; risks reviewed; human approves merge
Multi-file change-set	Above + proof that multiple distinct files changed when the acceptance criteria (AC) require it; one file dressed up as many does not count
Use-case slice	Above + cross-layer evidence and end-to-end verification recorded
Feature / product increment	Above + ADR and release gates; multi-repo CI where applicable

Higher scope adds gates. It does not relax lower-level requirements.
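The cumulative rule can be sketched in a few lines of Python. This is a minimal illustration, not a Forge API: the `Scope` enum, the gate labels, and both helper functions are hypothetical names chosen to mirror the table rows, and the conditional gates ("when AC requires it", "where applicable") are treated as always required for simplicity.

```python
from enum import IntEnum


class Scope(IntEnum):
    """Scope of unattended change, ordered from narrowest to widest."""
    SINGLE_EDIT = 1
    CHANGE_SET = 2
    USE_CASE_SLICE = 3
    FEATURE = 4


# Gates that each scope ADDS on top of the scopes below it.
# The labels are illustrative, not identifiers from any real tool.
GATES_ADDED = {
    Scope.SINGLE_EDIT: ["tests_pass", "risks_reviewed", "human_approves_merge"],
    Scope.CHANGE_SET: ["multiple_distinct_files_changed"],
    Scope.USE_CASE_SLICE: ["cross_layer_evidence", "e2e_recorded"],
    Scope.FEATURE: ["adr_recorded", "release_gate", "multi_repo_ci"],
}


def required_gates(scope: Scope) -> list[str]:
    """Return every gate for `scope`, including all lower-scope gates."""
    gates: list[str] = []
    for level in Scope:  # iterates in definition order, narrowest first
        if level <= scope:
            gates.extend(GATES_ADDED[level])
    return gates


def missing_gates(scope: Scope, evidence: set[str]) -> list[str]:
    """List the required gates that the supplied evidence does not cover."""
    return [gate for gate in required_gates(scope) if gate not in evidence]
```

Because `required_gates` accumulates from the narrowest scope upward, a change-set run that supplies only the single-edit evidence is reported as missing `multiple_distinct_files_changed`, and no higher scope can drop a lower-level gate.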

Forge’s worked ladder (L1–L3 demonstrated)

Forge documents this as the L0–L8 execution ladder. The first three demonstrated levels map cleanly onto the scaling model:

L1 — Function. One method or contract-bound change. Core assay evidence: tests_pass, acceptance_criteria_met, risks_reviewed. Human gate: approve branch/merge. Real runs include zero-token deterministic fixes and single-file link repairs (bounded examples).

L2 — Change-set. Multi-file fix without rearchitecture. Assay adds ≥2 distinct changed files in the proof union when the run passes—a single-file patch cannot masquerade as L2.

L3 — Use-case slice. End-to-end flow in one app. Assay adds ≥2 layers, both .py and non-.py files, and E2E recorded in tests_run. Example 7 (lenses-production-l3-ci) is a green demonstrated run—not a table-only aspiration.

L4+ adds ADR, go/no-go, and strategic checkpoints. Those levels are defined in policy but remain at the vision stage in implementation; they are not claimed as production-ready unattended delivery today.
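To make the L1–L3 evidence rules concrete, here is a rough Python sketch of how such checks might be evaluated. It is not Forge's implementation. The field names `tests_pass`, `acceptance_criteria_met`, `risks_reviewed`, and `tests_run` come from the text above; the `RunEvidence` class, the `changed_files` mapping with its layer labels, and the rule that an E2E run is detected by the substring "e2e" in a test name are all assumptions made for illustration.

```python
from dataclasses import dataclass, field
from pathlib import PurePosixPath
from typing import Optional


@dataclass
class RunEvidence:
    """Hypothetical evidence record for one unattended run."""
    tests_pass: bool
    acceptance_criteria_met: bool
    risks_reviewed: bool
    # Assumed shape: changed file path -> layer label (e.g. "logic", "docs").
    changed_files: dict[str, str] = field(default_factory=dict)
    tests_run: list[str] = field(default_factory=list)


def highest_level_supported(ev: RunEvidence) -> Optional[int]:
    """Return the highest of L1-L3 the evidence supports, or None."""
    # Core evidence is required at every level.
    if not (ev.tests_pass and ev.acceptance_criteria_met and ev.risks_reviewed):
        return None
    level = 1

    # L2: at least two distinct changed files (dict keys are distinct paths).
    if len(ev.changed_files) < 2:
        return level
    level = 2

    # L3: at least two layers, both .py and non-.py files, and E2E recorded.
    layers = set(ev.changed_files.values())
    suffixes = {PurePosixPath(path).suffix for path in ev.changed_files}
    has_py = ".py" in suffixes
    has_non_py = any(suffix != ".py" for suffix in suffixes)
    # Assumption: an E2E test is identified by "e2e" in its recorded name.
    e2e_recorded = any("e2e" in name.lower() for name in ev.tests_run)
    if len(layers) >= 2 and has_py and has_non_py and e2e_recorded:
        level = 3
    return level
```

The function only ever raises the level after the lower-level checks have passed, so a single-file patch stops at L1 no matter what else it reports.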

Why this matters for security and compliance leaders

Security reviews often ask whether AI-generated code is “verified.” The better question is whether verification is sized to the autonomy claim:

An L1 run that only fixes a broken link in README.md should not be audited as if it rewrote authentication.
An L3 run that touches logic, docs, and UI should be audited for E2E and cross-layer proof—not only unit tests on one file.

Mis-sized gates create two failure modes: over-trust (small proof, large claim) and under-throughput (enterprise proof bar on every typo fix).
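The two failure modes can be expressed as a comparison between the scope a run is allowed to change and the proof bar actually applied to it. The sketch below is illustrative only: `gate_sizing` and its integer levels are hypothetical, with higher numbers meaning wider scope or a heavier proof bar.

```python
def gate_sizing(scope_level: int, gate_level: int) -> str:
    """Classify how well the applied proof bar fits the autonomy claim.

    scope_level: how much the run is allowed to change (hypothetical scale).
    gate_level:  how heavy the proof bar applied to the run is (same scale).
    """
    if gate_level < scope_level:
        return "over-trust"        # small proof, large claim
    if gate_level > scope_level:
        return "under-throughput"  # heavy proof bar on a small change
    return "right-sized"
```

For example, `gate_sizing(3, 1)` returns "over-trust" (a use-case slice checked only with single-edit proof), while `gate_sizing(1, 4)` returns "under-throughput" (a typo fix held to a feature-level bar).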

What we do not claim

No substitute for domain-specific compliance — the ladder is engineering governance, not certification.
No proof that L4–L8 runs are automated today — vision gates are documented; enforcement is not.
No guarantee that green CI means market-ready — Assay Gate and release decisions remain human-owned in Forge prescriptive guidance.

Go deeper

The New Bottleneck Is Verification, Not Coding — executive context
Autonomy levels (policy)
Platform autonomy hub — readiness matrix
Bounded execution examples — L1–L3 evidence
