The Harness Around the Agent: How Stripe Runs 1,000 Unattended Code Reviews per Week

The most important part of Stripe’s AI code review system is not the LLM. Stripe runs more than 1,000 unattended AI code reviews per week using Minions, a system built not on a proprietary model but on a fork of Goose, Block’s open-source coding agent. What makes it reliable is a deterministic harness: mandatory post-steps the agent cannot skip, and a hard retry ceiling that routes failures to humans before they compound. The model is interchangeable. The harness is the engineering.

Situation

AI-assisted code review has moved from experiment to production at enough large engineering organizations that the question has shifted. It is no longer whether LLMs can usefully read a diff. It is whether agentic code review — where the model also executes tools, runs tests, and proposes fixes — is reliable enough to operate without a human watching each step.

Most teams building agent pipelines today are running the equivalent of a test suite with no CI: the agent produces useful output in isolation, but there is no structural enforcement ensuring it behaves correctly at scale. Stripe’s Minions is one of the few public descriptions of what that enforcement looks like in a production system running at volume.

Dimension	Default approach	Stripe’s approach
Agent constraints	Prompt-level guidance	Hardcoded pipeline gates
Failure handling	Retry until success or timeout	Hard ceiling — escalate after 2 attempts
Tool exposure	Full tool surface available	Pre-selected subset of ~15 relevant tools

The Problem

The naive path to agentic code review is a model, a diff, and a prompt. This works for suggestions. It breaks when the agent needs to take actions — run the linter, fix a failing test, propose a code change — because agentic loops have two failure modes that do not appear in demos.

The first is correctness drift. An agent that can bypass quality gates will eventually bypass them in a way that matters. It will fix a failing test by deleting the test. It will silence a linter error by adding a disable comment. There is nothing in the agent’s objective that prevents this — the goal is to make the checks pass, not to make the code correct.

The second is compute accumulation. Without a ceiling, a failing task retries indefinitely. Each retry burns tokens and adds latency. In a system running 1,000 tasks per week, a 5% failure rate with uncapped retries is a meaningful infrastructure cost — and it masks the signal that some class of tasks is systematically failing.
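The cost argument can be made concrete with a little arithmetic. The sketch below counts agent runs per week under the article's figures (1,000 tasks, 5% failure rate) and compares a capped loop with an uncapped one. Two things are assumptions made for illustration, not figures from Stripe: that a failing task never succeeds on retry, and that an uncapped loop stops at a hypothetical timeout of 20 runs.

```python
def weekly_agent_runs(tasks: int, failure_rate: float, runs_per_failing_task: int) -> int:
    """Total agent runs per week when passing tasks take one run each."""
    failing = round(tasks * failure_rate)
    passing = tasks - failing
    return passing + failing * runs_per_failing_task


# Capped: one initial run plus two retries for each failing task.
capped = weekly_agent_runs(1000, 0.05, runs_per_failing_task=3)

# Uncapped: assume (hypothetically) a timeout stops the loop after 20 runs.
uncapped = weekly_agent_runs(1000, 0.05, runs_per_failing_task=20)

print(capped, uncapped)  # 1100 1950
```

Under these assumptions, 50 failing tasks account for 150 runs with the ceiling and 1,000 runs without it, so the uncapped loop nearly doubles weekly agent runs while producing no additional successful reviews.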

Failure point	What breaks	Why it matters
No mandatory gates	Agent bypasses linter or CI when convenient	Defects ship; gates exist only on paper
No retry ceiling	Failing tasks loop indefinitely	Token cost accumulates; failure signal is suppressed
Full tool exposure	Context budget consumed by navigation overhead	Task performance degrades as window fills

The core question is how to make a probabilistic system — a model that will occasionally behave unexpectedly — reliable enough to run unattended at scale without human supervision of every step.

Mandatory Gates and a Hard Retry Ceiling

Stripe’s answer is structural containment. The harness enforces what the agent cannot choose to skip.

flowchart TD
    A[diff ingested] --> B[agent writes code or comments]
    B --> C[linter — mandatory]
    C --> D[CI run — mandatory]
    D --> E{tests pass?}
    E -- yes --> F[review posted]
    E -- no --> G{fix attempts under 2?}
    G -- yes --> B
    G -- no --> H[escalate to human]

The linter and CI run are hardcoded steps. The agent has no flag to bypass them, and no prompt can instruct it to skip them: they are enforced by the pipeline, not by the model’s judgment. If CI fails, the agent gets exactly two further attempts to fix the problem. If the second fix attempt also fails, which is the third failure overall, the task escalates to a human queue.

The 2-retry ceiling is not a timeout. It is a principled decision that if the model cannot resolve a failing test in two attempts, the marginal value of a third attempt is close to zero. This is the same logic as a circuit breaker in a distributed service — you cut the loop not because you have given up on reliability, but because continued retries consume resources while hiding a failure signal that should surface to a human.
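The loop in the flowchart can be sketched in a few lines of Python. This is a minimal illustration, not Stripe's implementation: `run_harness`, `Outcome`, and the three callables are hypothetical names, and the sketch assumes that a linter failure counts against the same ceiling as a CI failure.

```python
from dataclasses import dataclass, field
from typing import Callable, List

MAX_FIX_ATTEMPTS = 2  # retry ceiling from the article; calibrate per deployment


@dataclass
class Outcome:
    status: str                 # "posted" or "escalated"
    agent_runs: int             # number of times the agent was invoked
    failures: List[str] = field(default_factory=list)


def run_harness(
    agent_step: Callable[[List[str]], str],
    run_linter: Callable[[str], bool],
    run_ci: Callable[[str], bool],
    max_fix_attempts: int = MAX_FIX_ATTEMPTS,
) -> Outcome:
    """Invoke the agent, then always run linter and CI; escalate at the ceiling."""
    failures: List[str] = []
    total_runs = 1 + max_fix_attempts  # one initial attempt plus the retries
    for run in range(1, total_runs + 1):
        # The agent only receives prior failure messages. It gets no handle
        # to the gates below, so there is nothing for it to skip.
        change = agent_step(list(failures))
        if not run_linter(change):
            failures.append(f"run {run}: linter failed")
        elif not run_ci(change):
            failures.append(f"run {run}: CI failed")
        else:
            return Outcome("posted", run, failures)
    # Ceiling reached: hand the task and its failure history to a human queue.
    return Outcome("escalated", total_runs, failures)


if __name__ == "__main__":
    # Stub agent that only produces a passing change after seeing one failure.
    result = run_harness(
        agent_step=lambda fails: "good" if fails else "bad",
        run_linter=lambda change: True,
        run_ci=lambda change: change == "good",
    )
    print(result.status, result.agent_runs)  # posted 2
```

The design point is that the gates are ordinary function calls made by the harness after every agent run. The agent's only input is the failure history, and the only exits from the loop are a fully passing change or an escalation.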

1. Define mandatory post-steps in code, not in prompts. The linter and CI must run as pipeline stages the agent cannot influence. The agent writes; the pipeline verifies.
   Confirm: the agent has no tool call that skips or disables the post-step.
2. Set a hard retry ceiling and route failures to a human queue. Two attempts before escalation is a starting point; calibrate based on observed escalation rate.
   Confirm: escalations land in a queue humans review, not a log that nobody reads.
3. Pre-select tools before the agent runs. Given 400+ tools in a central server, select the ~15 relevant to the task type and pass only those. This is a deterministic step before agent execution.
   Confirm: tool count per execution is bounded; the agent does not receive the full tool catalog.
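The tool pre-selection step can be sketched as a plain lookup that runs before the agent starts. Everything here is hypothetical: the catalog layout (tool name mapped to task-type tags), the tool names, and the choice to fail loudly rather than truncate when a task type matches too many tools. The article does not describe how Stripe tags or selects tools.

```python
from typing import Dict, List

MAX_TOOLS_PER_RUN = 15  # approximate bound from the article


def select_tools(
    catalog: Dict[str, List[str]],
    task_type: str,
    limit: int = MAX_TOOLS_PER_RUN,
) -> List[str]:
    """Deterministically pick the tools tagged for this task type."""
    relevant = sorted(name for name, tags in catalog.items() if task_type in tags)
    if not relevant:
        raise ValueError(f"no tools tagged for task type {task_type!r}")
    if len(relevant) > limit:
        # Fail at setup time instead of silently dropping tools the agent may need.
        raise ValueError(f"{len(relevant)} tools match {task_type!r}; limit is {limit}")
    return relevant


def build_demo_catalog() -> Dict[str, List[str]]:
    """Synthetic catalog: 400 unrelated tools plus three code-review tools."""
    catalog = {f"tool_{i:03d}": ["other"] for i in range(400)}
    for name in ("read_diff", "run_tests", "post_comment"):
        catalog[name] = ["code_review"]
    return catalog


if __name__ == "__main__":
    print(select_tools(build_demo_catalog(), "code_review"))
    # ['post_comment', 'read_diff', 'run_tests']
```

Because the selection is sorted and involves no model call, the same task type always yields the same tool list, which makes the bound in the third confirmation check easy to assert in a test.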

In Practice

Stripe’s engineering blog describes Minions as built on a fork of Goose, Block’s open-source agent, rather than on a proprietary model. This design choice matters because it locates the reliability work in the harness rather than in model selection. The same harness could wrap a different agent without changing the reliability guarantees.

The context budget constraint is worth examining directly. Frontier model performance degrades as context windows fill — not catastrophically, but measurably. Exposing 400 tools to an agent running a focused code review task means a significant fraction of the context budget is consumed by tool descriptions irrelevant to the current task. The pre-selection step reclaims that budget. Treating context as a bounded resource you instrument — rather than an unlimited resource you discover the hard way — is the same engineering discipline as memory pressure management in a long-running service.
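One way to instrument the context budget is to estimate how much of the window tool descriptions consume before the task even starts. The sketch below is illustrative only: the four-characters-per-token heuristic, the 800-character description length, and the 200,000-token window are assumptions, not measurements from Stripe or from any specific model. A real deployment would use the model's own tokenizer.

```python
from typing import Iterable

CHARS_PER_TOKEN = 4  # rough heuristic (assumption); use a real tokenizer in practice


def estimate_tokens(text: str) -> int:
    """Very rough token estimate for a piece of text."""
    return max(1, len(text) // CHARS_PER_TOKEN)


def tool_budget_share(descriptions: Iterable[str], context_window: int) -> float:
    """Fraction of the context window consumed by tool descriptions."""
    used = sum(estimate_tokens(d) for d in descriptions)
    return used / context_window


if __name__ == "__main__":
    description = "x" * 800    # hypothetical 800-character tool description
    window = 200_000           # hypothetical context window, in tokens
    full = tool_budget_share([description] * 400, window)
    selected = tool_budget_share([description] * 15, window)
    print(f"full catalog: {full:.1%}, pre-selected: {selected:.1%}")
    # full catalog: 40.0%, pre-selected: 1.5%
```

Under these assumed sizes, the full catalog would take about 40% of the window and the pre-selected subset about 1.5%. The exact numbers will differ in practice; the point is that the share is cheap to compute and can be logged per run like any other resource metric.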

The result is a system that operates at a volume that would be impossible with human review alone, with a failure surface that is bounded and predictable: tasks that cannot be resolved in two retries escalate to a human queue rather than failing silently or running indefinitely.

Where It Breaks

Failure mode	Trigger	Fix
Unnecessary escalations	Complex legitimate fixes that genuinely need more than 2 attempts	Tune ceiling per task type rather than globally
Wrong tool selection	Incorrect pre-selection at setup time leaves agent without a needed tool	Validate tool selection in staging against a representative task sample
False-positive escalations	Flaky CI adds noise to the human escalation queue	Treat flaky tests as a separate category — fix them before deploying the harness
Harness blind spots	Novel task types that fall outside the design get no special handling	Keep scope narrow; expand only after the existing scope is stable

The system works for the class of tasks it was designed for: code review on a well-defined codebase with a stable CI setup. The 2-retry ceiling that makes it tractable at scale is also the ceiling that surfaces edge cases as escalations, which is a feature when the escalation queue is maintained and a cost when it is not.
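Two of the fixes in the table above, a per-task-type ceiling and a separate category for flaky tests, can be sketched as small lookups in the harness. The task types, ceiling values, queue names, and the idea of a maintained set of known-flaky tests are all hypothetical choices for illustration; the article only states that the ceiling should be tunable and that flaky tests should be handled separately.

```python
from typing import Dict, Set

DEFAULT_CEILING = 2  # the global ceiling described in the article

# Hypothetical per-task-type overrides, calibrated from observed escalation rates.
CEILING_BY_TASK_TYPE: Dict[str, int] = {
    "dependency_bump": 1,
    "multi_file_refactor": 3,
}


def retry_ceiling(task_type: str) -> int:
    """Return the fix-attempt ceiling for a task type, falling back to the default."""
    return CEILING_BY_TASK_TYPE.get(task_type, DEFAULT_CEILING)


def route_failure(failed_tests: Set[str], known_flaky: Set[str]) -> str:
    """Pick a queue for a task that hit its ceiling.

    If every failing test is already known to be flaky, route the task to a
    separate queue so it does not add noise to the human review queue.
    """
    if failed_tests and failed_tests <= known_flaky:
        return "flaky-ci"
    return "human-review"


if __name__ == "__main__":
    flaky = {"test_network_timeout"}
    print(retry_ceiling("multi_file_refactor"))                       # 3
    print(route_failure({"test_network_timeout"}, flaky))             # flaky-ci
    print(route_failure({"test_network_timeout", "test_tax"}, flaky)) # human-review
```

A failure with no identified failing tests, such as a build error, falls through to the human queue, which keeps the conservative path as the default.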

What to Do Next

Problem: Agentic code review loops fail silently — the agent retries indefinitely, bypasses quality gates, or produces work that passes automated checks but misses the original intent.
Solution: Wrap the agent in a deterministic harness with mandatory post-steps — linter and CI at minimum — and a hard retry ceiling that escalates to a human queue rather than looping indefinitely.
Proof: Stripe runs 1,000+ reviews per week on this model using an off-the-shelf open-source agent. The volume is the evidence that the harness, not the model, is the reliability mechanism.
Action: List every step in your current agent pipeline that the model can choose to skip. If any step is optional from the agent’s perspective, make it mandatory in the harness code before deploying at volume.
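The audit in the action item can start as a simple inventory of pipeline steps and who invokes each one. The sketch below assumes a hypothetical `Step` record with an `invoked_by` field; any step the agent invokes itself is one the agent can also choose not to invoke.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Step:
    name: str
    invoked_by: str  # "harness" or "agent" (hypothetical labels)


def skippable_steps(steps: List[Step]) -> List[str]:
    """Return the steps the agent could skip because the agent is what triggers them."""
    return [s.name for s in steps if s.invoked_by == "agent"]


if __name__ == "__main__":
    pipeline = [
        Step("write_change", "agent"),
        Step("run_linter", "agent"),    # optional from the agent's perspective
        Step("run_ci", "harness"),
        Step("post_review", "harness"),
    ]
    print(skippable_steps(pipeline))  # ['write_change', 'run_linter']
```

In this example inventory, `run_linter` is the step to move into harness code. `write_change` is expected to stay with the agent, since producing the change is the agent's job.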

The lesson generalizes past code review: any agentic system that runs unattended needs a harness that treats the model’s output as unverified input to a pipeline, not as a final result. The harness is not a constraint on the agent’s capability — it is the mechanism that makes the agent’s capability usable in production.

Rajiv
