AI Agents Need a Control Plane, Not More Interfaces
AI agent platforms are converging on one useful primitive: a strong coding model operating inside a governed execution environment. The default approach is fragmented agent interfaces: one chat for coding, another for browser work, another for documents, another for scheduled jobs. The better alternative is an agent control plane: one permissioned runtime for files, tools, browsers, code repositories, and business artifacts.
Situation
The 2024 agent race looks noisy because every vendor is shipping new surfaces: OpenAI Codex, Claude Code, Cursor, OpenClaw, browser use, computer use, schedules, routines, dispatch, remote runs, and workflow-specific applications. Underneath the product sprawl, the architecture is becoming boring in the best possible way.
A coding model is no longer just a code generator. It is a general-purpose knowledge-work engine because code, SQL, spreadsheets, documents, slide decks, test traces, and browser sessions all reduce to structured artifacts plus tool calls.
| | Fragmented agent interfaces | Agent control plane |
|---|---|---|
| User experience | Different apps for code, docs, browser, schedules | Task-specific views over one runtime |
| Permissions | Repeated per tool | Central policy and approval gates |
| Observability | Scattered transcripts | One audit log across actions |
| Failure recovery | Manual reconstruction | Replayable job history and artifact diffs |
| Best fit | Individual experimentation | Production teams and regulated workflows |
The Problem
The failure is not that teams have too many chat boxes. The failure is that each chat box becomes a separate execution path with its own credentials, logs, filesystem assumptions, and review model. That is how a harmless “summarize this dashboard” workflow quietly becomes an unreviewed production automation path.
| Failure point | What breaks | Why it matters |
|---|---|---|
| Filesystem access | Agent edits repo, docs, and generated artifacts without a durable diff model | Incident response cannot prove what changed, when, or why |
| Browser use | Agent clicks through admin.internal.example.com like a human with no replay trace | “It submitted the form” is not an audit strategy |
| Scheduled jobs | Routines, remote runs, and dispatch execute the same primitive through different paths | Policy drift appears before anyone notices |
| Model routing | Frontier model handles one task, open model handles another, with no shared contract | Cost drops, but behavior becomes inconsistent |
| Tool-specific UX | Codex, Claude Code, Cursor, Warp, and internal tools all keep separate context | Engineers spend time reconciling agent state instead of reviewing output |
Modern models infer nuance, fix typos, and handle vague intent better than skeptics expected. The production problem is different: autonomous agents still make expensive assumptions when the system does not define when they must ask for clarification. The design question is how to govern agent execution paths so that exploratory work cannot reach production without review.
Core Concept
The right architecture is an agent control plane: a single job model that routes requests into governed sandboxes, grants scoped tools, captures artifacts, and requires human approval at the boundary where risk changes.
flowchart TD
User[senior engineer] --> Intake[agent control plane — task intake]
Intake --> Classifier[classify — code, sql, browser, doc, schedule]
Classifier --> Policy[RBAC policy and approval rules]
Policy --> Sandbox[ephemeral workspace — repo checkout]
Sandbox --> Model[strong coding model]
Model --> FS[filesystem diff]
Model --> Browser[browser use or Playwright]
Model --> SQL[read-only PostgreSQL replica]
Model --> Docs[docs and spreadsheets]
FS --> Review[diff and artifact review]
Browser --> Replay[browser trace and screenshots]
SQL --> Evidence[query results and explain plans]
Docs --> Review
Review --> Approval[human approval gate]
Replay --> Approval
Evidence --> Approval
Approval --> Publish[merge, deploy, or schedule]
Publish --> Audit[immutable audit log]
- Define one job schema for every agent task.
{
  "job_type": "browser_automation",
  "repo": "payments-api",
  "tools": ["filesystem", "browser", "playwright"],
  "approval_required_for": ["submit", "delete", "purchase"],
  "artifact_contract": "diff_plus_trace"
}
Verify: every task produces the same minimum record: prompt, tools granted, artifacts created, approvals requested, and final state.
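One way to enforce that schema at intake is a small validator that rejects any job missing a field or requesting an unlisted tool. This is a minimal sketch, not a reference implementation: the field names come from the JSON above, while `ALLOWED_TOOLS`, `ARTIFACT_CONTRACTS`, `Job`, and `parse_job` are hypothetical names chosen for illustration.

```python
from dataclasses import dataclass

# Hypothetical allow-lists; a real deployment would load these from policy.
ALLOWED_TOOLS = {"filesystem", "browser", "playwright", "sql_readonly", "docs"}
ARTIFACT_CONTRACTS = {"diff_plus_trace", "diff_only", "trace_only"}
REQUIRED_FIELDS = (
    "job_type", "repo", "tools", "approval_required_for", "artifact_contract",
)


@dataclass(frozen=True)
class Job:
    job_type: str
    repo: str
    tools: tuple[str, ...]
    approval_required_for: tuple[str, ...]
    artifact_contract: str


def parse_job(raw: dict) -> Job:
    """Validate a raw job dict and return an immutable Job, or raise ValueError."""
    missing = [key for key in REQUIRED_FIELDS if key not in raw]
    if missing:
        raise ValueError(f"missing fields: {missing}")
    unknown = set(raw["tools"]) - ALLOWED_TOOLS
    if unknown:
        raise ValueError(f"unknown tools: {sorted(unknown)}")
    if raw["artifact_contract"] not in ARTIFACT_CONTRACTS:
        raise ValueError(f"unknown artifact contract: {raw['artifact_contract']}")
    return Job(
        job_type=raw["job_type"],
        repo=raw["repo"],
        tools=tuple(raw["tools"]),
        approval_required_for=tuple(raw["approval_required_for"]),
        artifact_contract=raw["artifact_contract"],
    )
```

Freezing the dataclass is a deliberate choice: once a job is accepted, nothing downstream can quietly widen its tool list.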
- Treat browser and computer use as privileged automation.
Native browser control is useful for exploratory debugging. Playwright is better for repeatable checks in continuous integration, where automated tests run on every code change. Agentic browser use belongs between those modes: flexible enough to inspect unknown pages, constrained enough to produce screenshots, traces, and approval pauses.
Verify: any action that mutates data must have a replayable trace and a human approval checkpoint.
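The approval checkpoint can be sketched as a thin wrapper that records every browser action and lets mutating ones run only after a human says yes. Everything here is illustrative: `BrowserSession`, the `approve` callback, and the trace layout are assumptions, and no real browser is driven.

```python
from dataclasses import dataclass, field
from typing import Callable

# Mirrors the approval_required_for list in the job schema.
MUTATING_ACTIONS = {"submit", "delete", "purchase"}


@dataclass
class BrowserSession:
    # Hypothetical hook that asks a human; returns True when approved.
    approve: Callable[[str, str], bool]
    trace: list[dict] = field(default_factory=list)

    def act(self, action: str, target: str) -> bool:
        """Record the action; mutating actions execute only if approved."""
        needs_approval = action in MUTATING_ACTIONS
        executed = self.approve(action, target) if needs_approval else True
        self.trace.append({
            "action": action,
            "target": target,
            "needs_approval": needs_approval,
            "executed": executed,
        })
        return executed
```

With `approve=lambda action, target: False`, a `click` still goes through while a `submit` is blocked, and both appear in `trace`, so a denied action leaves evidence too.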
- Separate interaction layer from execution layer.
Warp, Cursor, Codex, Claude Code, and internal portals can all be front doors. They should not each invent a different security model. The execution layer owns sandboxing, credentials, logging, and rollback.
Verify: the same policy applies whether the task starts from a terminal, browser, chat panel, or scheduled job.
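A simple way to keep front doors from inventing their own rules is to key policy on the task class and treat the entry point as metadata only. The sketch below assumes a hypothetical `POLICY` table and `authorize` function; the job types and tool names are placeholders.

```python
# Hypothetical policy table keyed by task class, never by entry point.
POLICY = {
    "browser_automation": {"tools": {"filesystem", "browser", "playwright"}},
    "sql_diagnostics": {"tools": {"sql_readonly"}},
}


def authorize(entry_point: str, job_type: str, requested_tools: set[str]) -> dict:
    """Return a grant. The entry point is recorded but never consulted."""
    policy = POLICY.get(job_type)
    if policy is None:
        raise PermissionError(f"no policy for job type {job_type!r}")
    denied = requested_tools - policy["tools"]
    if denied:
        raise PermissionError(f"tools not permitted: {sorted(denied)}")
    return {
        "entry_point": entry_point,
        "job_type": job_type,
        "tools": sorted(requested_tools),
    }
```

Because `entry_point` never appears in a condition, a job started from a terminal and the same job started by a scheduler receive identical grants.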
- Route models by risk, not fashion.
Frontier hosted models should handle ambiguous architecture changes, production debugging, and multi-artifact work. Smaller open models can handle scaffolding, search, formatting, and low-risk refactors. The control plane decides based on task class, data sensitivity, latency, and cost.
Verify: model choice is visible in the audit log and tied to an explicit task policy.
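Risk-based routing can be sketched as a lookup that returns both the chosen tier and the rule that chose it, so the decision can be written to the audit log. The task classes echo the examples above; the tier names, `RoutingDecision`, and the fallback for unknown classes are assumptions. Data sensitivity, latency, and cost are left out to keep the sketch short.

```python
from dataclasses import dataclass

LOW_RISK_TASKS = {"scaffolding", "search", "formatting", "low_risk_refactor"}
HIGH_RISK_TASKS = {"architecture_change", "production_debugging", "multi_artifact"}


@dataclass(frozen=True)
class RoutingDecision:
    task_class: str
    model_tier: str
    policy_rule: str  # stored with the job so the choice is auditable


def route(task_class: str) -> RoutingDecision:
    """Choose a model tier from the task class alone."""
    if task_class in LOW_RISK_TASKS:
        return RoutingDecision(task_class, "small_open", "low_risk_tasks")
    if task_class in HIGH_RISK_TASKS:
        return RoutingDecision(task_class, "frontier_hosted", "high_risk_tasks")
    # Assumption: unclassified work falls back to the stronger tier.
    return RoutingDecision(task_class, "frontier_hosted", "default_unclassified")
```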
In Practice
Context: A unified control plane is the pattern that fits agent deployment in shared environments. Once more than one engineer uses autonomous agents against shared infrastructure, the primary operational question stops being “which agent is best” and becomes “who approved this action and what exactly did it change.”
Action: The minimum viable control plane for a small team relies on three invariant components: a job schema (what the agent may read, write, and call per task), an immutable record per run (prompt, tools granted, artifacts produced, approval decisions), and a strict policy for when the agent must ask for clarification before proceeding. SQL diagnostics should be restricted to read-only PostgreSQL replicas and statistics views such as pg_stat_statements (provided by the extension of the same name), rather than production write connections. Browser actions on internal admin consoles require a human approval checkpoint before any submit or delete event. Everything else (model routing, sandboxed worktrees, artifact diffs) extends from those constraints.
Result: The first measurable gain is provenance, not speed. Debugging an agent-assisted system change becomes tractable because the immutable job record reliably answers the core operational questions: what the prompt was, which files were modified, which tools were called, and whether a human checkpoint was triggered before production state changed.
Learning: Vertical vendor stacks (e.g., Google AI Studio to Cloud Run, or Vercel’s v0 to production) are excellent when deployment friction is the primary bottleneck. The engineering tradeoff is architectural portability. A modular control plane costs more to build initially, but it ensures that model choice, system observability, and RBAC policy enforcement do not degrade into vendor-specific configuration understood by only one person on the team.
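The immutable per-run record does not need special infrastructure to prototype. One possible approach, which is an assumption here rather than anything the pattern prescribes, is an append-only log where each entry stores a SHA-256 hash of the previous entry, so later edits become detectable. `AuditLog` and its record fields are hypothetical.

```python
import hashlib
import json

GENESIS = "0" * 64  # placeholder hash that precedes the first entry


class AuditLog:
    """Append-only log; each entry commits to the hash of the one before it."""

    def __init__(self) -> None:
        self._entries: list[dict] = []

    def append(self, record: dict) -> str:
        prev_hash = self._entries[-1]["hash"] if self._entries else GENESIS
        body = json.dumps(record, sort_keys=True)
        entry_hash = hashlib.sha256((prev_hash + body).encode()).hexdigest()
        self._entries.append({"prev": prev_hash, "body": body, "hash": entry_hash})
        return entry_hash

    def verify(self) -> bool:
        """Recompute the chain; any edited or reordered entry breaks it."""
        prev_hash = GENESIS
        for entry in self._entries:
            expected = hashlib.sha256((prev_hash + entry["body"]).encode()).hexdigest()
            if entry["prev"] != prev_hash or entry["hash"] != expected:
                return False
            prev_hash = entry["hash"]
        return True
```

Hash chaining only makes tampering visible. Preventing it still requires write-once storage or an external copy of the latest hash.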
Where It Breaks
| Failure mode | Trigger | Fix |
|---|---|---|
| Audit gaps | Agent has broad filesystem or browser access but only saves chat history | Store immutable job records, diffs, traces, screenshots, and approval decisions |
| False confidence | Evaluation checks only “task completed” | Add evals for permission adherence, rollback quality, artifact correctness, latency, and cost |
| Browser flakiness | Agent relies on visual clicking for a stable workflow | Convert repeated paths to Playwright tests with assertions and traces |
| Cost shock | Frontier models are used for every low-risk edit | Route simple tasks to cheaper hosted or open models with the same output contract |
| Permission drift | Schedules, routines, and remote jobs use separate configuration | Collapse them into one scheduler with shared policy |
| Bad assumptions | Agent proceeds when intent is underspecified | Require clarification when confidence is low or mutation risk is high |
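The last row of the table can be expressed as a single predicate. This sketch assumes the caller already has a confidence score and a mutation-risk score in the range 0 to 1. How those scores are produced is out of scope, and both thresholds are hypothetical values to tune per team.

```python
CONFIDENCE_FLOOR = 0.8        # hypothetical: below this, the agent must ask
MUTATION_RISK_CEILING = 0.5   # hypothetical: at or above this, the agent must ask


def must_clarify(confidence: float, mutation_risk: float) -> bool:
    """True when the agent should stop and ask instead of proceeding."""
    for name, value in (("confidence", confidence), ("mutation_risk", mutation_risk)):
        if not 0.0 <= value <= 1.0:
            raise ValueError(f"{name} must be in [0, 1], got {value}")
    return confidence < CONFIDENCE_FLOOR or mutation_risk >= MUTATION_RISK_CEILING
```

The two conditions are joined with `or` on purpose: a confident agent about to make a risky change still has to stop.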
What to Do Next
- Problem: agent tools are multiplying faster than teams can govern them.
- Solution: build one agent control plane for code, files, browser actions, SQL analysis, documents, and scheduled jobs.
- Proof: the same review model can cover a code diff, a browser trace, and a generated spreadsheet.
- Action: this week, define your internal agent job schema with filesystem scope, network scope, browser domains, credentials, approval gates, logging, rollback, and artifact review.