Allowing agents to make decisions with insufficient policy, fuzzy scope, and absent audit trails fundamentally undermines our trust infrastructure. These are, by definition, major trust breaches. I laid out this problem in Part 1 with real exploits, real blast radii, and a pattern that keeps repeating: the industry ships agent capabilities faster than it ships agent security. The instinct, understandably, is to put a team in the loop, add a review gate, require approval before the agent touches anything sensitive. Whilst well-intentioned, that instinct often focuses on the wrong failure mode, and at agent-speed output volumes, it quietly degrades into security theater.

The review bottleneck

When we scaled our agentic coding workflows, our review process didn’t break dramatically; it degraded quietly. Reviewers started pattern-matching instead of reading. Approvals came faster as queue pressure trained teams to skim. The review gate was still there, still green-lighting every merge, but the rigor behind it had hollowed out. We ended up with unreviewed code in production wrapped in the comforting fiction of a process that was followed.

We hit this wall ourselves within weeks. The agents were producing correct, functional code at a pace that made traditional pull request review absurd. Our senior engineers were spending their days reading code that was, in most cases, fine, whilst the cases that weren’t fine looked fine too, because the patterns were unfamiliar.

Agents don’t make the same mistakes engineers make; they make mistakes engineers aren’t trained to spot.

The principle we landed on is to match the reviewer to the speed and nature of the code: AI-generated code needs AI-generated adversaries.

Adversarial review at agent speed

The concept is simple in principle and tricky in execution. A separate agent, with a different system prompt, different context, and no access to the original implementation instructions, receives the code and tries to break it. What matters is whether the code fails safely under conditions the implementation agent never considered, not whether it compiles and passes its own tests.

We structure it as a red team. The review agent’s instructions are explicitly adversarial: find injection vectors, test for over-scoped permissions, probe for data leakage across trust boundaries, and verify that failures are contained rather than cascading. It’s the confused deputy audit, automated and running on every commit.

The critical design constraint is separation of context. If the review agent has access to the implementation agent’s instructions, it will rationalize the same assumptions. We treat the review agent the way we’d treat an external penetration tester: here’s the artifact, here’s what it’s supposed to do, now find what’s wrong with it.
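The separation can be enforced mechanically rather than left to discipline. Here is a minimal sketch; `run_agent` is a hypothetical stand-in for whatever LLM client you use, and the red-team prompt is illustrative. The property that matters is what the reviewer's context does and does not contain:

```python
# A minimal sketch of context separation for adversarial review.
# `run_agent` is a hypothetical placeholder, not a real client API.

RED_TEAM_PROMPT = (
    "You are an adversarial security reviewer. Find injection vectors, "
    "over-scoped permissions, data leakage across trust boundaries, and "
    "failure modes that cascade instead of being contained."
)

def run_agent(system_prompt: str, context: dict) -> str:
    """Placeholder for an LLM call; echoes the context it was given."""
    return f"{system_prompt}\n---\n{context}"

def adversarial_review(artifact: str, spec: str, implementation_prompt: str) -> str:
    # The reviewer receives the artifact and its intended behavior only.
    # The implementation agent's instructions are deliberately withheld,
    # so the reviewer cannot rationalize the same assumptions.
    review_context = {"artifact": artifact, "intended_behavior": spec}
    assert implementation_prompt not in str(review_context)
    return run_agent(RED_TEAM_PROMPT, review_context)
```

The assertion is the external-pentester contract in code form: the implementation prompt physically cannot leak into the reviewer's context without the pipeline failing.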

This exists today

If this sounds like a lot of custom infrastructure, the tooling is already catching up. Claude Code ships a built-in /security-review command that runs adversarial security analysis directly in your terminal, before you commit. You run it, it scans your pending changes for SQL injection, XSS, authentication and authorization flaws, insecure data handling, and dependency vulnerabilities, then explains what it found and why it matters. The same analysis is available as a GitHub Action that runs on every pull request in CI, which means the adversarial review layer we described above can be operational in your pipeline within minutes, with no custom agent orchestration required.

During Anthropic’s internal testing, /security-review caught a DNS rebinding vulnerability in a local HTTP server that would have enabled remote code execution, and flagged an SSRF exposure in a credential management proxy, both before the PRs merged. These are exactly the kind of findings that slip past manual reviewers under queue pressure: the code looks reasonable, the tests pass, and the vulnerability hides in the interaction between components rather than in any single function.

This is the entry point for teams that want to start building adversarial review into their workflow today. The custom multi-agent architecture we’re building handles more complex scenarios: cross-service trust boundaries, multi-step tool chains, and agent-to-agent interactions. But /security-review covers the foundation that most teams are still missing entirely.

Scenario holdouts: tests the agent never trained against

There’s a deeper problem with agent-written tests. An agent that writes both the implementation and the test suite can trivially satisfy its own criteria. The most obvious failure is assert true, but the subtle version is worse: tests that are structurally correct but test only the paths the implementation agent already optimized for. The code passes every test because the tests were born from the same reasoning that produced the code.

We borrowed a concept from machine learning: holdout sets. Our test scenarios live outside the codebase, maintained separately, invisible to the implementation agents. They’re end-to-end user stories that describe what the software should do in plain language, and a separate validation agent executes them against the built artifact. Code that passes scenarios it was never exposed to during development is real evidence of correctness, and when it doesn’t, the failures are informative rather than circular.

This separation of concerns is easy to skip and hard to retrofit, which is why it needs to be a first-class design decision from the start. The holdout principle isn’t new; ML engineers have understood for decades that you can’t evaluate a model on its training data. The same logic applies when agents write code.

Digital twins for blast radius containment

Part 1 documented what happens when a confused deputy hits production: the Supabase MCP agent that read a private integration_tokens table and wrote it into a user-visible support thread. That agent was running against real services with real credentials, so the blast radius was total, because the testing environment was production.

We’re building behavioral clones of the services our agents interact with, full API replicas that go well beyond traditional mocks to simulate edge cases, failure modes, permission boundaries, and rate limits. Agents under test run against these twins, never against production. A confused deputy attacking a digital twin exfiltrates fake data from a fake database to a fake communication channel, preserving a realistic attack surface with a contained blast radius. The risk you take on in exchange is the fidelity gap: if the twin doesn’t replicate a permission boundary correctly, you get a clean test against a surface that doesn’t match production. Building twins that are accurate enough to catch real vulnerabilities, rather than just plausible enough to pass, is where most of the engineering effort goes.
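A toy version makes the containment concrete. This sketch enforces a permission boundary and a rate limit the way a real API would, but holds only fake data; the class, roles, and table names are illustrative, not a real service surface:

```python
# A toy behavioral twin: same ACL and rate-limit behavior as the real
# service, but every row in it is fake. Names here are illustrative.
class ServiceTwin:
    def __init__(self, rate_limit: int = 3):
        self.rate_limit = rate_limit
        self.calls = 0
        # Fake rows standing in for something like integration_tokens.
        self.tables = {
            "public_posts": [{"id": 1, "body": "hello"}],
            "integration_tokens": [{"user": "alice", "token": "FAKE-TOKEN"}],
        }
        self.acl = {"anon": {"public_posts"}, "service_role": set(self.tables)}

    def query(self, role: str, table: str):
        self.calls += 1
        if self.calls > self.rate_limit:
            raise RuntimeError("429: rate limited")   # simulated failure mode
        if table not in self.acl.get(role, set()):
            raise PermissionError(f"{role} may not read {table}")
        return self.tables[table]
```

A confused deputy running against this twin with the service_role still "succeeds" at reading the sensitive table, which is exactly what you want: the vulnerability surfaces, the exfiltrated token is fake, and the incident is contained.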

The economics of this approach changed dramatically in the past year. Building a high-fidelity API clone used to require weeks of engineering time that we couldn’t justify. Now an agent can ingest the API documentation and produce a working behavioral clone in hours. We can test permission escalation paths, race conditions, and cross-service data leakage at volumes that would get us rate-limited or banned on the real APIs.

The ledger as the security model

Agent security needs to prioritize observability over prevention, because the attack surface is the agent’s own reasoning process. In practice, that means every agent action, every tool call, every input and output, logged to an append-only, immutable store. This log serves as the actual security layer that the rest of the system depends on, the foundation that makes every other defense verifiable.

When a review agent flags an anomaly, the audit trail provides the full decision chain: what the agent saw, what it decided, what it called, and what came back. A failed scenario holdout reveals exactly where the reasoning diverged from expected behavior, and if a digital twin catches a permission escalation, the captured sequence lets us patch the policy, not just the symptom.

The principle is older than any of these tools: write-ahead logs, event sourcing, WORM storage all exist because systems that can’t explain their past can’t be trusted with their future. Agent systems need the same primitive, applied to a new substrate. Every tool call, every input transformation, every output, captured in an append-only record that outlives the session that produced it.
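The append-only property can be made verifiable rather than aspirational by chaining records together, in the spirit of a write-ahead log. A minimal sketch; field names are illustrative, and a production store would add durable, write-once media behind it:

```python
# A minimal hash-chained, append-only audit ledger. Each record commits
# to the previous one, so any tampering with history breaks verification.
import hashlib
import json
import time

class AuditLedger:
    def __init__(self):
        self._records = []

    def append(self, action: str, payload: dict) -> dict:
        prev = self._records[-1]["digest"] if self._records else "genesis"
        record = {"ts": time.time(), "action": action,
                  "payload": payload, "prev": prev}
        record["digest"] = hashlib.sha256(
            json.dumps(record, sort_keys=True).encode()).hexdigest()
        self._records.append(record)
        return record

    def verify(self) -> bool:
        """Recompute every digest; any edit to past records fails the chain."""
        prev = "genesis"
        for r in self._records:
            body = {k: v for k, v in r.items() if k != "digest"}
            expected = hashlib.sha256(
                json.dumps(body, sort_keys=True).encode()).hexdigest()
            if r["prev"] != prev or r["digest"] != expected:
                return False
            prev = r["digest"]
        return True
```

Logging every tool call and result through something like this means the question "what did the agent actually do?" has an answer that can be checked, not just asserted.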

What “done” means now

We’re actively making this transition, and it changes more than the toolchain. It changes the definition of “done.” A feature is complete only when the adversarial reviewer couldn’t break it, the holdout scenarios passed without exposure, and the audit trail is clean. Engineers on the team are shifting from line-by-line code review to designing the review systems, the scenarios, the twin architectures, and the policies that govern what agents are allowed to do.


The security model the agent era requires is trust infrastructure that runs at the same speed as the agents it governs, designed from the ground up rather than bolted on after the fact. We’re building it because the alternative is another integration_tokens incident, another agent with production credentials and no guardrails, another team learning about a data leak from a customer support ticket instead of an audit log.