
A few days ago I watched an agent do something that perfectly captured the problem.
It wasn’t an attack. It wasn’t “AI gone rogue.” It was just normal agent behavior under ambiguity.
I asked it to fix a failing build and clean up whatever was causing it. The steps looked reasonable. The log looked clean. The diff looked tidy. And then a config that should have been treated as an invariant got “cleaned up” too.
We had perfect visibility into what happened.
We did not have control over whether it could happen.
That is the mismatch I want to name in this post.
In the last post I described the control gap: agents operate at machine speed, oversight operates at human speed, and the gap between those two keeps widening.
This post is about where that gap actually lives - and why the controls we reach for first keep failing once agents have real tools.
A concrete model: three planes of control
Most teams try to govern agents using one of two approaches:
- shape intent before the run (prompts, policies, best practices)
- explain what happened after the run (logs, traces, postmortems)
Those are necessary.
But they are not the missing piece.
The missing piece is the middle: control during execution, at the moment actions are taken.
Here is the simplest model I’ve found that maps cleanly onto reality:
1) Intent controls (before)
What we want the agent to do.
- prompts and system instructions
- “rules” written in natural language
- developer conventions
- guardrails that exist only as text
Intent controls influence behavior. They do not constrain capability.
2) Execution controls (during)
What the agent can actually do.
- which tools exist (shell, filesystem, network, APIs)
- what those tools are allowed to touch (paths, domains, accounts, environments)
- what permissions the agent runs with (credentials, tokens, access scopes)
- which actions require escalation (and what escalation looks like)
Execution controls are the difference between “the agent shouldn’t do that” and “the agent can’t do that.”
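To make that difference concrete, here is a minimal sketch of an execution control living in the tool layer itself. Everything here is hypothetical (the paths, the `write_allowed` helper, the protected file): the point is only that the check runs mechanically at the moment of action, regardless of what the prompt said.

```python
from pathlib import PurePosixPath

# Hypothetical policy: the agent's file tool may only write inside one
# repository root, and one config file is treated as an invariant that
# requires escalation. A real deployment would load this from config.
ALLOWED_ROOT = PurePosixPath("/workspace/repo")
PROTECTED = {PurePosixPath("/workspace/repo/config/production.yaml")}

def write_allowed(path: str) -> bool:
    """Return True only if the path is inside scope and not protected."""
    target = PurePosixPath(path)
    if ".." in target.parts:
        # Reject traversal outright; real code would resolve symlinks too.
        return False
    return ALLOWED_ROOT in target.parents and target not in PROTECTED
```

The wrapper doesn't reason about intent at all. A write to the protected config fails even if the agent concluded, plausibly, that the file was safe to clean up.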
3) Audit controls (after)
What happened, and how we learn from it and prove it.
- logs, traces, diffs, provenance
- forensics and incident response
- compliance, change management, accountability
Audit controls make systems governable. They do not make them safe.
If you only have prompts before and logs after, you don’t have control. You have hope and hindsight.
Why agents break the assumptions behind traditional control
Traditional software is easier to govern because most of its behavior is deterministic. Even when systems are complex, the runtime behavior is constrained by code paths we can reason about.
Agentic systems break that assumption.
The “program” is not a fixed code path.
The agent is making decisions probabilistically, in a loop, based on whatever context it is reading and whatever tool output it just saw.
And increasingly, agents do not just recommend actions.
They take them.
Once an agent has tools, it has real hands:
- it can touch the filesystem
- it can run processes
- it can fetch or exfiltrate data over the network
- it can mutate state in external systems via APIs
At that point, bad outputs matter less than bad actions.
Why intent controls fail once agents have real tools
When teams say “we’ll add guardrails,” they usually mean “we’ll add more instructions.”
That works surprisingly well for assistive systems, where the output is text and the user is still the actuator.
It breaks down when the system is the actuator.
1) Agents don’t just follow the prompt - they follow the prompt plus everything around it
Agents don’t only consume your instruction. They consume:
- READMEs and docs
- tickets and PR comments
- web pages
- tool output, logs, stack traces
Security boundaries often rely on separating instructions from data.
Agents are designed to blur that line. They are built to treat text as actionable context. That is what makes them helpful.
It is also what makes them vulnerable.
Untrusted text can steer how delegated authority is exercised. (If you like formal names for problems, this starts to look a lot like the "confused deputy" pattern in complex tool chains.)
2) “Policy in natural language” is not policy
A policy that is not enforced at the point of action is not a policy.
It is a suggestion.
You can write:
- “never delete important files”
- “don’t exfiltrate secrets”
- “confirm before running dangerous commands”
…and still get a destructive outcome because the agent made a technically plausible assumption:
- “this looked like generated output”
- “this token seemed like a test key”
- “this directory seemed safe to reset”
- “this refactor seemed equivalent”
The failure mode is rarely malicious.
Often it is technically reasonable.
But the outcome can still be painful.
3) Intent controls don’t compose
Even if each instruction is individually reasonable, agents compose them under time pressure.
“Fix the failing test.”
“Clean up whatever is causing it.”
“Update the dependency.”
“Remove unused config.”
Those are normal tasks. In combination, with broad privileges, they can create an unsafe path.
The risk is not one bad instruction. The risk is a plausible chain of small decisions executed quickly.
Why after-the-fact controls fail as prevention
Auditability matters. It’s essential for debugging, governance, and compliance.
But it doesn’t solve the core risk:
- logging a credential read doesn’t undo the credential read
- tracing a destructive API call doesn’t roll it back
- a perfect postmortem doesn’t restore what was lost
Audit controls answer “what happened?”
Execution controls answer “can this happen?”
A lot of current “agent safety” is basically trying to use observability as a substitute for constraint.
It’s valuable. It is not enough.
So what does control during execution actually mean?
Execution-time control is not mystical. It is classic operational security applied to agentic workloads.
It means:
- capabilities are explicit: what tools exist? what actions are possible?
- capabilities are scoped: what can those tools touch? which paths, which domains, which accounts, which environments?
- capabilities are enforced at runtime: not in a README, not in a prompt, but in the actual execution environment
- escalation is real: there should be meaningful boundaries between “safe” and “dangerous,” and crossing those boundaries should be deliberate
One concrete example: an agent might be allowed to run shell commands and modify files inside a repository, but unable to make outbound network requests unless the destination domain is explicitly allowlisted.
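That allowlist check can be sketched in a few lines. This is a toy, not a full egress policy (the domains and the `fetch_allowed` helper are illustrative), but it shows the shape: the network tool consults an explicit set of permitted destinations and fails closed on everything else.

```python
from urllib.parse import urlparse

# Hypothetical outbound-network guard for an agent's fetch tool.
# Only allowlisted domains (and their subdomains) are reachable;
# anything else fails closed, no matter what the prompt asked for.
ALLOWED_DOMAINS = {"pypi.org", "files.pythonhosted.org"}

def fetch_allowed(url: str) -> bool:
    """Return True only if the URL's host is on the explicit allowlist."""
    host = urlparse(url).hostname or ""
    return any(host == d or host.endswith("." + d) for d in ALLOWED_DOMAINS)
```

Note the suffix check matches `files.pythonhosted.org` but not `notpypi.org`; naive substring matching is a classic way these allowlists leak.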
That is the shift:
- from “tell the agent not to do X”
- to “make X mechanically hard (or impossible) unless explicitly intended”
That is what it means to put guardrails at the point of execution.
Not more text.
More constraint.
What comes next
In the next post, I’m going to get much more concrete.
We’ll define a simple risk taxonomy for agent actions (read-only → reversible writes → destructive ops → external side effects), and walk through the execution-time guardrail patterns that actually reduce failures without killing velocity.
Because the real question underneath all of this remains the same:
How do we let agents move at machine speed without forcing humans to surrender control?
That is the control gap.
And the only place it closes is the execution layer.
AgentSH
As part of this work, we're building AgentSH, an open-source runtime exploring execution-time controls for agentic workloads.
If you’re running agents in dev, CI, or production, I’d love to hear:
- what is the action you most want a seatbelt for?
- what’s the failure mode that surprised you most?
- where do you feel like you have "hope and hindsight" instead of control?
Built by Canyon Road
We build Beacon and AgentSH to give security teams runtime control over AI tools and agents, whether supervised on endpoints or running unsupervised at scale. Policy enforced at the point of execution, not the prompt.