Governance for AI agents

A practical guide for engineers and architects evaluating how to keep autonomous agents safe in production. What governance actually means at the action layer, the four layers it composes from, the failure modes each layer catches, and how to evaluate the tooling.

By Louis Bryson · 12 min read

What AI agent governance actually means

An AI agent is an LLM wrapped in a loop that decides which tools to call, with what arguments, in what order, until a goal is reached. The model chooses; the runtime executes. Governance for AI agents is the set of constraints, controls, and audit mechanisms that decide which of those choices are actually allowed to execute — which tools, with which arguments, by which agent, on which data, within which budget, spawning which sub-agents.
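
A minimal sketch makes that division of labour concrete. The helper names below (call_model, authorise, execute_tool) are illustrative stand-ins, not any framework's real API; the point is where the gate sits: inside the loop, before anything executes.

```python
# Minimal agent loop with a governance gate. `call_model`, `authorise`, and
# `execute_tool` are hypothetical stand-ins, not any framework's real API.

def run_agent(goal: str, max_steps: int = 10) -> str:
    history = [{"role": "user", "content": goal}]
    for _ in range(max_steps):
        step = call_model(history)               # the model chooses an action
        if step.kind == "final_answer":
            return step.content
        event = {"action": "tool_call", "tool": step.tool, "args": step.args}
        if authorise(event) != "allow":          # the runtime decides; anything
            history.append({"role": "tool",      # but an explicit allow is a deny
                            "content": "denied by policy"})
            continue
        result = execute_tool(step.tool, step.args)
        history.append({"role": "tool", "content": str(result)})
    raise RuntimeError("step budget exhausted")
```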

The phrase is sometimes used interchangeably with AI safety or AI alignment, but those are research problems about model behaviour. Governance is an engineering discipline about action authorisation at runtime. It treats the agent the way a platform team treats any other privileged caller in the system: as something whose actions must be authorised, logged, and reviewable.

Three things have made this urgent in 2026. First, agents now routinely touch production systems — code repositories, internal APIs, payment rails, infrastructure, customer data — instead of just chat. Second, MCP and similar standards have made tools portable, so the same agent can invoke dozens of external services with no central control plane. Third, regulation is catching up: the EU AI Act, ISO 42001, and SOC 2's expanded AI guidance all require evidence of who decided what, why, and on what policy — evidence that prompt-engineered safety cannot produce.

Why prompts are not governance

The default pattern for "safe" agents looks like this: a long system prompt describes what the agent is and isn't allowed to do, plus a hope that the model will follow it. That works until the input shifts, the model is updated, or someone writes "ignore previous instructions". The constraint lives inside a probabilistic system, in natural language, with no audit trail explaining any specific decision.

For consumer chat that's a workable trade-off. For an agent that touches a production database, sends money, deploys infrastructure, or acts on a user's behalf, it isn't. Production governance moves the allow / deny / require-approval decisions out of the prompt and into deterministic infrastructure, where they are reviewable, testable, and reproducible.

The four layers of agent governance

Production agent stacks compose four governance layers. Each catches a different class of failure; none of them is sufficient on its own.

1. Prompt layer

The system prompt and tool descriptions define what the model believes it is and isn't supposed to do. This is the cheapest layer to add and the easiest to bypass. It is useful for tone, refusal behaviour, and as documentation for the model — but constraints expressed here are unenforceable under adversarial input. Treat the prompt as a hint, not a boundary.

Catches: casual misuse, off-topic requests, basic refusal scenarios. Misses: prompt injection, jailbreaks, model drift across versions, anything where an attacker controls input.

2. Output filter layer

Validators that scan the model's response before returning it to the user or to a downstream tool. Tools in this category include Guardrails AI, LLM Guard, and provider-native content filters. Useful for PII redaction, toxicity scoring, schema validation of structured responses, and competitor-mention blocking.

Catches: bad text reaching users, malformed structured output, leaked secrets in responses. Misses: any failure that already happened by the time the model produced text — for example, a destructive tool call invoked mid-loop.
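
As a toy illustration of the shape of this layer, the filter below assumes a structured response with a required summary field; real deployments use the dedicated validators named above rather than hand-rolled regexes.

```python
import json
import re

# Toy output filter: redact obvious secret-shaped strings, then validate
# structure. The patterns and the required `summary` field are illustrative.
SECRET_PATTERNS = [
    re.compile(r"sk-[A-Za-z0-9]{20,}"),      # API-key-shaped strings
    re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),    # US-SSN-shaped strings
]

def filter_output(raw: str) -> dict:
    for pattern in SECRET_PATTERNS:
        raw = pattern.sub("[REDACTED]", raw)
    payload = json.loads(raw)                 # malformed output fails closed here
    if not isinstance(payload.get("summary"), str):
        raise ValueError("response missing required 'summary' field")
    # note: a destructive tool call made earlier in the loop is invisible here
    return payload
```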

3. Action policy layer

Every tool call, sub-agent spawn, delegation, and resource request the agent attempts is converted into a structured event and evaluated against a policy engine before execution. This is where deterministic authorisation lives. The most established implementation uses OPA and the Rego policy language — the same engine many platform teams already run for Kubernetes admission control and service authorisation. Kite Logik sits in this layer for Python agents.

Catches: unauthorised tool calls, scope violations, budget overruns, runaway delegation, unapproved high-impact actions, cross-tenant data access. Misses: what the model says (that's the output filter's job) and where the network traffic ultimately goes (that's the network layer's job).
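
A minimal sketch of the pattern, assuming an OPA sidecar on its default port; the agent/authz package path and the event fields are assumptions, not a fixed schema.

```python
import requests  # assumes OPA running as a local sidecar on its default port

# An attempted tool call becomes a structured event and is evaluated before
# execution. A matching Rego policy (also illustrative) might be:
#
#   package agent.authz
#   import rego.v1
#
#   default decision := "deny"
#   decision := "allow" if input.tool in {"search_docs", "read_ticket"}

OPA_URL = "http://localhost:8181/v1/data/agent/authz/decision"

def authorise(event: dict) -> str:
    try:
        resp = requests.post(OPA_URL, json={"input": event}, timeout=2)
        resp.raise_for_status()
        return resp.json().get("result", "deny")   # missing result: deny
    except requests.RequestException:
        return "deny"                              # engine outage: deny
```

The two deny fallbacks are the fail-closed posture described under implementation patterns below: an evaluation error costs a retry, never an unbounded allow.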

4. Network egress layer

Outbound traffic from the agent runtime is routed through an egress proxy or service mesh that allowlists destinations, scrubs credentials, and meters bandwidth. This is the same pattern used for any sensitive service — agents just deserve it more, because their destinations are chosen at runtime by an LLM.

Catches: exfiltration attempts to unknown hosts, leaked API keys hitting attacker-controlled endpoints, sensitive data crossing tier boundaries. Misses: anything happening inside the agent process before traffic leaves.
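
In production this check belongs outside the agent process, in a proxy or mesh a compromised runtime cannot bypass. Purely to illustrate the decision being made, a toy in-process version might look like this (hostnames are placeholders):

```python
from urllib.parse import urlparse

import requests

# Toy in-process stand-in for an egress allowlist; hostnames are placeholders.
# A real deployment enforces this in an egress proxy or service mesh.
ALLOWED_HOSTS = {"api.internal.example.com", "docs.example.com"}

def guarded_get(url: str) -> requests.Response:
    host = urlparse(url).hostname
    if host not in ALLOWED_HOSTS:
        raise PermissionError(f"egress to {host!r} is not allowlisted")
    return requests.get(url, timeout=10)
```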

Layer summary

| Layer | Where it sits | Decision shape |
| --- | --- | --- |
| Prompt | Inside the model's context window | Probabilistic — the model decides whether to follow it |
| Output filter | After the model responds, before downstream consumers see it | Deterministic — text patterns, schemas, classifiers |
| Action policy | Between the agent loop and tool execution | Deterministic — Rego policy on structured action events |
| Network egress | Outbound network boundary of the agent runtime | Deterministic — host allowlists, mTLS, traffic policy |

Failure modes governance must address

Concretely, what does governance prevent? The list below is drawn from production incidents and from the threat models in the OWASP Top 10 for LLM Applications and NIST AI RMF.

  • Tool misuse. The agent invokes a tool it shouldn't have access to, or passes arguments outside the allowed shape — writes to a path it can't write to, sends an email to an external domain, deletes a record without confirmation.
  • Delegation runaway. An agent spawns sub-agents that spawn sub-agents, each with the same or broader scope than the parent. Without a policy on delegation depth and scope-narrowing, one task balloons into thousands of recursive invocations.
  • Sensitive data exfiltration. The agent is tricked into summarising confidential data and sending it to an external tool, an attacker-controlled webhook, or a model provider in a different compliance tier.
  • Plan hijack via prompt injection. Untrusted text in a tool result (a web page, an email body, a database row) instructs the model to take an action the original user never asked for.
  • Budget overruns. A loop spends more tokens, more API calls, or more wall-clock time than was authorised — sometimes by orders of magnitude. A minimal in-loop guard for this and for delegation runaway is sketched after this list.
  • Missing audit trail. When something goes wrong, there is no reproducible record of what the agent attempted, what was allowed, what was denied, and on what policy version. Compliance frameworks treat this gap as the failure itself.
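
Where no policy engine is in place, the budget and delegation limits above can at least be enforced in application code. A minimal sketch, with illustrative field names and limits; in a policy-engine setup these thresholds would live in versioned policy data instead.

```python
import time
from dataclasses import dataclass, field

# Hypothetical in-loop guard for budget overruns and delegation runaway.
# Limits and field names are illustrative.
@dataclass
class RunBudget:
    max_tokens: int = 100_000
    max_tool_calls: int = 50
    max_seconds: float = 300.0
    max_delegation_depth: int = 2
    tokens_used: int = 0
    tool_calls_made: int = 0
    started_at: float = field(default_factory=time.monotonic)

    def check(self, delegation_depth: int) -> None:
        """Raise before the next action if any authorised limit is exceeded."""
        if self.tokens_used >= self.max_tokens:
            raise PermissionError("token budget exhausted")
        if self.tool_calls_made >= self.max_tool_calls:
            raise PermissionError("tool-call budget exhausted")
        if time.monotonic() - self.started_at >= self.max_seconds:
            raise PermissionError("wall-clock budget exhausted")
        if delegation_depth > self.max_delegation_depth:
            raise PermissionError("delegation depth limit exceeded")
```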

Implementation patterns

Mature production stacks share a small set of patterns:

  • Defence in depth, fail-closed. Multiple layers, each configured so that an evaluation error or service outage results in deny rather than allow. The cost of a false deny is a retry; the cost of a false allow can be unbounded.
  • Policy-as-code. The rules live in version-controlled files (typically Rego), reviewed via pull request, tested in CI, deployed with the rest of the application. Same workflow as the rest of the stack — see policy-as-code for AI agents for a deeper treatment.
  • Pre-flight plan validation. Where the agent produces an explicit multi-step plan, the plan itself is evaluated against policy before any step executes — so a plan with too many write operations or an unauthorised tool can be rejected as a whole rather than caught mid-execution. A sketch follows this list.
  • Human-in-the-loop gates. Some actions don't fit allow/deny — they fit "require approval". The policy engine returns a third decision, the runtime pauses, and a human approves or rejects out-of-band.
  • Immutable audit trail. Every governed event — input, policy version, decision, latency, downstream effect — is recorded to append-only storage. This is what regulators, auditors, and incident responders actually consume.
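
A sketch combining pre-flight validation with the approval gate; evaluate_policy, wait_for_approval, and record_decision are hypothetical stand-ins for the policy engine, the out-of-band approval channel, and the append-only audit store.

```python
# Pre-flight plan validation with a three-valued decision. All three helpers
# are hypothetical stand-ins, as described in the text above.

def validate_plan(plan: list[dict]) -> None:
    """Evaluate every step of an explicit plan before any step executes."""
    for step in plan:
        decision = evaluate_policy({"action": "plan_step", **step})
        record_decision(step, decision)           # audit before acting
        if decision == "deny":
            raise PermissionError(f"plan rejected at step {step['tool']!r}")
        if decision == "require_approval" and not wait_for_approval(step):
            raise PermissionError(f"approval declined for {step['tool']!r}")
    # only after the whole plan passes may the runtime execute step 1
```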

The tooling landscape

Governance tooling for AI agents falls into four broad categories. Production systems combine multiple categories rather than choosing one.

  • Output validators — Guardrails AI, LLM Guard, provider-native content filters. Strong on text-quality and PII concerns at the model output boundary; they do not address what the agent does.
  • Dialogue rails — NeMo Guardrails with the Colang DSL. Strong on conversational flow, topic restriction, and scripted refusal; less suited to enforcing infrastructure-level authorisation across many tools and agents.
  • Action policy engines — OPA-based stacks including Kite Logik, plus framework-specific middleware. Strong on deterministic authorisation of tool calls, spawn, delegation, budgets, and audit logging. See policy-as-code for AI agents and MCP governance for how this layer composes with the rest.
  • Observability platforms — LangSmith, Langfuse, Arize. These are adjacent to governance rather than governance itself: they record what happened, but do not decide what is allowed. They are the consumer of the decision log, not its producer.

Evaluation checklist

Useful questions to ask when evaluating any AI agent governance tool:

  • Does it enforce decisions before the action executes, or only observe after?
  • Are decisions deterministic and reproducible, or do they depend on a model call?
  • Can the same policy govern tool calls, sub-agent spawn, delegation, and resource budgets?
  • Does it produce a structured, append-only audit log per decision?
  • Are policies version-controlled in a language with mature testing tools?
  • Does it fail closed on engine outage, evaluation error, or unknown input?
  • Can the same policy stack govern native tools and MCP tools identically?
  • What is the per-call latency overhead at p99, and is it bounded?
  • Does it integrate with the agent frameworks already in use, without rewriting tools?
  • Is the engine the same one your platform team already runs (e.g. OPA), or a new dependency?

Where Kite Logik fits

Kite Logik is an action-policy engine for Python AI agents. It plugs into the agent frameworks already in use — OpenAI Agents SDK, LangChain, LangGraph, CrewAI, Pydantic AI, and others — turns runtime events into structured policy inputs, evaluates them against your Rego policies, and emits an immutable audit log of every decision. It composes with output validators above it and with network-egress controls below it. Concrete Rego and Python examples live on the examples page.

Frequently asked questions

What is AI agent governance?

AI agent governance is the set of constraints, controls, and audit mechanisms that decide what an autonomous agent is allowed to do — which tools it can call, what arguments it can pass, which sub-agents it can spawn, what data it can move, and how much it can spend. Governance is enforced in infrastructure rather than expressed as instructions inside the model's prompt.

How is AI agent governance different from "AI guardrails"?

"Guardrails" usually refers to validators that operate on LLM input and output — checking prompts for injection patterns, scrubbing PII, blocking unsafe responses. Governance is broader: it covers the agent's actions (tool calls, spawn, delegation, data movement) regardless of what the model said. Guardrails answer "is this text safe?". Governance answers "is this action authorised?".

Do I still need governance if I am using OpenAI's built-in content filters?

OpenAI's content filters constrain the model's output. They do not know which tools your agent can call, what your authorisation rules are, what your token budget is, or whether the user has approved a high-impact action. As soon as your agent has tools, governance becomes a separate concern at the application layer.

What is the difference between prompt-level and action-level governance?

Prompt-level governance asks the model nicely — system prompts, instruction allowlists, tool descriptions. The constraint lives inside a probabilistic system and can be bypassed by adversarial input or model drift. Action-level governance intercepts every attempted tool call and evaluates it against deterministic policy code before execution. The action layer is where decisions become reproducible and auditable.

Is OPA suitable for AI agent policy?

Yes. OPA (Open Policy Agent) is a CNCF graduated project already used at scale for Kubernetes admission control, microservice authorisation, and cloud IAM. The same engine, the same Rego language, and the same decision-log machinery apply to AI agent actions — with the benefit that your platform team likely already runs OPA in production.

How does AI agent governance relate to MCP (Model Context Protocol)?

MCP standardises how agents discover and invoke tools. It does not standardise who is allowed to invoke what. Governance sits in front of the MCP client — every MCP tool call becomes a policy input, evaluated identically to a native tool call. MCP gives you portability of tools; governance gives you control over their use.
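
As a rough sketch of that composition, an MCP invocation can be flattened into the same event shape a native tool call produces, so a single policy covers both. The field names here are assumptions, not part of the MCP spec.

```python
# Illustrative mapping from an MCP tool invocation to a policy input event.
# Field names are assumptions, not part of the MCP specification.

def mcp_call_to_event(server: str, tool: str, arguments: dict,
                      agent_id: str) -> dict:
    return {
        "action": "tool_call",
        "agent_id": agent_id,
        "tool": f"mcp:{server}/{tool}",   # namespaced so policy can target MCP
        "args": arguments,
        "transport": "mcp",
    }
```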

Can multiple governance tools be used together?

Yes, and for production agents you usually need to. Output validators, dialogue rails, action policy engines, and network-egress controls each catch different failure modes. The pattern is defence in depth: prompt-layer hints + output-layer scanners + action-layer policy + network-layer egress controls, each fail-closed.

Louis Bryson
Founder & maintainer, Kite Logik

Engineer focused on production AI agent infrastructure and policy-as-code. Maintains Kite Logik, the open-source OPA/Rego governance layer for Python agents.
