The Policy OS Contract Bundle

Executive Thesis

An agent is not an operating model.

A connector can expose useful capability. A workflow builder can make an interaction easy to inspect. A coding agent can work inside a repository. An SDK-backed service can route tools, store state, pause for approval, and emit traces.

Those surfaces matter, but they do not answer the practical questions a team needs before giving AI real work:

What systems may this workflow read?
What actions may run without approval?
What actions must pause for a named owner?
What actions are out of scope?
What counts as a successful outcome?
What evidence proves the workflow behaved correctly?
What changes when the workflow moves to a different agent surface?

The answer is the Policy OS contract bundle: a small, explicit artifact family that travels with the workflow.

The bundle is not documentation after the fact. It is the operating boundary. It defines the workflow before autonomy expands, keeps the workflow portable across agent clients, and gives humans a concrete object to review when the system changes.

The practical recommendation is simple: before asking which agent should run the work, ask whether the work has a contract bundle.

What This Paper Gives You

Use this paper when an AI workflow is past the demo stage but not yet safe to treat as an operating system.

It gives you three practical outputs:

A way to explain why "we connected the tools" is not the same as "the workflow is governed."
A five-artifact bundle that turns access, behavior, success, regression testing, and operations into reviewable objects.
A migration rule for moving work between Dify, MCP servers, repo-owned services, coding agents, and SDK-backed orchestration without losing the policy boundary.

The target reader is not only an engineer. It is also the operator, founder, product lead, or client sponsor who needs to know whether an AI workflow can be trusted with real work.

Why Agent Projects Drift

Most agent projects drift because the team chooses a surface before it names the work.

The surface can be impressive:

a chat app connected to business systems
a Dify workflow with MCP server cards
a Codex setup for repository work
a Claude or Cursor configuration for local development
an SDK-backed service with durable state and traces

But the same failure mode appears across all of them.

The agent can do something, but nobody can say exactly what it is allowed to do.

The team has tool access, but not a tool boundary. It has prompts, but not policy ownership. It has a demo, but not regression cases. It has a user interface, but not an escalation model. It has a deployment, but not a rollback path.

When the workflow breaks, the debugging conversation gets vague:

Was the data missing?
Did the tool fail?
Did the model choose the wrong action?
Was the policy too permissive?
Did the operator approve the wrong thing?
Did the runtime surface stop matching the workflow?

Without a contract bundle, those questions collapse into one unhelpful conclusion: the agent did not work.

Policy OS separates them.

The Bundle In One Page

Every governed AI workflow should ship with five core artifacts.

Artifact	Job
`mcp_contract.yaml`	Defines tools, resources, prompts, auth scopes, schemas, and error behavior.
`agent_contract.yaml`	Defines allowed tools, approval mode, escalation triggers, guardrails, runtime surface, and graduation status.
`outcome_contract.md`	Defines the job to be done, success metrics, fallback path, ownership boundary, and review cadence.
`golden_tasks.yaml`	Defines repeatable examples that prove the workflow still behaves correctly.
`runbook.md`	Defines setup, operation, incident response, rollback, and human handoff.

This bundle is small enough to write for one workflow and strong enough to govern real work.
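
As a minimal sketch of how a team might check that a bundle is present before review, the snippet below looks for the five filenames from the table above in one directory. The function name `missing_artifacts` and the one-directory-per-workflow layout are assumptions for illustration, not part of Policy OS.

```python
from pathlib import Path

# The five artifact filenames come from the table above.
REQUIRED_ARTIFACTS = (
    "mcp_contract.yaml",
    "agent_contract.yaml",
    "outcome_contract.md",
    "golden_tasks.yaml",
    "runbook.md",
)


def missing_artifacts(bundle_dir: str) -> list[str]:
    """Return the required artifacts that are absent or empty in bundle_dir."""
    root = Path(bundle_dir)
    missing = []
    for name in REQUIRED_ARTIFACTS:
        path = root / name
        # An empty file is treated as missing: it gives a reviewer nothing to read.
        if not path.is_file() or path.stat().st_size == 0:
            missing.append(name)
    return missing
```

A check like this only proves the files exist. Whether their content answers the review questions still needs a human reader.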

The bundle is finished when a reviewer can answer five questions without reading the implementation:

What may this workflow access?
What may it do?
When must it ask?
What result is it trying to produce?
What evidence proves the behavior still holds?

It also changes the buyer conversation. The question stops being "Do you have an AI agent?" and becomes:

Which workflows are approved?
Which tools are available?
Which writes need approval?
Which failures stop the system?
Which tests prove the behavior?
Which runtime currently owns the workflow?
Which evidence supports graduation or rollback?

Those questions are more useful than a demo because they expose whether the system can survive contact with operations.

The MCP Contract: What Exists

The `mcp_contract.yaml` file is the connectivity boundary.

It describes what the workflow can see and call:

tool names
resource URI patterns
prompt IDs
auth scopes
input and output schemas
error model
ownership of external connections

This contract is where MCP earns its place in the architecture. MCP makes capability explicit. It gives agent clients a typed way to discover tools, resources, and prompts instead of hiding integration logic inside a general-purpose prompt.

But the MCP contract should not be treated as the whole system.

MCP answers what can be exposed. The rest of the bundle answers what should happen with that exposure.

For a support workflow, the MCP contract might expose tools to read tickets, search customer history, draft replies, post replies, and update account tags. That tool list is necessary. It is not enough.

The governing question is not "Can the agent post a reply?" It is "Under which policy, for which class of ticket, with which approval owner, and with what receipt?"

That answer belongs in the rest of the bundle.
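
The support example can be sketched as a small capability inventory. This is an illustration under stated assumptions: the `ToolSpec` fields, the tool names, the scope strings, and the choice to treat drafting as a non-write are all hypothetical, and the code does not use any real MCP library.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class ToolSpec:
    """One exposed tool. Field names are hypothetical, not an MCP schema."""
    name: str
    auth_scope: str
    writes: bool
    input_schema: dict = field(default_factory=dict)
    output_schema: dict = field(default_factory=dict)


# Hypothetical inventory for the support workflow described above.
SUPPORT_TOOLS = [
    ToolSpec("read_ticket", "tickets:read", writes=False),
    ToolSpec("search_customer_history", "customers:read", writes=False),
    ToolSpec("draft_reply", "replies:draft", writes=False),  # assumed non-write
    ToolSpec("post_reply", "replies:write", writes=True),
    ToolSpec("update_account_tags", "accounts:write", writes=True),
]


def contract_problems(tools: list[ToolSpec]) -> list[str]:
    """Flag duplicate tool names and tools that declare no auth scope."""
    problems = []
    seen = set()
    for tool in tools:
        if tool.name in seen:
            problems.append(f"duplicate tool name: {tool.name}")
        seen.add(tool.name)
        if not tool.auth_scope:
            problems.append(f"{tool.name}: missing auth scope")
    return problems


def write_tools(tools: list[ToolSpec]) -> list[str]:
    """List the tools that change external state."""
    return [tool.name for tool in tools if tool.writes]
```

The inventory can say that `post_reply` exists and that it writes. It cannot say who approves the write or for which class of ticket, which is why the policy lives elsewhere in the bundle.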

The Agent Contract: What May Happen

The `agent_contract.yaml` file is the behavior boundary.

It tells the agent which actions belong inside the workflow and which actions do not. It names:

allowed tools
blocked tools
approval mode
write-operation behavior
destructive-operation behavior
escalation triggers
budget and latency guardrails
policy artifacts
runtime surface
graduation status

The runtime fields matter because the same workflow may move across surfaces over time.

A workflow might start in Dify because the client needs visual editing, app publishing, Service API access, MCP server cards, and non-engineer inspection. The same workflow might later move one risky step behind a repo-owned service because it needs queues, tenant boundaries, durable state, custom endpoints, or package-local validation. A mature path might graduate a portion of orchestration into an OpenAI Agents SDK service when tool routing, approval pauses, traces, evals, and CI-backed golden tasks justify the platform burden.

The agent contract prevents that migration from becoming a silent rewrite.

It should always be possible to answer:

What is the current runtime surface?
Is the workflow a prototype, a Dify-first path, an SDK candidate, an SDK-graduated path, or a rollback-required path?
Which platform affordances would be lost in a move?
Which evidence justifies the change?

Runtime choice is part of governance. It is not an implementation detail.
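
One way to picture the agent contract is as a typed object that carries both the tool policy and the runtime fields. The sketch below is an assumption-laden illustration: the class and field names are hypothetical, the status values mirror the list in the question above, and the default-deny rule for unlisted tools is a design choice of this example rather than something the contract format prescribes.

```python
from dataclasses import dataclass
from enum import Enum


class Decision(Enum):
    ALLOW = "allow"
    NEEDS_APPROVAL = "needs_approval"
    BLOCK = "block"


class GraduationStatus(Enum):
    PROTOTYPE = "prototype"
    DIFY_FIRST = "dify_first"
    SDK_CANDIDATE = "sdk_candidate"
    SDK_GRADUATED = "sdk_graduated"
    ROLLBACK_REQUIRED = "rollback_required"


@dataclass(frozen=True)
class AgentContract:
    allowed_tools: frozenset
    approval_tools: frozenset
    blocked_tools: frozenset
    runtime_surface: str
    graduation_status: GraduationStatus

    def decide(self, tool_name: str) -> Decision:
        # Blocked wins over approval, and approval wins over allowed,
        # so a tool listed twice resolves to the stricter outcome.
        if tool_name in self.blocked_tools:
            return Decision.BLOCK
        if tool_name in self.approval_tools:
            return Decision.NEEDS_APPROVAL
        if tool_name in self.allowed_tools:
            return Decision.ALLOW
        # Assumption: anything the contract does not name is out of scope.
        return Decision.BLOCK


# Hypothetical contract for a support workflow running on Dify.
contract = AgentContract(
    allowed_tools=frozenset({"read_ticket", "draft_reply"}),
    approval_tools=frozenset({"post_reply"}),
    blocked_tools=frozenset({"update_account_tags"}),
    runtime_surface="dify",
    graduation_status=GraduationStatus.DIFY_FIRST,
)
```

Because `runtime_surface` and `graduation_status` sit in the same object as the tool policy, a move to another surface shows up as a reviewable change to the contract instead of a silent rewrite.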

The Outcome Contract: What Success Means

The `outcome_contract.md` file is the business boundary.

It names the workflow in human terms:

the target workflow
the user or operator
the manual fallback path
the success metrics
the ownership boundary
the review cadence
the customer-facing language

This contract prevents a common mistake: measuring the agent instead of measuring the workflow.

An agent can produce a polished answer while the workflow remains broken. A support reply can be well-written and still violate policy. A research brief can be fluent and still omit required evidence. A data-sync task can complete and still write to the wrong system.

The outcome contract says what matters outside the model.

For example, a billing triage workflow might define success as:

reads authorized email and account context only
classifies refund, deletion, payment, and escalation cases
drafts a reply without posting it
routes account deletion to human review
refuses to request or expose secrets
logs evidence for every proposed action
completes under a defined latency and cost budget

That is more useful than "the agent answers billing questions." It gives the system something to pass or fail.
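
To show what "pass or fail" could look like in practice, here is a minimal sketch that checks one recorded run against the billing triage criteria listed above. The `RunRecord` fields and the failure messages are hypothetical, and the latency and cost limits are passed in as parameters because the text leaves their values to the team.

```python
from dataclasses import dataclass

# Categories named in the success criteria above.
KNOWN_CATEGORIES = {"refund", "deletion", "payment", "escalation"}


@dataclass(frozen=True)
class RunRecord:
    """What one workflow run left behind. Field names are hypothetical."""
    category: str
    unauthorized_reads: int
    reply_posted: bool
    routed_to_human: bool
    secret_exposed: bool
    evidence_logged: bool
    latency_seconds: float
    cost_usd: float


def outcome_failures(run: RunRecord, max_latency_seconds: float,
                     max_cost_usd: float) -> list[str]:
    """Return every success criterion the run violated (empty means pass)."""
    failures = []
    if run.category not in KNOWN_CATEGORIES:
        failures.append("case was not classified into a known category")
    if run.unauthorized_reads > 0:
        failures.append("read data outside the authorized context")
    if run.reply_posted:
        failures.append("reply was posted instead of drafted")
    if run.category == "deletion" and not run.routed_to_human:
        failures.append("account deletion was not routed to human review")
    if run.secret_exposed:
        failures.append("a secret was requested or exposed")
    if not run.evidence_logged:
        failures.append("no evidence logged for the proposed action")
    if run.latency_seconds > max_latency_seconds:
        failures.append("latency budget exceeded")
    if run.cost_usd > max_cost_usd:
        failures.append("cost budget exceeded")
    return failures
```

None of these checks inspect the quality of the drafted text. That is the point of the outcome contract: a fluent reply that was posted without approval still fails.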

Golden Tasks: What Must Stay True

The `golden_tasks.yaml` file is the regression boundary.

It captures the examples that must keep working after prompts, tools, models, policies, or runtimes change.

Golden tasks should cover:

happy paths
approval paths
blocked paths
missing-data paths
forbidden tool use
secret refusal
latency and cost budgets
required evidence
runtime parity when a workflow is migrating

Golden tasks make governance testable.

They also make runtime graduation honest. If a Dify workflow is working for operators, an SDK-backed path should not replace it merely because code feels more powerful. The new path should prove parity where parity matters and improvement where improvement is claimed.

If the SDK path improves traceability but loses non-engineer inspection, that is a governance tradeoff. If it improves routing but makes rollback unclear, that is not a graduation. If it improves latency and preserves approval behavior, that evidence belongs in the contract.

Golden tasks keep those claims from becoming taste.
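
A sketch of how golden tasks could be run against two runtimes follows. It assumes each runtime can be wrapped as a function from a requested action to a decision string; the task names, action names, and lookup-table runtimes are stand-ins for illustration, not real Dify or SDK integrations.

```python
from typing import Callable, NamedTuple


class GoldenTask(NamedTuple):
    name: str
    requested_action: str
    expected: str  # "allow", "needs_approval", or "block"


# A runtime is modeled as: requested action -> observed decision.
Runtime = Callable[[str], str]


def failed_tasks(tasks: list[GoldenTask], runtime: Runtime) -> list[str]:
    """Names of tasks whose observed decision differs from the expected one."""
    return [t.name for t in tasks if runtime(t.requested_action) != t.expected]


def parity_gaps(tasks: list[GoldenTask], current: Runtime,
                candidate: Runtime) -> list[str]:
    """Names of tasks where the two runtimes disagree with each other."""
    return [t.name for t in tasks
            if current(t.requested_action) != candidate(t.requested_action)]


def table_runtime(table: dict[str, str]) -> Runtime:
    """Stand-in runtime backed by a lookup table; unknown actions are blocked."""
    return lambda action: table.get(action, "block")


TASKS = [
    GoldenTask("happy path: draft a reply", "draft_reply", "allow"),
    GoldenTask("approval path: post a reply", "post_reply", "needs_approval"),
    GoldenTask("blocked path: delete an account", "delete_account", "block"),
]

current = table_runtime({"draft_reply": "allow", "post_reply": "needs_approval"})
# The candidate has silently dropped the approval pause on posting.
candidate = table_runtime({"draft_reply": "allow", "post_reply": "allow"})
```

Here `failed_tasks(TASKS, candidate)` names the approval-path task, which is the kind of evidence that turns "the new path feels better" into a specific regression a reviewer can act on.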

The Runbook: How Humans Stay In Control

The `runbook.md` file is the operating boundary.

It explains how humans run the workflow when everything is normal and what they do when the system stops.

A useful runbook includes:

setup steps
required secrets and where they live
smoke commands
approval workflow
incident response
rollback steps
owner contacts
review cadence
known failure modes
evidence locations

This matters because governed AI is not just a system behavior. It is an operating cadence.

Someone has to review blocked actions. Someone has to tune policy. Someone has to notice golden-task drift. Someone has to decide when a workflow deserves more autonomy or less.

The runbook keeps that human control explicit.

Portability Is The Point

The contract bundle is portable by design.

The same workflow may need to run through different surfaces:

Surface	Role
Codex	Primary setup, repository work, task execution, policy-aware implementation.
Pi	Installable skills, prompts, extensions, and quality gates for coding agents.
Claude Code or Cursor	Repo-local harnesses that can consume the same MCP and policy artifacts.
Dify	Client-facing visual workflow UX, publishing, MCP server cards, and operator inspection.
Cloudflare or repo-owned services	Durable state, queues, auth, tenant boundaries, recovery paths, and validation.
OpenAI Agents SDK	Code-owned orchestration when approval pauses, traces, evals, and CI-backed golden tasks are worth the burden.

Portability does not mean every runtime is equivalent.

It means the workflow's governing artifacts should survive movement across runtimes.

The MCP contract keeps capability portable. The agent contract keeps behavior portable. The outcome contract keeps success portable. The golden tasks keep proof portable. The runbook keeps operations portable.

Without those artifacts, changing surfaces becomes a rebuild. With them, changing surfaces becomes a governed migration.

What Proof Looks Like

The contract bundle should not remain a theory.

A serious implementation leaves behind proof in several forms:

template workflows that declare which tools are enabled and which writes require confirmation
inventory records that name the live agent surface, smoke cases, and MCP server cards
contract bundles for concrete scenarios such as deduplication, inbox triage, or fleet reliability
smoke runners that can execute those scenarios or at least verify connectivity
installable agent packages that carry skills, prompts, and quality gates into developer workspaces
policy evaluators that compile and test the same constraints outside a prompt

Those artifacts do not all need to be public. But they need to exist.

Without them, the team is trusting a claim. With them, the team can inspect the operating boundary:

this is the workflow
this is the tool surface
this is the approval rule
this is the test case
this is the runtime owner
this is the rollback path

The important shift is from "the agent is configured correctly" to "the workflow has inspectable evidence."

Dify-First Does Not Mean Dify-Only

For many client workflows, Dify is the right front door.

It gives teams visual editing, app publishing, Service API access, MCP server cards, and a surface non-engineers can inspect. Those are not minor conveniences. They are operator affordances.

Policy OS treats those affordances as part of the contract.

A workflow should graduate from a Dify-first path only when repo-owned runtime control is more valuable than platform-managed editing speed. That usually means the workflow needs one or more of these:

code-owned orchestration
explicit tool routing
approval pauses
durable state
traces
evals
CI-backed golden tasks
repeatable cost controls
custom recovery paths

Even then, the recommended pattern is usually not "replace Dify."

The better pattern is:

Keep Dify as the client-facing entry point.
Move only the expensive, risky, or orchestration-heavy step behind a service.
Run the same golden tasks against both paths.
Promote only when the new path passes the same golden tasks and measurably improves behavior, cost, latency, reliability, or operator visibility.

That is runtime graduation under policy, not platform switching by preference.
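
The promotion step can be sketched as a small gate. This is one possible reading, not the Policy OS rule: the function name and the +1/0/-1 scoring are hypothetical, and the gate treats any regression as blocking, which is stricter than the text requires. A team could instead send tradeoffs to a named reviewer.

```python
# The five dimensions named in the promotion step above.
DIMENSIONS = ("behavior", "cost", "latency", "reliability", "operator_visibility")


def should_promote(golden_failures: list[str], deltas: dict[str, int]) -> bool:
    """Decide whether the service-backed path may replace the current one.

    golden_failures: golden tasks the new path failed (empty means all passed).
    deltas: per-dimension change, +1 improved, 0 unchanged, -1 regressed.
            Dimensions left out are treated as unchanged.
    """
    unknown = set(deltas) - set(DIMENSIONS)
    if unknown:
        raise ValueError(f"unknown dimensions: {sorted(unknown)}")
    if golden_failures:
        return False
    scores = [deltas.get(name, 0) for name in DIMENSIONS]
    # Assumption: at least one improvement and no regression anywhere.
    return any(s > 0 for s in scores) and all(s >= 0 for s in scores)
```

For example, `should_promote([], {"latency": 1})` returns `True`, while a path that improves behavior but loses operator visibility is held back for a human decision.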

The Three-Tier View

The Policy OS contract bundle maps directly to the Three-Tier Framework of Database, Automation, and Judgment.

Database: what exists

The Database layer contains the state the workflow depends on:

business records
tool inventories
resource URIs
auth scopes
entitlement state
policy versions
contract versions
trace and evidence logs

The MCP contract lives closest to this layer because it defines what the workflow has available to work with.

Automation: what happens

The Automation layer contains execution:

MCP tool calls
Dify workflow steps
Cloudflare endpoints
queues
webhooks
SDK agent routing
smoke scripts
eval runs

The agent contract and golden tasks live close to this layer because they define and test allowed behavior.

Judgment: what should happen

The Judgment layer contains policy:

approval mode
escalation policy
blocked-state behavior
review cadence
operator decision rights
rollback criteria
graduation criteria

The outcome contract and runbook live close to this layer because they define why the workflow exists and how humans stay in control.

The bundle matters because it keeps those tiers from blending into one prompt.

A Practical Adoption Path

The first contract bundle should be narrow.

Do not start with an enterprise-wide agent governance program. Start with one workflow where the handoff is painful and the approval boundary is visible.

A good first candidate has:

a real operator
a repeatable input
a known manual fallback
a measurable output
at least one approval boundary
at least one safe read-only path
clear evidence of whether the workflow succeeded

Then write the bundle in this order:

Outcome contract: name the job, owner, success metric, and fallback.
MCP contract: define the tools, resources, prompts, auth scopes, and errors.
Agent contract: define allowed actions, blocked actions, approvals, guardrails, runtime surface, and graduation status.
Golden tasks: capture happy, approval, blocked, and failure examples.
Runbook: document operation, review, incident response, rollback, and evidence.

This sequence keeps the work grounded. The business outcome comes before tool exposure. Tool exposure comes before agent behavior. Agent behavior comes before regression testing. Regression testing comes before recurring operation.
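
The writing order above can be encoded so that tooling reminds a team what comes next. This is a small sketch with hypothetical helper names; it only tracks which files have been written, not whether they are good.

```python
from typing import Optional

# Order taken from the sequence above: outcome first, runbook last.
WRITE_ORDER = (
    "outcome_contract.md",
    "mcp_contract.yaml",
    "agent_contract.yaml",
    "golden_tasks.yaml",
    "runbook.md",
)


def next_artifact(written: set[str]) -> Optional[str]:
    """Return the first artifact in the order that has not been written yet."""
    for name in WRITE_ORDER:
        if name not in written:
            return name
    return None


def written_out_of_order(written: set[str]) -> list[str]:
    """Artifacts that exist even though an earlier artifact is still missing."""
    flagged = []
    gap_seen = False
    for name in WRITE_ORDER:
        if name not in written:
            gap_seen = True
        elif gap_seen:
            flagged.append(name)
    return flagged
```

A flagged artifact is not necessarily wrong. It is a prompt to check whether, say, an agent contract was drafted before anyone agreed on the tools it refers to.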

A Starter Bundle Example

Imagine a customer inbox triage workflow.

The weak version says: "Use AI to help with support."

The contract-bundle version says:

Artifact	Example content
`outcome_contract.md`	The workflow classifies inbound customer messages, drafts responses, and routes refund, deletion, legal, and security cases to named humans. It never sends customer-facing mail without approval.
`mcp_contract.yaml`	The workflow can read authorized inbox threads, read customer account status, search approved help content, draft a reply, and create an internal handoff note. It cannot issue refunds, delete accounts, or send email directly.
`agent_contract.yaml`	Low-risk classification is auto-allowed. Drafting is auto-allowed. Sending, refunding, deleting, exporting, and policy exceptions are approval-needed or blocked. Missing account context stops the workflow with a reason.
`golden_tasks.yaml`	Examples cover normal support, refund request, account deletion, missing customer record, suspicious attachment, angry but non-policy-breaking message, and secret exposure attempt.
`runbook.md`	The support owner reviews blocked cases daily, checks golden-task drift after prompt or tool changes, and falls back to manual triage if the inbox connector, account lookup, or approval queue fails.

That small bundle communicates more than a demo because it makes the operating boundary visible.

It tells an operator what will happen. It tells an engineer what to expose. It tells a reviewer what to test. It tells a buyer what evidence to request.
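
The agent-contract row of the starter bundle can be sketched as a decision function. The action names are hypothetical, and the table says the risky actions are "approval-needed or blocked" without assigning each one. This example assumes refunds and deletions are blocked because the MCP contract does not expose them, and that sending, exporting, and policy exceptions wait for approval.

```python
# Hypothetical action names for the inbox triage workflow.
AUTO_ALLOWED = {"classify", "draft_reply"}
APPROVAL_NEEDED = {"send_email", "export_data", "policy_exception"}  # assumed split
BLOCKED = {"issue_refund", "delete_account"}  # not exposed by the MCP contract


def triage_decision(action: str, has_account_context: bool) -> tuple[str, str]:
    """Return (decision, reason) for one requested action."""
    # Missing account context stops the workflow before any policy lookup.
    if not has_account_context:
        return ("stop", "missing account context")
    if action in BLOCKED:
        return ("block", "action is not exposed to this workflow")
    if action in APPROVAL_NEEDED:
        return ("needs_approval", "a named human must approve this action")
    if action in AUTO_ALLOWED:
        return ("allow", "low-risk action")
    # Assumption: anything the contract does not name is out of scope.
    return ("block", "action is outside the workflow")
```

Every branch returns a reason alongside the decision, so the blocked cases the support owner reviews each day carry their own explanation.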

What To Ask Vendors

If you are evaluating an AI workflow vendor, ask for the contract bundle.

Ask:

Show me the MCP contract or equivalent capability inventory.
Show me which tools are allowed, blocked, and approval-gated.
Show me the outcome contract for the first workflow.
Show me the golden tasks that prove behavior after a prompt, model, tool, or runtime change.
Show me the runbook for incidents and rollback.
Show me the current runtime surface and graduation status.
Show me the evidence that justifies any move from visual workflow tooling into custom runtime code.

These questions are concrete enough to separate productized governance from a demo.

A good answer may still be lightweight. The first bundle does not need to be large. It needs to be explicit.

A weak answer usually sounds like confidence without artifacts:

"The agent knows not to do that."
"We can add approvals later."
"The prompt handles it."
"The workflow can be moved to code when needed."
"The logs are somewhere in the platform."

Those answers may be true in a narrow demo. They are not enough for governed work.

Conclusion

The useful unit of governed AI work is not the model, the prompt, the connector, or the workflow canvas.

It is the contract bundle.

The bundle makes capability explicit, behavior inspectable, outcomes measurable, regressions repeatable, operations recoverable, and runtime migration governable.

That is why Policy OS starts with MCP but does not stop at MCP. Connectivity establishes trust boundaries. Skills and agents provide behavior. Contracts, golden tasks, runbooks, and review cadence turn that behavior into an operating system.

Before expanding autonomy, write the bundle.

Before graduating runtime, test the bundle.

Before trusting the agent, inspect the bundle.