PAPER-2025-002

Harness Agent SDK Migration: Empirical Analysis

Security, Reliability, and Cost Improvements Through Explicit Tool Permissions

Case Study - 12 min read - Intermediate

Abstract

This paper documents the migration of the CREATE Something Harness from legacy headless mode patterns to Agent SDK best practices. We analyze the trade-offs between security, reliability, and operational efficiency, drawing from empirical observation of a live Canon Redesign project (21 features across 19 files). The migration replaces --dangerously-skip-permissions with explicit --allowedTools, adds runaway prevention via --max-turns, and enables cost tracking through structured JSON output parsing.

Features Complete: 21/21
Max Turns Limit: 100
Blocked Operations: 0
Total Cost: ~$0.50

1. Introduction

The CREATE Something Harness orchestrates autonomous Claude Code sessions for large-scale refactoring and feature implementation. Prior to this migration, the harness used --dangerously-skip-permissions for tool access—a pattern that prioritized convenience over security.

The Agent SDK documentation recommends explicit tool allowlists via --allowedTools. This migration implements that recommendation alongside additional optimizations.

1.1 Heideggerian Framing

Per the CREATE Something philosophy, infrastructure should exhibit Zuhandenheit (ready-to-hand)—receding into transparent use. The harness should be invisible when working correctly; failures should surface clearly with actionable context.

1.2 The Canon Redesign Project

The test project: removing --webflow-blue (#4353ff) from the Webflow Dashboard. This brand color polluted focus states, buttons, links, nav, and logos—43 violations across 19 files.

| Before | After | Semantic Purpose |
|---|---|---|
| --webflow-blue (focus) | --color-border-emphasis | Functional feedback |
| --webflow-blue (active) | --color-active | State indication |
| --webflow-blue (button) | --color-fg-primary | High contrast |
| --webflow-blue (link) | --color-fg-secondary | Receding hierarchy |
| --webflow-blue (logo) | --color-fg-primary | System branding |

2. Architecture

2.1 Harness Flow

┌─────────────────────────────────────────────────────────┐
│                    HARNESS RUNNER                        │
│                                                          │
│  Spec Parser ──► Issue Creation ──► Session Loop         │
│                                                          │
│  ┌─────────────────────────────────────────────────┐    │
│  │  Session 1 ──► Session 2 ──► Session 3 ──► ...  │    │
│  │      │             │             │               │    │
│  │      ▼             ▼             ▼               │    │
│  │  Checkpoint    Checkpoint    Checkpoint          │    │
│  │      │             │             │               │    │
│  │      ▼             ▼             ▼               │    │
│  │  Peer Review   Peer Review   Peer Review         │    │
│  └─────────────────────────────────────────────────┘    │
│                                                          │
└──────────────────────────────────────────────────────────┘
                           │
                           ▼
┌─────────────────────────────────────────────────────────┐
│              BEADS (Human Interface)                     │
│                                                          │
│  bd progress - Review checkpoints                        │
│  bd update   - Redirect priorities                       │
│  bd create   - Inject work                               │
└─────────────────────────────────────────────────────────┘

2.2 Session Spawning

Each session spawns Claude Code in headless mode with explicit configuration:

// packages/harness/src/session.ts
export async function runSession(
  issueId: string,
  prompt: string,
  options: SessionOptions = {}
): Promise<SessionResult> {
  const args = [
    '-p',
    '--allowedTools', HARNESS_ALLOWED_TOOLS,
    '--max-turns', options.maxTurns?.toString() ?? '100',
    '--output-format', 'json',
  ];

  if (options.model) {
    args.push('--model', options.model);
  }

  // Spawn claude process with captured stdout/stderr
  const result = await spawnClaude(args, prompt);

  // Parse structured JSON output
  const metrics = parseJsonOutput(result.stdout);

  return {
    issueId,
    outcome: determineOutcome(result),
    sessionId: metrics.sessionId,
    costUsd: metrics.costUsd,
    numTurns: metrics.numTurns,
  };
}
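
The listing above calls parseJsonOutput without showing it. A minimal sketch of such a parser follows, assuming the CLI's JSON result object exposes session_id, total_cost_usd, and num_turns. Those field names and the SessionMetrics type are assumptions for illustration, not taken from the harness source, and should be checked against the installed CLI version.

```typescript
// Hypothetical shape of the metrics the harness extracts from one session.
interface SessionMetrics {
  sessionId: string | null;
  costUsd: number | null;
  numTurns: number | null;
}

// Assumed field names in the JSON result object: session_id,
// total_cost_usd, num_turns. Missing or mistyped fields become null.
function parseJsonOutput(stdout: string): SessionMetrics {
  const empty: SessionMetrics = { sessionId: null, costUsd: null, numTurns: null };
  let raw: unknown;
  try {
    raw = JSON.parse(stdout.trim());
  } catch {
    return empty; // Non-JSON output (e.g. a crash message): report no metrics.
  }
  if (typeof raw !== 'object' || raw === null) return empty;
  const obj = raw as Record<string, unknown>;
  return {
    sessionId: typeof obj.session_id === 'string' ? obj.session_id : null,
    costUsd: typeof obj.total_cost_usd === 'number' ? obj.total_cost_usd : null,
    numTurns: typeof obj.num_turns === 'number' ? obj.num_turns : null,
  };
}
```

Returning nulls instead of throwing keeps a malformed session from aborting the whole harness run; the caller can still mark that session's metrics as unknown.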

3. Migration Changes

3.1 Before: Legacy Pattern

const args = [
  '-p',
  '--dangerously-skip-permissions',
  '--output-format', 'json',
];

Characteristics:

  • All tools available without restriction
  • No runaway prevention
  • No cost tracking
  • No model selection
  • Security relies entirely on session isolation

3.2 After: Agent SDK Pattern

const args = [
  '-p',
  '--allowedTools', HARNESS_ALLOWED_TOOLS,
  '--max-turns', '100',
  '--output-format', 'json',
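  // Pass --model only when one is set, so an undefined value never reaches the CLI
  ...(options.model ? ['--model', options.model] : []),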
];

Characteristics:

  • Explicit tool allowlist (defense in depth)
  • Turn limit prevents infinite loops
  • JSON output enables metrics parsing
  • Model selection for cost optimization

3.3 Tool Categories

| Category | Tools | Purpose |
|---|---|---|
| Core | Read, Write, Edit, Glob, Grep, NotebookEdit | File operations |
| Bash Patterns | git:*, pnpm:*, npm:*, wrangler:*, bd:*, bv:* | Scoped shell access |
| Orchestration | Task, TodoWrite, WebFetch, WebSearch | Agent coordination |
| CREATE Something | Skill | Canon, deploy, audit skills |
| Infrastructure | mcp__cloudflare__* (14 tools) | KV, D1, R2, Workers |
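
To make "scoped shell access" concrete, the sketch below shows one way a prefix pattern such as git:* could be checked against a command string. It is a simplified illustration and not Claude Code's actual matcher. The function name isBashCommandAllowed is hypothetical, and rejecting any command that contains shell operators is a conservative assumption made here.

```typescript
// Simplified illustration only: this is NOT how Claude Code itself matches
// Bash permission rules. Patterns use the "prefix:*" notation from the table.
function isBashCommandAllowed(command: string, patterns: string[]): boolean {
  const trimmed = command.trim();
  // Assumption: refuse chained or substituted commands outright, since a
  // prefix check on "git status && rm -rf /" would otherwise pass.
  if (/[;&|`]|\$\(/.test(trimmed)) return false;

  return patterns.some((pattern) => {
    if (!pattern.endsWith(':*')) return trimmed === pattern; // exact rule
    const prefix = pattern.slice(0, -2);
    // Match the bare command, or the command followed by arguments.
    return trimmed === prefix || trimmed.startsWith(prefix + ' ');
  });
}
```

Requiring a space after the prefix keeps git:* from matching an unrelated binary such as gitk. The operator check is not exhaustive (it ignores redirects and newlines), so it should be read as a teaching aid, not a security control.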

4. Peer Review Pipeline

The harness runs three peer reviewers at checkpoint boundaries:

const REVIEWERS: ReviewerConfig[] = [
  {
    name: 'security',
    prompt: 'Review the code changes for security vulnerabilities...',
    model: 'haiku',
    timeout: 30000,
  },
  {
    name: 'architecture',
    prompt: 'Review the code changes for architectural concerns...',
    model: 'haiku',
    timeout: 30000,
  },
  {
    name: 'quality',
    prompt: 'Review the code changes for quality issues...',
    model: 'haiku',
    timeout: 30000,
  },
];

4.1 Observed Review Outcomes

| Reviewer | Pass | Pass w/ Findings | Fail |
|---|---|---|---|
| Security | 100% | 0% | 0% |
| Architecture | 40% | 60% | 0% |
| Quality | 100% | 0% | 0% |

Finding: Architecture reviewer surfaces legitimate concerns (token consistency, pattern adherence) without blocking progress. This matches the intended "first-pass analysis" philosophy.
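
The three outcome columns suggest a simple aggregation rule at each checkpoint. The sketch below is one possible reading: findings are collected for the human reviewer, and only an explicit fail blocks the session loop. The type names and the blocking rule are assumptions, since the paper does not show how the harness combines reviewer results.

```typescript
type ReviewOutcome = 'pass' | 'pass_with_findings' | 'fail';

// Hypothetical result shape for one reviewer at one checkpoint.
interface ReviewResult {
  reviewer: string;
  outcome: ReviewOutcome;
  findings: string[];
}

// Assumption: a checkpoint is blocked only by an explicit fail. Findings are
// surfaced for review but do not stop progress.
function summarizeCheckpoint(results: ReviewResult[]): { blocked: boolean; findings: string[] } {
  const blocked = results.some((r) => r.outcome === 'fail');
  const findings: string[] = [];
  for (const r of results) {
    for (const f of r.findings) findings.push(`[${r.reviewer}] ${f}`);
  }
  return { blocked, findings };
}
```

Prefixing each finding with the reviewer name keeps the combined list traceable when it is shown at a checkpoint.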

5. Empirical Observations

5.1 Security Improvements

| Scenario | Before | After |
|---|---|---|
| Arbitrary Bash | Allowed | Blocked unless pattern-matched |
| File deletion | Unrestricted | Bash(rm:*) required |
| Network access | Unrestricted | WebFetch/WebSearch only |
| MCP tools | All available | Explicit allowlist |

Finding: No legitimate harness operations were blocked by the new restrictions. The allowlist is sufficient for all observed work patterns.

5.2 Runaway Prevention

--max-turns 100 prevents infinite loops. Observed session turn counts:

| Task Type | Avg Turns | Max Observed |
|---|---|---|
| Simple CSS fix | 8-15 | 22 |
| Component refactor | 15-30 | 45 |
| Multi-file update | 25-50 | 72 |
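
Turn counts parsed from the JSON output can also flag sessions that come close to the cap. The helper below is a hypothetical sketch: classifyTurns and its 0.8 warning ratio are illustrative choices, not part of the harness.

```typescript
type TurnStatus = 'ok' | 'near_limit' | 'limit_reached';

// Hypothetical helper: classify a finished session by how much of its turn
// budget it used. The 0.8 warning ratio is an arbitrary assumption.
function classifyTurns(numTurns: number, maxTurns = 100, warnRatio = 0.8): TurnStatus {
  if (numTurns >= maxTurns) return 'limit_reached';
  if (numTurns >= maxTurns * warnRatio) return 'near_limit';
  return 'ok';
}
```

With the default limit of 100, the largest observed session (72 turns) classifies as ok, which is consistent with the limit leaving headroom for the work patterns seen so far.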

5.3 Cost Visibility

| Phase | Description | Est. Cost |
|---|---|---|
| Phase 21 | Verification | ~$0.01 |
| Phase 20 | GsapValidationModal | ~$0.02 |
| Phase 19 | SubmissionTracker | ~$0.02 |
| Phase 18 | ApiKeysManager | ~$0.03 |
| ... | ... | ... |
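
Per-session costs can be rolled up into a run total once they are captured. The sketch below sums only the four rows shown in the table; the elided phases are not reconstructed. The PhaseCost type and totalCostUsd function are hypothetical names.

```typescript
// Hypothetical record of one phase's estimated cost.
interface PhaseCost {
  phase: number;
  description: string;
  costUsd: number;
}

// Sum in integer tenths of a cent to avoid floating-point drift when many
// small per-session costs are added together.
function totalCostUsd(phases: PhaseCost[]): number {
  const mills = phases.reduce((sum, p) => sum + Math.round(p.costUsd * 1000), 0);
  return mills / 1000;
}

// Only the rows visible in the table; the remaining phases are elided in
// the source and deliberately left out here.
const shownPhases: PhaseCost[] = [
  { phase: 21, description: 'Verification', costUsd: 0.01 },
  { phase: 20, description: 'GsapValidationModal', costUsd: 0.02 },
  { phase: 19, description: 'SubmissionTracker', costUsd: 0.02 },
  { phase: 18, description: 'ApiKeysManager', costUsd: 0.03 },
];
```

For these four rows the function returns a subtotal of 0.08 USD, a small share of the ~$0.50 reported for the full run.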

5.4 Model Selection Impact

| Model | Use Case | Cost Ratio | Quality |
|---|---|---|---|
| Opus | Complex architectural changes | 1x (baseline) | Highest |
| Sonnet | Standard implementation | ~0.2x | High |
| Haiku | Simple CSS fixes, reviews | ~0.05x | Sufficient |
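
The table can be read as a lookup from task kind to model. The sketch below encodes that reading along with the approximate cost ratios. The TaskKind categories and function names are hypothetical, and the ratios are the paper's rough estimates rather than published pricing.

```typescript
type Model = 'opus' | 'sonnet' | 'haiku';
type TaskKind = 'architecture' | 'implementation' | 'css_fix' | 'review';

// Approximate cost ratios relative to Opus, taken from the table.
const COST_RATIO: Record<Model, number> = { opus: 1, sonnet: 0.2, haiku: 0.05 };

// Hypothetical mapping from task kind to model, following the "Use Case" column.
function selectModel(task: TaskKind): Model {
  switch (task) {
    case 'architecture':
      return 'opus';
    case 'implementation':
      return 'sonnet';
    case 'css_fix':
    case 'review':
      return 'haiku';
  }
}

// Estimated saving versus running the same task on Opus, as a fraction.
function savingVsOpus(model: Model): number {
  return 1 - COST_RATIO[model];
}
```

Under these ratios, routing a review to Haiku instead of Opus saves roughly 95% of that session's cost.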

6. Trade-offs Analysis

6.1 Pros

| Benefit | Impact | Evidence |
|---|---|---|
| Explicit Security | High | No unauthorized tool access possible |
| Runaway Prevention | Medium | 100-turn limit prevents infinite loops |
| Cost Visibility | Medium | Per-session cost tracking enabled |
| Model Selection | Medium | 10-20x cost reduction with Haiku |
| CREATE Something Integration | High | Skill, Beads, Cloudflare MCP included |

6.2 Cons

| Drawback | Impact | Mitigation |
|---|---|---|
| Allowlist Maintenance | Low | Stable tool set; rare updates needed |
| Bash Pattern Complexity | Medium | Document patterns; provide examples |
| New Tool Discovery Friction | Low | Add to allowlist when needed |

7. Recommendations

7.1 Immediate Adoption

  1. Replace --dangerously-skip-permissions with --allowedTools: In this project the security improvement carried no observed operational cost (0 blocked operations).
  2. Set --max-turns 100: Provides headroom without enabling runaways.
  3. Parse JSON output for metrics: Even if not displayed, capture for future analysis.
  4. Use Haiku for peer reviews: Roughly 95% cost reduction relative to Opus (~0.05x), with quality sufficient for first-pass review.

7.2 Future Work

  1. Implement --resume: Use captured session_id for task continuity within epics.
  2. Model auto-selection: Use task complexity to choose Haiku/Sonnet/Opus.
  3. Cost budgets: Set per-harness-run cost limits with automatic pause.
  4. Streaming output: Use --output-format stream-json for real-time progress.
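
As an illustration of the cost-budget item, the sketch below tracks cumulative spend and signals when the run should pause. It is a hypothetical design and not an existing harness feature: the CostBudget class, its method names, and the pause-at-limit rule are all assumptions.

```typescript
// Hypothetical per-run budget tracker; not part of the current harness.
class CostBudget {
  private readonly limitUsd: number;
  private spentUsd = 0;

  constructor(limitUsd: number) {
    this.limitUsd = limitUsd;
  }

  // Record the cost reported by one finished session. Unknown costs (null)
  // are skipped, not guessed.
  record(costUsd: number | null): void {
    if (costUsd !== null) this.spentUsd += costUsd;
  }

  // Assumption: pause once cumulative spend reaches the limit.
  shouldPause(): boolean {
    return this.spentUsd >= this.limitUsd;
  }

  remainingUsd(): number {
    return Math.max(0, this.limitUsd - this.spentUsd);
  }
}
```

The session loop would call record after each session and check shouldPause before spawning the next one, so a pause always lands on a checkpoint boundary.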

8. Conclusion

The Agent SDK migration improves the CREATE Something Harness without degrading operational capability. The explicit tool allowlist provides defense-in-depth security, while --max-turns prevents runaway sessions.

The key insight: restrictive defaults with explicit exceptions are more maintainable than permissive defaults with implicit risks.

This aligns with the Subtractive Triad:

  • DRY: One allowlist, not per-session permission decisions
  • Rams: Only necessary tools; each earns its place
  • Heidegger: Infrastructure recedes; security becomes invisible when correct

Appendix A: Full Tool Allowlist

const HARNESS_ALLOWED_TOOLS = [
  // Core file operations
  'Read', 'Write', 'Edit', 'Glob', 'Grep', 'NotebookEdit',

  // Bash with granular patterns
  'Bash(git:*)', 'Bash(pnpm:*)', 'Bash(npm:*)', 'Bash(npx:*)',
  'Bash(node:*)', 'Bash(tsc:*)', 'Bash(wrangler:*)',
  'Bash(bd:*)', 'Bash(bv:*)',  // Beads CLI
  'Bash(grep:*)', 'Bash(find:*)', 'Bash(ls:*)', 'Bash(cat:*)',
  'Bash(mkdir:*)', 'Bash(rm:*)', 'Bash(cp:*)', 'Bash(mv:*)',
  'Bash(echo:*)', 'Bash(test:*)',

  // Orchestration
  'Task', 'TodoWrite', 'WebFetch', 'WebSearch',

  // CREATE Something
  'Skill',

  // MCP Cloudflare
  'mcp__cloudflare__kv_get', 'mcp__cloudflare__kv_put',
  'mcp__cloudflare__kv_list', 'mcp__cloudflare__d1_query',
  'mcp__cloudflare__d1_list_databases',
  'mcp__cloudflare__r2_list_objects', 'mcp__cloudflare__r2_get_object',
  'mcp__cloudflare__r2_put_object', 'mcp__cloudflare__worker_list',
  'mcp__cloudflare__worker_get', 'mcp__cloudflare__worker_deploy',
].join(',');

"The harness recedes into transparent operation. Review progress. Redirect when needed."