PAPER-2026-003

Haiku + Ultrathink Validation

Can extended thinking mode close the gap between Haiku and Sonnet? An empirical test of the community claim that Haiku 4.5 + ultrathink achieves ~90% of Sonnet's performance at 10% of the cost.

Experiment · Proposed · Technical

📋 Experiment Status

This paper documents a proposed experiment to validate community claims about Haiku 4.5 + ultrathink performance. Results will be added as the experiment progresses.

Expected completion: January 2026

Abstract

Community reports suggest that Haiku 4.5 with extended thinking ("ultrathink") achieves approximately 90% of Sonnet 4.5's performance on planning and refactoring tasks while keeping Haiku's roughly 10x cost advantage. If true, this unlocks a useful middle tier for model routing: reserve Sonnet for multi-file coordination and Opus for architecture, and use Haiku + ultrathink for the majority of development work. This experiment tests the claim empirically across 10 tasks ranging from trivial to complex, measuring quality match, cost savings, and time relative to Sonnet. Success criteria: ≥85% quality match, ≥85% cost savings, and completion time ≤150% of Sonnet's.

The Community Claim

  • Haiku 4.5 + ultrathink achieves ~90% of Sonnet's performance
  • At 10% of the cost (Haiku: ~$0.001 per task, Sonnet: ~$0.01 per task)
  • With 4-5x faster execution

I. The Hypothesis

Extended thinking mode (ultrathink) fundamentally changes Haiku's capabilities. Where standard Haiku excels at pattern matching and simple execution, ultrathink enables:

  • Multi-step reasoning
  • Complex planning
  • Debugging loops
  • Architecture decisions

If validated, this means most development work can use Haiku + ultrathink, reserving Sonnet only for multi-file coordination and Opus for deep architecture review.

Three Hypotheses

H1: Quality

Haiku + ultrathink achieves ≥85% quality match vs Sonnet on planning/refactoring tasks

H2: Cost

Haiku + ultrathink costs ≤15% of Sonnet for equivalent work (≥85% savings)

H3: Speed

Haiku + ultrathink completes in ≤150% of Sonnet's time (acceptable if quality holds)

II. Test Design

Task Selection

Ten tasks ranging from trivial to complex:

| ID  | Type         | Complexity | Description                              |
|-----|--------------|------------|------------------------------------------|
| T1  | Refactor     | Trivial    | Extract duplicate validation logic       |
| T2  | Feature      | Simple     | Add pagination to existing list          |
| T3  | Bug fix      | Simple     | Fix TypeScript type errors               |
| T4  | Refactor     | Standard   | Restructure auth module (DRY violations) |
| T5  | Feature      | Standard   | Add caching layer to API                 |
| T6  | Planning     | Standard   | Design database migration strategy       |
| T7  | Debug        | Standard   | Fix intermittent test failures           |
| T8  | Refactor     | Complex    | Extract shared business logic (5 files)  |
| T9  | Feature      | Complex    | Implement OAuth flow with PKCE           |
| T10 | Architecture | Complex    | Design multi-tenant routing strategy     |

Execution Pattern

Each task is executed twice, following an identical workflow:

Run 1: Haiku + Ultrathink

  1. Explore: Read relevant files
  2. Plan: "ultrathink. Propose plan. Don't code yet."
  3. Review: Human approves plan
  4. Code: "Implement the plan"
  5. Verify: Tests pass, acceptance met

Run 2: Sonnet Baseline

  1. Explore: Same files
  2. Plan: Same planning (no ultrathink)
  3. Review: Same approval
  4. Code: Same implementation
  5. Verify: Same verification
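The two runs are meant to differ only in the model and in the ultrathink keyword on the planning prompt. A minimal sketch of that control, assuming the step names and prompt strings quoted above; the function name `build_steps` is hypothetical and not part of any harness:

```python
def build_steps(use_ultrathink: bool) -> list[tuple[str, str]]:
    """Return the five workflow steps as (step name, instruction) pairs.

    Hypothetical helper: the only thing that varies between the two runs
    is whether the planning prompt is prefixed with "ultrathink.".
    """
    plan_prompt = "Propose plan. Don't code yet."
    if use_ultrathink:
        plan_prompt = "ultrathink. " + plan_prompt
    return [
        ("explore", "Read relevant files"),
        ("plan", plan_prompt),
        ("review", "Human approves plan"),
        ("code", "Implement the plan"),
        ("verify", "Tests pass, acceptance met"),
    ]


haiku_steps = build_steps(use_ultrathink=True)    # Run 1
sonnet_steps = build_steps(use_ultrathink=False)  # Run 2
```

Generating both step lists from one function makes it harder for the two workflows to drift apart over the course of the experiment.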

Metrics Tracked

For each execution:

  • Quality: Tests pass, acceptance met, plan quality rating
  • Cost: API cost, tokens used
  • Time: Exploration, planning, coding, total
  • Outcome: Success/failure, revision count, notes
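One way to keep these metrics comparable across runs is a single record type per execution. The sketch below is illustrative only: the field names, the 1-5 plan quality scale, and the rule that "success" means tests pass and acceptance is met are all assumptions, not definitions from the experiment.

```python
from dataclasses import dataclass


@dataclass
class RunRecord:
    """One execution of one task on one model (hypothetical schema)."""
    task_id: str            # e.g. "T4"
    model: str              # e.g. "haiku+ultrathink" or "sonnet"
    # Quality
    tests_pass: bool
    acceptance_met: bool
    plan_quality: int       # assumed 1-5 human rating
    # Cost
    api_cost_usd: float
    tokens_used: int
    # Time, in seconds per phase
    explore_s: float
    plan_s: float
    code_s: float
    # Outcome
    revision_count: int = 0
    notes: str = ""

    @property
    def total_s(self) -> float:
        """Total wall-clock time across the three timed phases."""
        return self.explore_s + self.plan_s + self.code_s

    @property
    def success(self) -> bool:
        """Assumed definition: tests pass and acceptance criteria are met."""
        return self.tests_pass and self.acceptance_met


# Example values are made up for illustration, not experimental results.
example = RunRecord(
    task_id="T1", model="haiku+ultrathink",
    tests_pass=True, acceptance_met=True, plan_quality=4,
    api_cost_usd=0.001, tokens_used=5000,
    explore_s=60.0, plan_s=120.0, code_s=300.0,
)
```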

III. Success Criteria

Three validation tiers:

Minimum Viable

  • 7/10 tasks succeed
  • ≥85% cost savings
  • All successes pass tests

Strong Validation

  • 9/10 tasks succeed
  • ≥90% cost savings
  • Time <120% of Sonnet's

Conclusive Proof

  • 10/10 tasks succeed
  • ≥90% cost savings
  • Time ≤100% (same/faster)

Pass threshold: Minimum viable. Strong validation confirms routing strategy. Conclusive proof enables aggressive adoption.
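The tiers can be expressed as a small classifier. This is a sketch under stated assumptions: time is measured as a ratio to Sonnet's total time (1.20 means 120%), the "all successes pass tests" requirement is carried into the higher tiers, and the minimum tier has no time bound because none is listed. The function name `validation_tier` is hypothetical.

```python
from typing import Optional


def validation_tier(
    successes: int,
    total: int,
    cost_savings: float,       # fraction, e.g. 0.90 for 90% savings
    time_ratio: float,         # Haiku time / Sonnet time, e.g. 1.20 for 120%
    all_successes_pass_tests: bool,
) -> Optional[str]:
    """Return the highest tier met, or None if below minimum viable."""
    # Assumption: every tier requires that successful tasks pass their tests.
    if not all_successes_pass_tests:
        return None
    # Integer comparisons avoid floating-point edge cases at 7/10 and 9/10.
    if successes == total and cost_savings >= 0.90 and time_ratio <= 1.00:
        return "conclusive"
    if successes * 10 >= total * 9 and cost_savings >= 0.90 and time_ratio < 1.20:
        return "strong"
    if successes * 10 >= total * 7 and cost_savings >= 0.85:
        return "minimum"
    return None
```

For example, 10 of 10 successes with 92% savings but a time ratio of 1.1 would land in the strong tier rather than the conclusive one, because the conclusive tier requires parity or better on time.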

IV. Expected Impact

If hypotheses validate, this unlocks a four-tier model routing strategy:

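| Complexity      | Model              | Cost    | Use Case                         |
|-----------------|--------------------|---------|----------------------------------|
| Trivial         | Haiku              | ~$0.001 | Pattern matching, simple edits   |
| Simple-Standard | Haiku + ultrathink | ~$0.001 | Planning, refactoring, debugging |
| Standard        | Sonnet             | ~$0.01  | Multi-file coordination          |
| Complex         | Opus               | ~$0.10  | Architecture, security review    |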
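Read as a decision rule, the table might look like the sketch below. The `route` function, the string labels, and the `multi_file` flag are hypothetical; in particular, sending any simple or standard task that needs multi-file coordination to Sonnet is an assumption drawn from the "Use Case" column, not a rule stated by the experiment.

```python
def route(complexity: str, multi_file: bool = False) -> str:
    """Pick a model tier for a task (illustrative routing only).

    complexity: one of "trivial", "simple", "standard", "complex".
    multi_file: True if the task requires multi-file coordination.
    """
    if complexity == "trivial":
        return "haiku"
    if complexity == "complex":
        return "opus"
    if complexity in ("simple", "standard"):
        # Assumption: multi-file coordination is what separates the
        # Sonnet tier from the Haiku + ultrathink tier.
        return "sonnet" if multi_file else "haiku+ultrathink"
    raise ValueError(f"unknown complexity: {complexity!r}")
```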

Cost Projections

Before (current documented routing):

  • 10 trivial → Haiku: $0.01
  • 70 standard → Sonnet: $0.70
  • 20 complex → Opus: $2.00
  • Total: $2.71/100 tasks

After (with Haiku + ultrathink tier):

  • 10 trivial → Haiku: $0.01
  • 60 simple-standard → Haiku + ultrathink: $0.06
  • 10 multi-file → Sonnet: $0.10
  • 20 complex → Opus: $2.00
  • Total: $2.17/100 tasks (~20% savings)
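The arithmetic behind both totals can be reproduced directly from the per-task estimates. The sketch below uses the approximate per-task costs from the routing table; the dictionary and function names are illustrative.

```python
# Approximate per-task costs in USD, taken from the routing table.
COST_PER_TASK = {
    "haiku": 0.001,
    "haiku+ultrathink": 0.001,
    "sonnet": 0.01,
    "opus": 0.10,
}


def projected_cost(task_mix: dict[str, int]) -> float:
    """Total cost for a mix of {model: number of tasks}."""
    return sum(COST_PER_TASK[model] * n for model, n in task_mix.items())


before = projected_cost({"haiku": 10, "sonnet": 70, "opus": 20})
after = projected_cost(
    {"haiku": 10, "haiku+ultrathink": 60, "sonnet": 10, "opus": 20}
)
savings = 1 - after / before

print(f"before: ${before:.2f}")    # before: $2.71
print(f"after:  ${after:.2f}")     # after:  $2.17
print(f"savings: {savings:.1%}")   # savings: 19.9%
```

Opus accounts for $2.00 of the total in both scenarios, which is why moving 60 tasks to a tier that is 10x cheaper reduces the overall bill by only about 20%.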

More importantly, if the claimed 4-5x speedup holds, this enables faster execution on 60% of tasks while maintaining quality.

V. Implementation Plan

Phase 1: Task Preparation (30 min)

  • Create 10 Beads issues
  • Label with complexity tiers
  • Define acceptance criteria
  • Prepare test environments

Phase 2: Haiku + Ultrathink Execution (3-4 hours)

  • Execute all 10 tasks with Haiku + ultrathink
  • Track metrics for each
  • Document plan quality
  • Note failures and issues

Phase 3: Sonnet Baseline Execution (2-3 hours)

  • Reset to pre-task state
  • Execute same 10 tasks with Sonnet
  • Track same metrics
  • Compare outcomes

Phase 4: Analysis (1 hour)

  • Calculate quality match percentage
  • Calculate cost savings
  • Calculate time overhead
  • Identify patterns (where Haiku wins and where it struggles)
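The three calculations can be sketched over paired runs as follows. The experiment does not define "quality match" precisely, so this sketch assumes one possible definition: the fraction of tasks Sonnet completed successfully that Haiku + ultrathink also completed successfully. The `Run` type and `analyze` function are hypothetical.

```python
from typing import NamedTuple


class Run(NamedTuple):
    success: bool
    cost_usd: float
    seconds: float


def analyze(pairs: dict[str, tuple[Run, Run]]) -> dict[str, object]:
    """Compare paired runs, keyed by task ID as (haiku_run, sonnet_run)."""
    haiku = [h for h, _ in pairs.values()]
    sonnet = [s for _, s in pairs.values()]

    # Assumed definition of quality match (see lead-in).
    sonnet_ok = [tid for tid, (_, s) in pairs.items() if s.success]
    both_ok = [tid for tid in sonnet_ok if pairs[tid][0].success]
    quality_match = len(both_ok) / len(sonnet_ok) if sonnet_ok else 0.0

    cost_ratio = sum(r.cost_usd for r in haiku) / sum(r.cost_usd for r in sonnet)
    time_ratio = sum(r.seconds for r in haiku) / sum(r.seconds for r in sonnet)

    return {
        "quality_match": quality_match,
        "cost_savings": 1 - cost_ratio,
        "time_ratio": time_ratio,
        "H1": quality_match >= 0.85,   # quality match of at least 85%
        "H2": cost_ratio <= 0.15,      # cost at most 15% of Sonnet's
        "H3": time_ratio <= 1.50,      # time at most 150% of Sonnet's
        "haiku_failures": [tid for tid, (h, _) in pairs.items() if not h.success],
    }


# Made-up numbers for illustration only, not experimental results.
demo = analyze({
    "T1": (Run(True, 0.001, 90.0), Run(True, 0.01, 60.0)),
    "T2": (Run(False, 0.001, 120.0), Run(True, 0.01, 80.0)),
})
```

The `haiku_failures` list is a starting point for the pattern analysis: grouping those task IDs by type and complexity shows where Haiku + ultrathink struggles.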

VI. Results

⏳ Experiment Pending

Results will be added here as experiment phases complete. Expected completion: January 2026.

VII. Implications

If Validated

Haiku + ultrathink becomes the default for most development work:

  • Harness integration: Add ultrathink flag to task execution routing
  • Gastown workers: Default to Haiku + ultrathink for convoy work
  • Ralph loops: Use Haiku + ultrathink for iterative refinement, escalate to Sonnet if stuck
  • Documentation: Update model-routing-optimization.md with validated tier

This changes the economics: in the projection above, 60% of tasks move from ~$0.01 to ~$0.001 per task, reducing total costs by ~20% while maintaining quality.

If Not Validated

Document where it fails and why:

  • Which complexity tiers succeed vs fail
  • Whether time overhead is too high
  • Whether plan quality is poor despite code working
  • Specific failure modes (debugging loops, multi-file reasoning, etc.)

Update routing recommendations with failure patterns documented.

The Broader Pattern

This experiment continues the Subtractive Triad at the meta-level: we're removing cost without sacrificing quality. If extended thinking closes the Haiku-Sonnet gap, we've found a leverage point (Meadows) for cost optimization.

The philosophy recedes; only the work remains. You think about the task, not which model to use.
