PAPER-2026-003

Haiku + Ultrathink Validation

Can extended thinking mode close the gap between Haiku and Sonnet? An empirical test of the community claim that Haiku 4.5 + ultrathink achieves ~90% of Sonnet's performance at 10% of the cost.

Experiment • Proposed • Technical

📋 Experiment Status

This paper documents a proposed experiment to validate community claims about Haiku 4.5 + ultrathink performance. Results will be added as the experiment progresses.

Expected completion: January 2026

Abstract

Community reports suggest Haiku 4.5 with extended thinking ("ultrathink") achieves approximately 90% of Sonnet 4.5's performance on planning and refactoring tasks while maintaining Haiku's 10x cost advantage. If true, this unlocks a powerful middle tier for model routing: reserve Sonnet for multi-file coordination and Opus for architecture, but use Haiku + ultrathink for the majority of development work. This experiment tests the claim empirically across 10 tasks ranging from trivial to complex, measuring quality match, cost savings, and time overhead. Success criteria: ≥85% quality match, ≥85% cost savings, and completion in ≤150% of Sonnet's time.

The Community Claim

Haiku 4.5 + ultrathink achieves ~90% of Sonnet's performance
At 10% of the cost (Haiku: ~$0.001 per task, Sonnet: ~$0.01 per task)
4-5x faster execution

I. The Hypothesis

Extended thinking mode (ultrathink) fundamentally changes Haiku's capabilities. Where standard Haiku excels at pattern matching and simple execution, ultrathink enables:

Multi-step reasoning
Complex planning
Debugging loops
Architecture decisions

If validated, this means most development work can use Haiku + ultrathink, reserving Sonnet only for multi-file coordination and Opus for deep architecture review.

Three Hypotheses

H1: Quality

Haiku + ultrathink achieves ≥85% quality match vs Sonnet on planning/refactoring tasks

H2: Cost

Haiku + ultrathink costs ≤15% of Sonnet for equivalent work (≥85% savings)

H3: Speed

Haiku + ultrathink completes in ≤150% of Sonnet's time (acceptable if quality holds)
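
The three thresholds can be expressed as one check. This is a minimal sketch, assuming the aggregate figures are stated as ratios relative to the Sonnet baseline; the Aggregate type and its field names are hypothetical and not part of any existing harness.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Aggregate:
    """Hypothetical container for experiment-wide figures."""
    quality_match: float  # fraction of Sonnet's quality achieved, 0.0 to 1.0
    cost_ratio: float     # Haiku + ultrathink cost divided by Sonnet cost
    time_ratio: float     # Haiku + ultrathink time divided by Sonnet time


def check_hypotheses(agg: Aggregate) -> dict[str, bool]:
    """Return pass/fail for H1 (quality), H2 (cost), and H3 (speed)."""
    return {
        "H1": agg.quality_match >= 0.85,  # at least 85% quality match
        "H2": agg.cost_ratio <= 0.15,     # at most 15% of Sonnet's cost
        "H3": agg.time_ratio <= 1.50,     # at most 150% of Sonnet's time
    }
```

Keeping the three results separate, rather than returning a single boolean, makes it visible which hypothesis failed when the overall claim does not hold.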

II. Test Design

Task Selection

Ten tasks ranging from trivial to complex:

ID	Type	Complexity	Description
T1	Refactor	Trivial	Extract duplicate validation logic
T2	Feature	Simple	Add pagination to existing list
T3	Bug fix	Simple	Fix TypeScript type errors
T4	Refactor	Standard	Restructure auth module (DRY violations)
T5	Feature	Standard	Add caching layer to API
T6	Planning	Standard	Design database migration strategy
T7	Debug	Standard	Fix intermittent test failures
T8	Refactor	Complex	Extract shared business logic (5 files)
T9	Feature	Complex	Implement OAuth flow with PKCE
T10	Architecture	Complex	Design multi-tenant routing strategy

Execution Pattern

Each task is executed twice, following an identical workflow:

Run 1: Haiku + Ultrathink

Explore: Read relevant files
Plan: "ultrathink. Propose plan. Don't code yet."
Review: Human approves plan
Code: "Implement the plan"
Verify: Tests pass, acceptance met

Run 2: Sonnet Baseline

Explore: Same files
Plan: Same planning (no ultrathink)
Review: Same approval
Code: Same implementation
Verify: Same verification
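
The five-step workflow above could be driven by a small function shared by both runs, so that the only difference is the ultrathink prefix on the planning prompt. This is a sketch under stated assumptions: the ask, approve, and verify callables are hypothetical stand-ins for the model call, the human review, and the test run, and the wording of the explore prompt is invented for illustration. Only the plan and code prompts come from the workflow above.

```python
from typing import Callable, Sequence

PLAN_PROMPT = "Propose plan. Don't code yet."


def run_task(
    model: str,
    files: Sequence[str],
    ask: Callable[[str, str], str],   # hypothetical: (model, prompt) -> reply
    approve: Callable[[str], bool],   # hypothetical: human review of the plan
    verify: Callable[[], bool],       # hypothetical: tests pass, acceptance met
    use_ultrathink: bool,
) -> bool:
    """Run one task through explore, plan, review, code, verify."""
    # Explore: the prompt wording here is an assumption.
    ask(model, "Read these files: " + ", ".join(files))
    # Plan: identical prompt for both runs, prefixed only for Haiku.
    prefix = "ultrathink. " if use_ultrathink else ""
    plan = ask(model, prefix + PLAN_PROMPT)
    # Review: stop before coding if the plan is rejected.
    if not approve(plan):
        return False
    # Code, then verify.
    ask(model, "Implement the plan")
    return verify()
```

Passing the model call in as a parameter keeps the sketch independent of any particular client library and makes the two runs directly comparable.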

Metrics Tracked

For each execution:

Quality: Tests pass, acceptance met, plan quality rating
Cost: API cost, tokens used
Time: Exploration, planning, coding, total
Outcome: Success/failure, revision count, notes
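
One way to keep these metrics consistent across both runs is a single record type per execution. The sketch below is illustrative only: the field names, the 1-5 plan quality scale, and the definition of success as "tests pass and acceptance met" are assumptions, not a documented schema.

```python
from dataclasses import dataclass


@dataclass
class RunRecord:
    """Hypothetical per-execution record covering the four metric groups."""
    task_id: str          # e.g. "T4"
    model: str            # label such as "haiku+ultrathink" or "sonnet"
    # Quality
    tests_pass: bool
    acceptance_met: bool
    plan_quality: int     # assumed 1-5 rating scale
    # Cost
    cost_usd: float
    tokens_used: int
    # Time, in seconds per phase
    explore_s: float
    plan_s: float
    code_s: float
    # Outcome
    revision_count: int = 0
    notes: str = ""

    @property
    def total_s(self) -> float:
        """Total time is the sum of the three timed phases."""
        return self.explore_s + self.plan_s + self.code_s

    @property
    def success(self) -> bool:
        """Assumed definition: tests pass and acceptance criteria are met."""
        return self.tests_pass and self.acceptance_met
```

Deriving total time and success from the stored fields avoids recording them separately and having the two drift apart.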

III. Success Criteria

Three validation tiers:

Minimum Viable

7/10 tasks succeed
≥85% cost savings
All successes pass tests

Strong Validation

9/10 tasks succeed
≥90% cost savings
Time <120% of Sonnet's

Conclusive Proof

10/10 tasks succeed
≥90% cost savings
Time ≤100% (same/faster)

Pass threshold: Minimum viable. Strong validation confirms routing strategy. Conclusive proof enables aggressive adoption.
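
The tiers can be evaluated mechanically once the experiment totals are known. This is a minimal sketch with three assumptions labeled in the code: "N/10 tasks succeed" is read as "at least N", time is expressed as a ratio to Sonnet's time, and the "all successes pass tests" condition is applied to every tier. The function name and parameters are hypothetical.

```python
def validation_tier(
    successes: int,
    cost_savings: float,            # fraction, e.g. 0.90 for 90% savings
    time_ratio: float,              # Haiku + ultrathink time / Sonnet time
    all_successes_pass_tests: bool,
) -> str:
    """Return the highest validation tier reached, checking strictest first."""
    # Assumption: passing tests is a precondition for every tier.
    if not all_successes_pass_tests:
        return "not validated"
    # Assumption: "N/10 tasks succeed" means at least N.
    if successes >= 10 and cost_savings >= 0.90 and time_ratio <= 1.00:
        return "conclusive"
    if successes >= 9 and cost_savings >= 0.90 and time_ratio < 1.20:
        return "strong"
    # Minimum viable lists no time condition.
    if successes >= 7 and cost_savings >= 0.85:
        return "minimum"
    return "not validated"
```

Checking the strictest tier first means a result that satisfies several tiers is always reported at the highest one.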

IV. Expected Impact

If hypotheses validate, this unlocks a four-tier model routing strategy:

Complexity	Model	Cost	Use Case
Trivial	Haiku	~$0.001	Pattern matching, simple edits
Simple-Standard	Haiku + ultrathink	~$0.001	Planning, refactoring, debugging
Standard	Sonnet	~$0.01	Multi-file coordination
Complex	Opus	~$0.10	Architecture, security review
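
The table above could be turned into a routing function along the following lines. This is a sketch, not the harness's actual routing code: the model strings are plain labels rather than API model identifiers, and the multi_file flag is an assumed way to separate the two rows that both cover standard complexity.

```python
def route(complexity: str, multi_file: bool = False) -> tuple[str, float]:
    """Map a task to (model label, approximate cost in USD per task)."""
    level = complexity.lower()
    if level == "trivial":
        return ("haiku", 0.001)
    if level == "complex":
        return ("opus", 0.10)
    if level in ("simple", "standard"):
        # Assumption: standard tasks needing multi-file coordination go to Sonnet.
        if level == "standard" and multi_file:
            return ("sonnet", 0.01)
        return ("haiku+ultrathink", 0.001)
    raise ValueError(f"unknown complexity tier: {complexity!r}")
```

For example, route("Standard", multi_file=True) selects the Sonnet row, while route("Standard") falls through to Haiku + ultrathink.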

Cost Projections

Before (current documented routing):

10 trivial → Haiku: $0.01
70 standard → Sonnet: $0.70
20 complex → Opus: $2.00
Total: $2.71/100 tasks

After (with Haiku + ultrathink tier):

10 trivial → Haiku: $0.01
60 simple-standard → Haiku + ultrathink: $0.06
10 multi-file → Sonnet: $0.10
20 complex → Opus: $2.00
Total: $2.17/100 tasks (~20% savings)

More importantly, if the claimed 4-5x speedup holds, this enables faster execution on 60% of tasks while maintaining quality.
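
The projection arithmetic above can be reproduced in a few lines. The per-task prices and task mixes are taken from the lists above; the dictionary keys and the total_cost helper are hypothetical names, and Decimal is used only to avoid floating-point rounding in the totals.

```python
from decimal import Decimal

# Approximate per-task prices from the routing table, in USD.
PRICE = {
    "haiku": Decimal("0.001"),
    "haiku+ultrathink": Decimal("0.001"),
    "sonnet": Decimal("0.01"),
    "opus": Decimal("0.10"),
}


def total_cost(mix: dict[str, int]) -> Decimal:
    """Sum price times task count over a {model: task count} mix."""
    return sum((PRICE[model] * n for model, n in mix.items()), Decimal("0"))


# Task mixes per 100 tasks, before and after adding the new tier.
before = total_cost({"haiku": 10, "sonnet": 70, "opus": 20})
after = total_cost({"haiku": 10, "haiku+ultrathink": 60, "sonnet": 10, "opus": 20})
savings = (before - after) / before

print(f"before=${before} after=${after} savings={savings:.1%}")
```

This reproduces the $2.71 and $2.17 totals and a saving of roughly 20%. It also shows that the Opus share ($2.00 in both cases) dominates the total, which is why moving 60 tasks to the cheaper tier changes the bill by only about a fifth.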

V. Implementation Plan

Phase 1: Task Preparation (30 min)

Create 10 Beads issues
Label with complexity tiers
Define acceptance criteria
Prepare test environments

Phase 2: Haiku + Ultrathink Execution (3-4 hours)

Execute all 10 tasks with Haiku + ultrathink
Track metrics for each
Document plan quality
Note failures and issues

Phase 3: Sonnet Baseline Execution (2-3 hours)

Reset to pre-task state
Execute same 10 tasks with Sonnet
Track same metrics
Compare outcomes

Phase 4: Analysis (1 hour)

Calculate quality match percentage
Calculate cost savings
Calculate time overhead
Identify patterns (where Haiku wins, struggles)
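
The three calculations could look like the sketch below. The definitions are assumptions made for illustration: quality match is taken as the share of Sonnet-solved tasks that Haiku + ultrathink also solved, and cost and time are compared on totals across all tasks. The Run type and the analyze function are hypothetical names.

```python
from typing import NamedTuple


class Run(NamedTuple):
    """Hypothetical minimal record of one execution."""
    success: bool
    cost_usd: float
    total_s: float


def analyze(pairs: dict[str, tuple[Run, Run]]) -> dict[str, float]:
    """pairs maps task ID to (haiku_run, sonnet_run)."""
    # Quality match (assumed definition): of the tasks Sonnet solved,
    # the fraction Haiku + ultrathink also solved.
    sonnet_ok = [tid for tid, (_, s) in pairs.items() if s.success]
    matched = [tid for tid in sonnet_ok if pairs[tid][0].success]
    quality = len(matched) / len(sonnet_ok) if sonnet_ok else 0.0

    # Cost and time compared on totals across all tasks.
    haiku_cost = sum(h.cost_usd for h, _ in pairs.values())
    sonnet_cost = sum(s.cost_usd for _, s in pairs.values())
    haiku_time = sum(h.total_s for h, _ in pairs.values())
    sonnet_time = sum(s.total_s for _, s in pairs.values())

    return {
        "quality_match": quality,
        "cost_savings": 1 - haiku_cost / sonnet_cost,
        "time_ratio": haiku_time / sonnet_time,
    }
```

Comparing totals rather than averaging per-task ratios keeps one very cheap or very fast task from skewing the result. A per-task breakdown would still be needed for the pattern analysis.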

VI. Results

⏳ Experiment Pending

Results will be added here as experiment phases complete. Expected completion: January 2026.

VII. Implications

If Validated

Haiku + ultrathink becomes the default for most development work:

Harness integration: Add ultrathink flag to task execution routing
Gastown workers: Default to Haiku + ultrathink for convoy work
Ralph loops: Use Haiku + ultrathink for iterative refinement, escalate to Sonnet if stuck
Documentation: Update model-routing-optimization.md with validated tier
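
The "escalate to Sonnet if stuck" behavior could be sketched as a simple ladder. This is an assumption about how escalation might work, not a description of the existing Ralph loop: the attempt callable, the retry limit, and the model labels are all hypothetical, and "stuck" is modeled as the attempt returning None.

```python
from typing import Callable, Optional, Sequence

# Hypothetical signature: (model label, task) -> result, or None if stuck.
Attempt = Callable[[str, str], Optional[str]]


def run_with_escalation(
    task: str,
    attempt: Attempt,
    ladder: Sequence[str] = ("haiku+ultrathink", "sonnet"),
    max_tries_per_model: int = 3,
) -> tuple[str, str]:
    """Try each model in order, moving up the ladder after repeated failures."""
    for model in ladder:
        for _ in range(max_tries_per_model):
            result = attempt(model, task)
            if result is not None:
                # Report which model finished the task so escalations can be counted.
                return model, result
    raise RuntimeError(f"all models stuck on task: {task}")
```

Returning the model that finished the task makes it possible to measure how often escalation happens, which is the figure that determines whether the cheaper default pays off.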

This changes the economics: 60-70% of tasks move from ~$0.01 to ~$0.001 per task, reducing monthly costs by ~20% while maintaining quality.

If Not Validated

Document where it fails and why:

Which complexity tiers succeed vs fail
Whether time overhead is too high
Whether plan quality is poor despite code working
Specific failure modes (debugging loops, multi-file reasoning, etc.)

Update routing recommendations with failure patterns documented.

The Broader Pattern

This experiment continues the Subtractive Triad at the meta-level: we're removing cost without sacrificing quality. If extended thinking closes the Haiku-Sonnet gap, we've found a leverage point (Meadows) for cost optimization.

The philosophy recedes; only the work remains. You think about the task, not which model to use.