CREATE SOMETHING

Executive Thesis

Braintrust's practical advantage is not that it shows errors. It is that it surfaces hidden operational structure inside "mostly successful" traffic.

In this snapshot, only 71 of 1,000 rows are errors (7.1%). A naive read says the system is mostly fine. The trace-level read says otherwise: two failure classes (permission and intent_routing) account for 54 of the 71 errors, or 76.1% of all observed failures.

Snapshot Evidence (Mar 4, 2026)

Project ID: 8ca0d63b-d985-4373-9906-c253bf3f52d0
Window: Mar 1, 2026 9:55 AM to Mar 4, 2026 5:30 AM (America/Chicago)
Rows: 1,000
Error rows: 71

Error Composition

permission: 30 (42.3%)
intent_routing: 24 (33.8%)
rate_limit: 4 (5.6%)
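validation: 4 (5.6%)

The four classes listed cover 62 of the 71 error rows (87.3%); the remaining 9 are not broken out here.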
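The percentages above can be rechecked from the raw counts. The following is a minimal sketch, assuming the counts are copied by hand from this snapshot; the variable and function names are illustrative and are not part of any Braintrust API.

```python
from collections import Counter

# Counts copied from the snapshot above.
TOTAL_ROWS = 1000
TOTAL_ERRORS = 71
error_counts = Counter(permission=30, intent_routing=24, rate_limit=4, validation=4)

def share(count, total):
    """Return count as a percentage of total, rounded to one decimal place."""
    return round(100 * count / total, 1)

# Share of all error rows held by each listed class.
shares = {name: share(n, TOTAL_ERRORS) for name, n in error_counts.items()}

# Combined share of the two dominant classes.
top_two = share(error_counts["permission"] + error_counts["intent_routing"], TOTAL_ERRORS)

# Error rows that the composition list does not account for.
unlisted = TOTAL_ERRORS - sum(error_counts.values())

overall_error_rate = share(TOTAL_ERRORS, TOTAL_ROWS)
```

Dividing by the number of error rows rather than the number of listed errors is what reproduces the published figures.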

Hidden Reliability Risks Surfaced

Permission failure clustering
- Repeated forbidden signatures appeared in bursts (for example, "You don't have permission to access this post.").
Intent-router brittleness
- hub_route_intent produced 22 errors out of 36 calls (61.1% error rate), concentrated around synonym variants for Sheets tasks.
Throttle duplication
- 429 responses (TOO_MANY_REQUESTS, serviceErrorCode=101) appeared in repeating patterns, indicating missing circuit-breaker behavior.
Tail-latency instability
- hub_update_state reached 252,517 ms (about 4.2 minutes), far too long for a control-plane path that should be predictable.
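The three views behind these findings (per-tool error rate, repeated error signatures, latency outliers) can be sketched in a few lines. This is an illustration only: the row schema (tool, error, duration_ms), the placeholder error text, and the 60,000 ms outlier threshold are assumptions, and the rows are synthetic ones built to reproduce the counts reported above rather than real trace data.

```python
from collections import Counter

def tool_error_rates(rows):
    """Map tool name -> (errors, calls, error rate in percent)."""
    calls, errors = Counter(), Counter()
    for row in rows:
        calls[row["tool"]] += 1
        if row.get("error"):
            errors[row["tool"]] += 1
    return {tool: (errors[tool], n, round(100 * errors[tool] / n, 1))
            for tool, n in calls.items()}

def repeated_signatures(rows, min_count=2):
    """Return error messages seen at least min_count times, most frequent first."""
    counts = Counter(row["error"] for row in rows if row.get("error"))
    return [(msg, n) for msg, n in counts.most_common() if n >= min_count]

def latency_outliers(rows, threshold_ms):
    """Return rows at or above threshold_ms, slowest first."""
    slow = [r for r in rows if r.get("duration_ms", 0) >= threshold_ms]
    return sorted(slow, key=lambda r: r["duration_ms"], reverse=True)

# Synthetic rows that reproduce the reported counts; the schema and the
# "<routing error>" placeholder are assumptions, not real log content.
rows = (
    [{"tool": "hub_route_intent", "error": "<routing error>"}] * 22
    + [{"tool": "hub_route_intent", "error": None}] * 14
    + [{"tool": "hub_update_state", "error": None, "duration_ms": 252_517}]
)

rates = tool_error_rates(rows)
clusters = repeated_signatures(rows)
outliers = latency_outliers(rows, threshold_ms=60_000)  # threshold is hypothetical
```

Grouping by error message rather than by row is what turns isolated incidents into visible recurrence.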

These are not independent bugs. They represent a reliability topology: permissions, routing, and control-plane latency interacting under real workload.

From Trace to Ranked Experiments

We translated trace findings into five ranked experiments with exact acceptance criteria and dashboard specs:

EXP-01 LinkedIn permission preflight
EXP-02 Intent canonicalization + semantic fallback
EXP-03 Provider 429 circuit breaker
EXP-04 Control-plane cache + latency stabilization
EXP-05 Tool-argument auto-repair

Specification index: docs/internal/braintrust-experiments/README.md
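To make EXP-03 concrete, here is one way a 429 circuit breaker could look. It is a minimal sketch and not the specification referenced above: the class name, the max_failures and cooldown_s parameters, and their default values are all hypothetical.

```python
import time

class CircuitBreaker:
    """Open after max_failures consecutive 429s; allow trial calls after cooldown_s."""

    def __init__(self, max_failures=3, cooldown_s=30.0, clock=time.monotonic):
        self.max_failures = max_failures  # hypothetical default
        self.cooldown_s = cooldown_s      # hypothetical default
        self.clock = clock                # injectable so tests can control time
        self.failures = 0
        self.opened_at = None             # None means the breaker is closed

    def allow(self):
        """Return True if a call to the provider may be attempted now."""
        if self.opened_at is None:
            return True
        # Half-open: permit trial calls once the cooldown has elapsed.
        return self.clock() - self.opened_at >= self.cooldown_s

    def record(self, status_code):
        """Update breaker state from the HTTP status of the latest call."""
        if status_code == 429:
            self.failures += 1
            if self.failures >= self.max_failures:
                # Open (or re-open) the breaker and restart the cooldown.
                self.opened_at = self.clock()
        else:
            # Any non-429 response closes the breaker and clears the streak.
            self.failures = 0
            self.opened_at = None
```

A caller would check allow() before each provider request and pass the response status to record() afterward. Skipping requests while the breaker is open is what stops the same throttled call from being retried in a tight loop.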

Why the Dashboard Design Matters

The dashboard uses Tufte-style high data density and direct labeling to reduce interpretive noise:

Minimal chrome, maximal signal
Error composition and tool reliability in one glance
Repeated cluster table to expose recurrence rather than isolated incidents
Latency outlier table to keep tails visible

This prevents the common failure mode where summary metrics hide operational recurrence.

Operational Loop

Each experiment now has:

exact acceptance criteria
explicit metrics and formulas
dashboard panel requirements
baseline evidence from the March 4 snapshot

This makes reliability work executable: the team can ship, measure, and gate promotion decisions against objective thresholds instead of subjective confidence.
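Gating a promotion decision against thresholds can be sketched as follows. The Criterion type, the gate function, the metric name, and the 10% threshold are assumptions made for illustration; the actual acceptance criteria live in the experiment specifications. Only the 61.1% baseline comes from the snapshot.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Criterion:
    metric: str
    threshold: float
    direction: str  # "max": observed must be <= threshold; "min": observed must be >= threshold

    def passes(self, observed):
        if self.direction == "max":
            return observed <= self.threshold
        return observed >= self.threshold

def gate(criteria, observed):
    """Return (promote, failed_metrics). A metric missing from observed counts as a failure."""
    failed = [c.metric for c in criteria
              if c.metric not in observed or not c.passes(observed[c.metric])]
    return (not failed, failed)

# Hypothetical threshold; the baseline value is the hub_route_intent error rate from the snapshot.
criteria = [Criterion("hub_route_intent_error_rate_pct", 10.0, "max")]
baseline = {"hub_route_intent_error_rate_pct": 61.1}

promote, failed = gate(criteria, baseline)
```

Treating a missing metric as a failure keeps an experiment from being promoted simply because its dashboard panel was never wired up.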

Conclusion

Braintrust did not just report that errors existed. It surfaced where the system was structurally fragile despite high apparent throughput. That is the difference between observability as logging and observability as decision infrastructure.