Executive Thesis
Braintrust's practical advantage is not that it shows errors. It is that it surfaces hidden operational structure inside "mostly successful" traffic.
In this snapshot, only 71 of 1,000 rows are errors (7.1%). A naive read says "the system is mostly fine." The trace-level read says otherwise: two failure classes (permission and intent_routing) account for 54 of the 71 observed failures (76.1%).
Snapshot Evidence (Mar 4, 2026)
Project ID: 8ca0d63b-d985-4373-9906-c253bf3f52d0
Window: Mar 1, 2026 9:55 AM to Mar 4, 2026 5:30 AM (America/Chicago)
Rows: 1,000
Error rows: 71
Error Composition
- permission: 30 (42.3%)
- intent_routing: 24 (33.8%)
- rate_limit: 4 (5.6%)
- validation: 4 (5.6%)
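The percentages above can be re-derived from the raw counts. The following is a minimal sketch that uses only the numbers quoted in this snapshot; the variable and function names are illustrative, and the `unitemized` value is simply the arithmetic remainder of error rows that the composition list does not break out.

```python
# Counts copied from the snapshot above. Names are illustrative, not a Braintrust API.
TOTAL_ROWS = 1000
ERROR_ROWS = 71
counts = {"permission": 30, "intent_routing": 24, "rate_limit": 4, "validation": 4}


def share(count, total):
    """Return count as a percentage of total, rounded to one decimal place."""
    return round(100 * count / total, 1)


# Share of all error rows attributed to each failure class.
composition = {name: share(n, ERROR_ROWS) for name, n in counts.items()}

# Concentration in the two largest classes.
top_two = share(counts["permission"] + counts["intent_routing"], ERROR_ROWS)

# Overall error rate across the window.
error_rate = share(ERROR_ROWS, TOTAL_ROWS)

# Error rows not covered by the four listed classes.
unitemized = ERROR_ROWS - sum(counts.values())

print(composition, top_two, error_rate, unitemized)
```

The four listed classes cover 62 of the 71 error rows, so 9 rows fall outside the itemized composition.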
Hidden Reliability Risks Surfaced
- Permission failure clustering
  - Repeated forbidden signatures appeared in bursts (for example, "You don't have permission to access this post.").
- Intent-router brittleness
  - hub_route_intent produced 22 errors out of 36 calls (61.1% error rate), concentrated around synonym variants for Sheets tasks.
- Throttle duplication
  - 429 responses (TOO_MANY_REQUESTS, serviceErrorCode=101) appeared as repeatable patterns, indicating missing circuit-break behavior.
- Tail-latency instability
  - hub_update_state reached 252,517 ms, which is over 4 minutes for a control-plane path that should be predictable.
These are not independent bugs. They represent a reliability topology: permissions, routing, and control-plane latency interacting under real workload.
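The three kinds of evidence listed above (per-tool error rates, recurring error signatures, and latency tails) can all be computed from exported trace rows. The sketch below assumes each row is a dict with hypothetical keys `tool`, `error` (a signature string, or None on success), and `duration_ms`; these field names are assumptions for illustration and are not taken from the Braintrust export schema.

```python
from collections import Counter


def tool_error_rates(rows):
    """Map each tool name to (errors, calls, error rate in percent)."""
    calls, errors = Counter(), Counter()
    for row in rows:
        calls[row["tool"]] += 1
        if row.get("error"):
            errors[row["tool"]] += 1
    return {
        tool: (errors[tool], calls[tool], round(100 * errors[tool] / calls[tool], 1))
        for tool in calls
    }


def repeated_signatures(rows, min_count=2):
    """Return error signatures seen at least min_count times, most frequent first."""
    seen = Counter(row["error"] for row in rows if row.get("error"))
    return [(sig, n) for sig, n in seen.most_common() if n >= min_count]


def latency_outliers(rows, threshold_ms):
    """Return (tool, duration_ms) pairs above threshold_ms, slowest first."""
    slow = [
        (row["tool"], row["duration_ms"])
        for row in rows
        if row.get("duration_ms", 0) > threshold_ms
    ]
    return sorted(slow, key=lambda pair: pair[1], reverse=True)
```

Given 36 hub_route_intent rows of which 22 carry an error, `tool_error_rates` reports 61.1 for that tool, which matches the snapshot figure. Grouping by signature rather than by individual row is what separates recurrence from isolated incidents.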
From Trace to Ranked Experiments
We translated trace findings into five ranked experiments with exact acceptance criteria and dashboard specs:
- EXP-01 LinkedIn permission preflight
- EXP-02 Intent canonicalization + semantic fallback
- EXP-03 Provider 429 circuit breaker
- EXP-04 Control-plane cache + latency stabilization
- EXP-05 Tool-argument auto-repair
Specification index: docs/internal/braintrust-experiments/README.md
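To make EXP-03 concrete, here is a minimal sketch of a circuit breaker keyed on consecutive 429 responses. It is not the specification referenced above: the class name, the `failure_threshold` and `cooldown_s` defaults, and the single-threaded design are all assumptions chosen for illustration.

```python
import time


class CircuitBreaker:
    """Stop calling a provider after repeated 429s, then probe again after a cooldown."""

    def __init__(self, failure_threshold=3, cooldown_s=30.0, clock=time.monotonic):
        self.failure_threshold = failure_threshold  # hypothetical default
        self.cooldown_s = cooldown_s                # hypothetical default
        self.clock = clock                          # injectable for testing
        self.consecutive_429s = 0
        self.opened_at = None                       # None means the circuit is closed

    def allow_request(self):
        """Return True if a call may be attempted right now."""
        if self.opened_at is None:
            return True
        # Once the cooldown has elapsed, let a probe request through (half-open).
        return self.clock() - self.opened_at >= self.cooldown_s

    def record(self, status_code):
        """Update breaker state from the HTTP status of the latest call."""
        if status_code == 429:
            self.consecutive_429s += 1
            if self.consecutive_429s >= self.failure_threshold:
                # Open the circuit, or re-open it if a probe was throttled again.
                self.opened_at = self.clock()
        else:
            self.consecutive_429s = 0
            self.opened_at = None
```

A caller would check `allow_request()` before each provider call and pass the resulting status to `record()`. Because a throttled probe resets the cooldown, repeated 429s stop producing the duplicate retries seen in the traces.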
Why the Dashboard Design Matters
The dashboard uses Tufte-style high data density and direct labeling to reduce interpretive noise:
- Minimal chrome, maximal signal
- Error composition and tool reliability in one glance
- Repeated cluster table to expose recurrence rather than isolated incidents
- Latency outlier table to keep tails visible
This prevents the common failure mode where summary metrics hide operational recurrence.
Operational Loop
Each experiment now has:
- exact acceptance criteria
- explicit metrics and formulas
- dashboard panel requirements
- baseline evidence from the March 4 snapshot
This makes reliability work executable: the team can ship, measure, and gate promotion decisions against objective thresholds instead of subjective confidence.
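Gating a promotion decision against objective thresholds can be sketched as a small check over measured metrics. The metric names and threshold values below are hypothetical placeholders, not the acceptance criteria from the experiment specifications; only the 61.1% baseline comes from the snapshot.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Criterion:
    """One acceptance criterion: a metric name and the threshold it must meet."""
    metric: str
    threshold: float
    higher_is_better: bool = False

    def passes(self, value):
        if self.higher_is_better:
            return value >= self.threshold
        return value <= self.threshold


def gate(criteria, measured):
    """Return (promote, failed_metrics). A missing metric counts as a failure."""
    failed = [
        c.metric
        for c in criteria
        if c.metric not in measured or not c.passes(measured[c.metric])
    ]
    return (not failed, failed)


# Hypothetical thresholds for illustration only.
criteria = [
    Criterion("hub_route_intent_error_rate_pct", threshold=10.0),
    Criterion("hub_update_state_max_ms", threshold=5000.0),
]

# Baseline values quoted in the March 4 snapshot.
baseline = {
    "hub_route_intent_error_rate_pct": 61.1,
    "hub_update_state_max_ms": 252517.0,
}

promote, failed = gate(criteria, baseline)
print(promote, failed)
```

Treating a missing metric as a failure keeps the gate conservative: an experiment cannot be promoted merely because a dashboard panel was never wired up.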
Conclusion
Braintrust did not just report that errors existed. It surfaced where the system was structurally fragile despite high apparent throughput. That is the difference between observability as logging and observability as decision infrastructure.