PAPER-2026-002

Observability Infrastructure: Making AI Operations Visible

A three-layer observability architecture for AI-native systems: infrastructure tracing, LLM generation tracking, and agent coordination—unified through the AI Interaction Atlas vocabulary.


Abstract

As AI agents become central to software development workflows, observability becomes critical. We present CREATE SOMETHING's three-layer observability architecture that provides comprehensive visibility into agent operations without impeding the work itself. The architecture combines Cloudflare Workers Automatic Tracing for infrastructure, Langfuse for LLM generation tracking, and Loom for agent coordination. All layers share a common vocabulary—the AI Interaction Atlas—enabling consistent analysis across touchpoints. The key insight: observability tools should exhibit Zuhandenheit (ready-to-hand)—providing visibility when needed while remaining invisible during normal operation. This paper documents the architecture, implementation, and the data captured at each layer.

  • Layer 1: Cloudflare Infrastructure (Workers, D1, KV, R2 operations)
  • Layer 2: Langfuse LLM Tracing (generations, tokens, costs)
  • Layer 3: Loom Coordination (sessions, issues, routing)

1. The Problem

AI agents operate across multiple systems: they read files, call APIs, query databases, invoke LLMs, and coordinate with other agents. Traditional observability—designed for request-response web applications—fails to capture the unique characteristics of agent operations:

  • Long-running sessions: Agents may work for minutes or hours, not milliseconds
  • Multi-step workflows: A single task may involve dozens of LLM calls and tool invocations
  • Cost sensitivity: LLM tokens cost money; visibility into spend is critical
  • Non-determinism: The same input may produce different outputs; reproducibility requires capturing context
  • Human oversight: Some operations require approval; the observability system must track where humans intervened

The hermeneutic question: How do we make agent operations visible without making visibility itself the work? The tool must recede. You should think about the task, not the tracing.

2. Philosophy: Zuhandenheit

Heidegger distinguished between Zuhandenheit (ready-to-hand) and Vorhandenheit (present-at-hand). A hammer is ready-to-hand when you're hammering—you don't think about the hammer, you think about the nail. The hammer becomes present-at-hand when it breaks: suddenly you're aware of the tool itself.

Observability should be ready-to-hand. During normal operation, you shouldn't think about tracing—you should think about the work. But when something breaks, or when you need to understand costs, or when you're debugging an agent failure—then the observability system should provide rich, structured data exactly where you need it.

This philosophy drives several design decisions:

  • Automatic instrumentation: Tracing happens without explicit code in hot paths
  • Sampling: Not every request needs tracing; 10% head sampling captures patterns with minimal overhead
  • Unified vocabulary: The same terms (Atlas dimensions) work across all layers
  • Dashboards over logs: Aggregated views show patterns; raw logs available when needed
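
A minimal sketch of the first two decisions, assuming a hypothetical withTrace helper (not the actual package API): the wrapped handler carries no tracing code, and a head-sampling check decides up front whether an invocation is recorded at all.

// Hypothetical helper (not the actual package API): tracing stays out of hot-path code,
// and head sampling decides whether this invocation is recorded before any work happens.
type TraceRecord = {
  name: string;
  startedAt: number;
  durationMs: number;
  ok: boolean;
};

const HEAD_SAMPLING_RATE = 0.1; // trace roughly 10% of invocations

function recordTrace(record: TraceRecord): void {
  // Stand-in sink; the real system forwards spans to Langfuse or Workers tracing.
  console.log(JSON.stringify(record));
}

function withTrace<Args extends unknown[], R>(
  name: string,
  handler: (...args: Args) => Promise<R>,
): (...args: Args) => Promise<R> {
  return async (...args: Args): Promise<R> => {
    // Head sampling: decide once, up front.
    if (Math.random() >= HEAD_SAMPLING_RATE) {
      return handler(...args);
    }
    const startedAt = Date.now();
    try {
      const result = await handler(...args);
      recordTrace({ name, startedAt, durationMs: Date.now() - startedAt, ok: true });
      return result;
    } catch (err) {
      recordTrace({ name, startedAt, durationMs: Date.now() - startedAt, ok: false });
      throw err;
    }
  };
}

// Usage: const getPriority = withTrace('harness-mcp:get_priority', fetchPriorityFromD1);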

3. Architecture

3.1 Layer 1: Cloudflare Automatic Tracing

Cloudflare Workers provide built-in tracing for infrastructure operations. When enabled, every Worker invocation captures:

Operation       | Data Captured                | Use Case
Fetch calls     | URL, method, status, latency | External API dependencies
D1 queries      | SQL, rows affected, duration | Database performance
KV operations   | Key, operation, size         | Cache behavior
R2 access       | Bucket, object, bytes        | Storage patterns
Durable Objects | Class, method, duration      | Stateful coordination

Configuration is minimal—add to wrangler.jsonc:

"observability": {
  "enabled": true,
  "traces": {
    "enabled": true,
    "head_sampling_rate": 0.1
  }
}
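
With head_sampling_rate: 0.1, roughly one request in ten is traced end to end; the rate can be raised temporarily when debugging a specific issue (see Section 8).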

3.2 Layer 2: Langfuse LLM Tracing

Langfuse provides purpose-built observability for LLM applications. It captures:

  • Traces: Top-level operations (an MCP tool call, an agent session)
  • Spans: Sub-operations within a trace (a database query, a file read)
  • Generations: LLM API calls with full input/output and token counts
  • Scores: Quality metrics attached to traces (success, latency, user feedback)

The @create-something/observability package wraps Langfuse with Atlas metadata:

import { createTrace, createGeneration } from '@create-something/observability';
import { mcpToolMetadata } from '@create-something/observability/atlas';

// Create trace with Atlas dimensions
const trace = createTrace({
  name: 'harness-mcp:get_priority',
  metadata: mcpToolMetadata('harness-mcp', 'get_priority', 'orchestrate')
});

// Track LLM generation
const gen = createGeneration(trace, {
  name: 'claude-completion',
  model: 'claude-sonnet-4-20250514',
  input: messages
});

// ... make LLM call ...

gen.end(response, { input: 150, output: 500 });

3.3 Layer 3: Loom Agent Coordination

Loom (lm) provides agent-native task management with built-in observability:

  • Sessions: Agent work sessions with start/end times and cost tracking
  • Issues: Task state (pending, in-progress, completed, blocked)
  • Routing: Model selection decisions with confidence scores
  • Cost: Token usage and dollar amounts per session

Loom stores data in SQLite (.loom/loom.db), enabling offline analysis and crash recovery. The data syncs with Langfuse for unified dashboards.
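
Because the store is a plain SQLite file, ad-hoc analysis needs nothing beyond a SQLite client. The sketch below uses better-sqlite3; the table and column names are assumptions about the Loom schema, not documented names.

// Offline analysis directly against Loom's local store (.loom/loom.db).
// Table and column names are assumed for illustration; check the actual schema first.
import Database from 'better-sqlite3';

const db = new Database('.loom/loom.db', { readonly: true });

// Total spend and iteration count per agent session, most expensive first.
const rows = db
  .prepare(
    `SELECT session_id, issue_id,
            SUM(cost_usd) AS total_cost_usd,
            COUNT(*)      AS iterations
     FROM session_iterations
     GROUP BY session_id, issue_id
     ORDER BY total_cost_usd DESC
     LIMIT 10`,
  )
  .all();

console.table(rows);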

4. The AI Interaction Atlas

All three layers share a common vocabulary: the AI Interaction Atlas from quietloudlab. The Atlas defines six dimensions for categorizing AI interactions:

Dimension      | Description                   | Examples
AI Tasks       | What capabilities AI provides | generate, classify, orchestrate
Human Tasks    | What people do in the loop    | review, approve, edit
System Tasks   | What infrastructure handles   | routing, logging, validation
Data Artifacts | What information flows        | prompt, completion, context
Constraints    | What boundaries apply         | latency, cost, privacy
Touchpoints    | Where interactions happen     | mcp_server, api, worker

By annotating all traces with Atlas metadata, we can query across layers. For example: "Show all operations where ai_task.type = 'generate' and constraint.budget_usd > 1.00."
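
The sketch below shows one illustrative shape for that metadata, with the query above expressed as a simple predicate; the exact field names exported by @create-something/observability/atlas may differ.

// Illustrative Atlas metadata shape; the field names exported by the package may differ.
interface AtlasMetadata {
  touchpoint: 'mcp_server' | 'api' | 'worker';
  ai_task: { type: 'generate' | 'classify' | 'orchestrate' };
  human_task?: { type: 'review' | 'approve' | 'edit' };
  constraint?: { budget_usd?: number; latency_ms?: number };
}

interface TraceSummary {
  name: string;
  metadata: AtlasMetadata;
  cost_usd: number;
}

// "Show all operations where ai_task.type = 'generate' and constraint.budget_usd > 1.00."
function expensiveGenerations(traces: TraceSummary[]): TraceSummary[] {
  return traces.filter(
    (t) =>
      t.metadata.ai_task.type === 'generate' &&
      (t.metadata.constraint?.budget_usd ?? 0) > 1.0,
  );
}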

5. What Data Gets Captured

5.1 MCP Server Tool Calls

Every MCP tool invocation creates a trace with:

  • Server name and tool name
  • Input parameters (sanitized)
  • Output/error response
  • Duration and timestamp
  • Atlas metadata (touchpoint, AI task type)
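
A minimal sketch of how a tool handler might be wrapped to produce such a trace. Only createTrace and mcpToolMetadata come from the package shown in Section 3.2; the sanitize helper, the handler signature, and the assumption that createTrace returns a Langfuse-style client with update() are illustrative.

// Sketch: wrap an MCP tool handler so every invocation becomes a trace.
// Assumption: createTrace returns a Langfuse-style trace client exposing update().
import { createTrace } from '@create-something/observability';
import { mcpToolMetadata } from '@create-something/observability/atlas';

function sanitize(input: Record<string, unknown>): Record<string, unknown> {
  // Drop fields that commonly carry credentials before the payload reaches the trace store.
  const redacted = { ...input };
  for (const key of ['apiKey', 'token', 'authorization']) {
    delete redacted[key];
  }
  return redacted;
}

async function tracedToolCall(
  server: string,
  tool: string,
  input: Record<string, unknown>,
  handler: (input: Record<string, unknown>) => Promise<unknown>,
): Promise<unknown> {
  const trace = createTrace({
    name: `${server}:${tool}`,
    metadata: mcpToolMetadata(server, tool, 'orchestrate'),
  });
  const startedAt = Date.now();
  try {
    const output = await handler(input);
    trace.update({ input: sanitize(input), output, metadata: { duration_ms: Date.now() - startedAt } });
    return output;
  } catch (err) {
    trace.update({ input: sanitize(input), output: String(err), metadata: { error: true } });
    throw err;
  }
}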

5.2 LLM Generations

Each Claude API call captures:

  • Model identifier (claude-sonnet-4-20250514)
  • Input messages (or summary for privacy)
  • Output completion
  • Token usage: input, output, total
  • Cost calculation
  • Parent trace/span for correlation
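
The cost figure is simple arithmetic over the token counts. The sketch below uses illustrative per-million-token rates; substitute current pricing for each model.

// Cost derivation from token counts. Rates are illustrative placeholders
// (USD per million tokens); substitute current pricing per model.
const PRICE_PER_MTOK_USD: Record<string, { input: number; output: number }> = {
  'claude-sonnet-4-20250514': { input: 3.0, output: 15.0 }, // example rates only
};

function generationCostUsd(
  model: string,
  usage: { input: number; output: number },
): number {
  const price = PRICE_PER_MTOK_USD[model];
  if (!price) return 0;
  return (usage.input * price.input + usage.output * price.output) / 1_000_000;
}

// The generation from Section 3.2 (150 input, 500 output tokens) works out to
// roughly $0.008 at these example rates.
console.log(generationCostUsd('claude-sonnet-4-20250514', { input: 150, output: 500 }));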

5.3 Agent Sessions

The agentic-executor tracks:

  • Session ID and issue ID
  • Budget allocation and consumption
  • Iteration count and costs per iteration
  • Files modified
  • Status (running, paused, complete, error)
  • Termination reason
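
An illustrative shape for the session record implied by this list; the field names are assumptions, not the executor's actual types.

// Illustrative session record; field names are assumptions, not the executor's actual types.
type SessionStatus = 'running' | 'paused' | 'complete' | 'error';

interface AgentSession {
  sessionId: string;
  issueId: string;
  budgetUsd: number;                                  // allocated budget
  spentUsd: number;                                   // consumed so far
  iterations: Array<{ index: number; costUsd: number }>;
  filesModified: string[];
  status: SessionStatus;
  terminationReason?: string;                         // set once the session stops
}

const remainingBudgetUsd = (s: AgentSession): number => s.budgetUsd - s.spentUsd;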

5.4 Infrastructure Operations

Cloudflare automatic tracing captures:

  • Worker cold starts and execution time
  • Subrequest chains (fetch → fetch → fetch)
  • Database query plans and timing
  • Cache hit/miss ratios
  • Memory and CPU usage

6. Implementation

The observability stack is deployed across CREATE SOMETHING properties:

Component                       | Location                                     | Purpose
@create-something/observability | packages/observability/                      | Shared Langfuse wrapper with Atlas types
MCP instrumentation             | packages/*-mcp/                              | Tool call tracing for all MCP servers
Agentic executor                | packages/space/workers/agentic-executor/     | LLM generation tracing for agent sessions
Observability dashboard         | packages/io/src/routes/admin/observability/  | Unified view of all metrics

Configuration:

  • Langfuse project: CREATE SOMETHING (US cloud region)
  • Cloudflare tracing: 10% sampling rate
  • Secrets managed via wrangler pages secret
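
Inside a Worker, those secrets surface on the env bindings object at request time. The binding names below are conventional Langfuse names and are assumptions about this deployment, not confirmed configuration.

// Reading Langfuse credentials from Worker bindings at request time.
// Binding names are assumptions; match whatever was set via `wrangler pages secret put`.
interface Env {
  LANGFUSE_PUBLIC_KEY: string;
  LANGFUSE_SECRET_KEY: string;
  LANGFUSE_HOST: string; // e.g. https://us.cloud.langfuse.com
}

export default {
  async fetch(_request: Request, env: Env): Promise<Response> {
    // Secrets are never bundled; they are injected as bindings and handed to the
    // observability wrapper when the request is handled.
    const configured = Boolean(env.LANGFUSE_PUBLIC_KEY && env.LANGFUSE_SECRET_KEY);
    return new Response(JSON.stringify({ observability: configured }), {
      headers: { 'content-type': 'application/json' },
    });
  },
};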

7. Viewing the Data

Langfuse Dashboard:

  • Traces by time: us.cloud.langfuse.com
  • Cost by model: Token usage and spend breakdown
  • Generation latency: P50, P90, P99 percentiles
  • Scores: Success rates, user feedback

Cloudflare Dashboard:

  • Workers → Analytics → Traces
  • Flame graphs for request timing
  • Subrequest waterfall charts
  • Error rate trends

Internal Dashboard:

  • Unified metrics view: packages/io/src/routes/admin/observability/ aggregates data from all three layers (Cloudflare, Langfuse, Loom)

8. Lessons Learned

1. Automatic > Manual

Manual instrumentation creates friction. Developers skip it when rushed. Automatic tracing (Cloudflare) and wrapper functions (observability package) ensure coverage without cognitive overhead.

2. Sampling Is Essential

100% tracing creates data overload and performance overhead. 10% sampling captures patterns while keeping costs manageable. Increase sampling when debugging specific issues.

3. Shared Vocabulary Enables Analysis

The Atlas vocabulary lets us ask questions like "What's the cost of all generate tasks?" across MCP servers, Workers, and LLM calls. Without shared terminology, each layer is an island.

4. Dashboards Over Logs

Raw logs are necessary but insufficient. Aggregated dashboards show patterns—cost spikes, latency regressions, error clusters. Start with dashboards; drill into logs when needed.

9. Conclusion

AI-native systems require observability designed for their unique characteristics: long-running sessions, multi-step workflows, cost sensitivity, and human oversight. The three-layer architecture—Cloudflare for infrastructure, Langfuse for LLM tracing, Loom for coordination—provides comprehensive visibility while adhering to the principle of Zuhandenheit: the tools recede into transparent use.

The AI Interaction Atlas vocabulary unifies analysis across layers, enabling queries that span from database operations to LLM generations to human approvals. This is the foundation for understanding, optimizing, and debugging AI agent operations at scale.

Status: ✅ Production deployed across all CREATE SOMETHING properties.

Related Research

Haiku Optimization — Intelligent model routing with cost tracking

AI Interaction Atlas — Shared vocabulary for AI interaction design

Langfuse Documentation — Open-source LLM observability platform