PAPER-2026-002

Observability Infrastructure: Making AI Operations Visible

A three-layer observability architecture for AI-native systems: infrastructure tracing, LLM generation tracking, and agent coordination—unified through the AI Interaction Atlas vocabulary.


Abstract

As AI agents become central to software development workflows, observability becomes critical. We present CREATE SOMETHING's three-layer observability architecture that provides comprehensive visibility into agent operations without impeding the work itself. The architecture combines Cloudflare Workers Automatic Tracing for infrastructure, Langfuse for LLM generation tracking, and Loom for agent coordination. All layers share a common vocabulary—the AI Interaction Atlas—enabling consistent analysis across touchpoints. The key insight: observability tools should exhibit Zuhandenheit (ready-to-hand)—providing visibility when needed while remaining invisible during normal operation. This paper documents the architecture, implementation, and the data captured at each layer.

  • Layer 1: Cloudflare Infrastructure (Workers, D1, KV, R2 operations)
  • Layer 2: Langfuse LLM Tracing (generations, tokens, costs)
  • Layer 3: Loom Coordination (sessions, issues, routing)

1. The Problem

AI agents operate across multiple systems: they read files, call APIs, query databases, invoke LLMs, and coordinate with other agents. Traditional observability—designed for request-response web applications—fails to capture the unique characteristics of agent operations:

  • Long-running sessions: Agents may work for minutes or hours, not milliseconds
  • Multi-step workflows: A single task may involve dozens of LLM calls and tool invocations
  • Cost sensitivity: LLM tokens cost money; visibility into spend is critical
  • Non-determinism: The same input may produce different outputs; reproducibility requires capturing context
  • Human oversight: Some operations require approval; the observability system must track where humans intervened

The hermeneutic question: How do we make agent operations visible without making visibility itself the work? The tool must recede. You should think about the task, not the tracing.

2. Philosophy: Zuhandenheit

Heidegger distinguished between Zuhandenheit (ready-to-hand) and Vorhandenheit (present-at-hand). A hammer is ready-to-hand when you're hammering—you don't think about the hammer, you think about the nail. The hammer becomes present-at-hand when it breaks: suddenly you're aware of the tool itself.

Observability should be ready-to-hand. During normal operation, you shouldn't think about tracing—you should think about the work. But when something breaks, or when you need to understand costs, or when you're debugging an agent failure—then the observability system should provide rich, structured data exactly where you need it.

This philosophy drives several design decisions:

  • Automatic instrumentation: Tracing happens without explicit code in hot paths
  • Sampling: Not every request needs tracing; 10% head sampling captures patterns with minimal overhead
  • Unified vocabulary: The same terms (Atlas dimensions) work across all layers
  • Dashboards over logs: Aggregated views show patterns; raw logs available when needed
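
A minimal sketch of the first two decisions, assuming a hypothetical withTrace helper (not the actual package API): the wrapped handler carries no tracing code, and a head-sampling check decides up front whether an invocation is recorded at all.

// Hypothetical helper (not the actual package API): tracing stays out of hot-path code,
// and head sampling decides whether this invocation is recorded before any work happens.
type TraceRecord = {
  name: string;
  startedAt: number;
  durationMs: number;
  ok: boolean;
};

const HEAD_SAMPLING_RATE = 0.1; // trace roughly 10% of invocations

function recordTrace(record: TraceRecord): void {
  // Stand-in sink; the real system forwards spans to Langfuse or Workers tracing.
  console.log(JSON.stringify(record));
}

function withTrace<Args extends unknown[], R>(
  name: string,
  handler: (...args: Args) => Promise<R>,
): (...args: Args) => Promise<R> {
  return async (...args: Args): Promise<R> => {
    // Head sampling: decide once, up front.
    if (Math.random() >= HEAD_SAMPLING_RATE) {
      return handler(...args);
    }
    const startedAt = Date.now();
    try {
      const result = await handler(...args);
      recordTrace({ name, startedAt, durationMs: Date.now() - startedAt, ok: true });
      return result;
    } catch (err) {
      recordTrace({ name, startedAt, durationMs: Date.now() - startedAt, ok: false });
      throw err;
    }
  };
}

// Usage: const getPriority = withTrace('harness-mcp:get_priority', fetchPriorityFromD1);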

3. Architecture

3.1 Layer 1: Cloudflare Automatic Tracing

Cloudflare Workers provide built-in tracing for infrastructure operations. When enabled, every Worker invocation captures:

Operation       | Data Captured                | Use Case
Fetch calls     | URL, method, status, latency | External API dependencies
D1 queries      | SQL, rows affected, duration | Database performance
KV operations   | Key, operation, size         | Cache behavior
R2 access       | Bucket, object, bytes        | Storage patterns
Durable Objects | Class, method, duration      | Stateful coordination

Configuration is minimal—add to wrangler.jsonc:

"observability": {
  "enabled": true,
  "traces": {
    "enabled": true,
    "head_sampling_rate": 0.1
  }
}
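
With head_sampling_rate: 0.1, roughly one request in ten is traced end to end; the rate can be raised temporarily when debugging a specific issue (see Section 8).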

3.2 Layer 2: Langfuse LLM Tracing

Langfuse provides purpose-built observability for LLM applications. It captures:

  • Traces: Top-level operations (an MCP tool call, an agent session)
  • Spans: Sub-operations within a trace (a database query, a file read)
  • Generations: LLM API calls with full input/output and token counts
  • Scores: Quality metrics attached to traces (success, latency, user feedback)

The @create-something/observability package wraps Langfuse with Atlas metadata:

import { createTrace, createGeneration } from '@create-something/observability';
import { mcpToolMetadata } from '@create-something/observability/atlas';

// Create trace with Atlas dimensions
const trace = createTrace({
  name: 'harness-mcp:get_priority',
  metadata: mcpToolMetadata('harness-mcp', 'get_priority', 'orchestrate')
});

// Track LLM generation
const gen = createGeneration(trace, {
  name: 'claude-completion',
  model: 'claude-sonnet-4-20250514',
  input: messages
});

// ... make LLM call ...

gen.end(response, { input: 150, output: 500 });

3.3 Layer 3: Loom Agent Coordination

Loom (lm) provides agent-native task management with built-in observability:

  • Sessions: Agent work sessions with start/end times and cost tracking
  • Issues: Task state (pending, in-progress, completed, blocked)
  • Routing: Model selection decisions with confidence scores
  • Cost: Token usage and dollar amounts per session

Loom stores data in SQLite (.loom/loom.db), enabling offline analysis and crash recovery. The data syncs with Langfuse for unified dashboards.
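
Because the store is a plain SQLite file, ad-hoc analysis needs nothing beyond a SQLite client. The sketch below uses better-sqlite3; the table and column names are assumptions about the Loom schema, not documented names.

// Offline analysis directly against Loom's local store (.loom/loom.db).
// Table and column names are assumed for illustration; check the actual schema first.
import Database from 'better-sqlite3';

const db = new Database('.loom/loom.db', { readonly: true });

// Total spend and iteration count per agent session, most expensive first.
const rows = db
  .prepare(
    `SELECT session_id, issue_id,
            SUM(cost_usd) AS total_cost_usd,
            COUNT(*)      AS iterations
     FROM session_iterations
     GROUP BY session_id, issue_id
     ORDER BY total_cost_usd DESC
     LIMIT 10`,
  )
  .all();

console.table(rows);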

4. The AI Interaction Atlas

All three layers share a common vocabulary: the AI Interaction Atlas from quietloudlab. The Atlas defines six dimensions for categorizing AI interactions:

Dimension      | Description                   | Examples
AI Tasks       | What capabilities AI provides | generate, classify, orchestrate
Human Tasks    | What people do in the loop    | review, approve, edit
System Tasks   | What infrastructure handles   | routing, logging, validation
Data Artifacts | What information flows        | prompt, completion, context
Constraints    | What boundaries apply         | latency, cost, privacy
Touchpoints    | Where interactions happen     | mcp_server, api, worker

By annotating all traces with Atlas metadata, we can query across layers. For example: "Show all operations where ai_task.type = 'generate' and constraint.budget_usd > 1.00."
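
The sketch below shows one illustrative shape for that metadata, with the query above expressed as a simple predicate; the exact field names exported by @create-something/observability/atlas may differ.

// Illustrative Atlas metadata shape; the field names exported by the package may differ.
interface AtlasMetadata {
  touchpoint: 'mcp_server' | 'api' | 'worker';
  ai_task: { type: 'generate' | 'classify' | 'orchestrate' };
  human_task?: { type: 'review' | 'approve' | 'edit' };
  constraint?: { budget_usd?: number; latency_ms?: number };
}

interface TraceSummary {
  name: string;
  metadata: AtlasMetadata;
  cost_usd: number;
}

// "Show all operations where ai_task.type = 'generate' and constraint.budget_usd > 1.00."
function expensiveGenerations(traces: TraceSummary[]): TraceSummary[] {
  return traces.filter(
    (t) =>
      t.metadata.ai_task.type === 'generate' &&
      (t.metadata.constraint?.budget_usd ?? 0) > 1.0,
  );
}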

5. What Data Gets Captured

5.1 MCP Server Tool Calls

Every MCP tool invocation creates a trace with:

  • Server name and tool name
  • Input parameters (sanitized)
  • Output/error response
  • Duration and timestamp
  • Atlas metadata (touchpoint, AI task type)
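
A minimal sketch of how a tool handler might be wrapped to produce such a trace. Only createTrace and mcpToolMetadata come from the package shown in Section 3.2; the sanitize helper, the handler signature, and the assumption that createTrace returns a Langfuse-style client with update() are illustrative.

// Sketch: wrap an MCP tool handler so every invocation becomes a trace.
// Assumption: createTrace returns a Langfuse-style trace client exposing update().
import { createTrace } from '@create-something/observability';
import { mcpToolMetadata } from '@create-something/observability/atlas';

function sanitize(input: Record<string, unknown>): Record<string, unknown> {
  // Drop fields that commonly carry credentials before the payload reaches the trace store.
  const redacted = { ...input };
  for (const key of ['apiKey', 'token', 'authorization']) {
    delete redacted[key];
  }
  return redacted;
}

async function tracedToolCall(
  server: string,
  tool: string,
  input: Record<string, unknown>,
  handler: (input: Record<string, unknown>) => Promise<unknown>,
): Promise<unknown> {
  const trace = createTrace({
    name: `${server}:${tool}`,
    metadata: mcpToolMetadata(server, tool, 'orchestrate'),
  });
  const startedAt = Date.now();
  try {
    const output = await handler(input);
    trace.update({ input: sanitize(input), output, metadata: { duration_ms: Date.now() - startedAt } });
    return output;
  } catch (err) {
    trace.update({ input: sanitize(input), output: String(err), metadata: { error: true } });
    throw err;
  }
}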

5.2 LLM Generations

Each Claude API call captures:

  • Model identifier (claude-sonnet-4-20250514)
  • Input messages (or summary for privacy)
  • Output completion
  • Token usage: input, output, total
  • Cost calculation
  • Parent trace/span for correlation
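
The cost figure is simple arithmetic over the token counts. The sketch below uses illustrative per-million-token rates; substitute current pricing for each model.

// Cost derivation from token counts. Rates are illustrative placeholders
// (USD per million tokens); substitute current pricing per model.
const PRICE_PER_MTOK_USD: Record<string, { input: number; output: number }> = {
  'claude-sonnet-4-20250514': { input: 3.0, output: 15.0 }, // example rates only
};

function generationCostUsd(
  model: string,
  usage: { input: number; output: number },
): number {
  const price = PRICE_PER_MTOK_USD[model];
  if (!price) return 0;
  return (usage.input * price.input + usage.output * price.output) / 1_000_000;
}

// The generation from Section 3.2 (150 input, 500 output tokens) works out to
// roughly $0.008 at these example rates.
console.log(generationCostUsd('claude-sonnet-4-20250514', { input: 150, output: 500 }));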

5.3 Agent Sessions

The agentic-executor tracks:

  • Session ID and issue ID
  • Budget allocation and consumption
  • Iteration count and costs per iteration
  • Files modified
  • Status (running, paused, complete, error)
  • Termination reason
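
An illustrative shape for the session record implied by this list; the field names are assumptions, not the executor's actual types.

// Illustrative session record; field names are assumptions, not the executor's actual types.
type SessionStatus = 'running' | 'paused' | 'complete' | 'error';

interface AgentSession {
  sessionId: string;
  issueId: string;
  budgetUsd: number;                                  // allocated budget
  spentUsd: number;                                   // consumed so far
  iterations: Array<{ index: number; costUsd: number }>;
  filesModified: string[];
  status: SessionStatus;
  terminationReason?: string;                         // set once the session stops
}

const remainingBudgetUsd = (s: AgentSession): number => s.budgetUsd - s.spentUsd;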

5.4 Infrastructure Operations

Cloudflare automatic tracing captures:

  • Worker cold starts and execution time
  • Subrequest chains (fetch → fetch → fetch)
  • Database query plans and timing
  • Cache hit/miss ratios
  • Memory and CPU usage

6. Implementation

The observability stack is deployed across CREATE SOMETHING properties:

Component                       | Location                                     | Purpose
@create-something/observability | packages/observability/                      | Shared Langfuse wrapper with Atlas types
MCP instrumentation             | packages/*-mcp/                              | Tool call tracing for all MCP servers
Agentic executor                | packages/space/workers/agentic-executor/     | LLM generation tracing for agent sessions
Observability dashboard         | packages/io/src/routes/admin/observability/  | Unified view of all metrics

Configuration:

  • Langfuse project: CREATE SOMETHING (US cloud region)
  • Cloudflare tracing: 10% sampling rate
  • Secrets managed via wrangler pages secret
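
Inside a Worker, those secrets surface on the env bindings object at request time. The binding names below are conventional Langfuse names and are assumptions about this deployment, not confirmed configuration.

// Reading Langfuse credentials from Worker bindings at request time.
// Binding names are assumptions; match whatever was set via `wrangler pages secret put`.
interface Env {
  LANGFUSE_PUBLIC_KEY: string;
  LANGFUSE_SECRET_KEY: string;
  LANGFUSE_HOST: string; // e.g. https://us.cloud.langfuse.com
}

export default {
  async fetch(_request: Request, env: Env): Promise<Response> {
    // Secrets are never bundled; they are injected as bindings and handed to the
    // observability wrapper when the request is handled.
    const configured = Boolean(env.LANGFUSE_PUBLIC_KEY && env.LANGFUSE_SECRET_KEY);
    return new Response(JSON.stringify({ observability: configured }), {
      headers: { 'content-type': 'application/json' },
    });
  },
};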

7. Viewing the Data

Langfuse Dashboard:

  • Traces by time: us.cloud.langfuse.com
  • Cost by model: Token usage and spend breakdown
  • Generation latency: P50, P90, P99 percentiles
  • Scores: Success rates, user feedback

Cloudflare Dashboard:

  • Workers → Analytics → Traces
  • Flame graphs for request timing
  • Subrequest waterfall charts
  • Error rate trends

Internal Dashboard:

  • Unified metrics view: packages/io/src/routes/admin/observability/ aggregates data from all three layers (Cloudflare, Langfuse, Loom)

8. Lessons Learned

1. Automatic > Manual

Manual instrumentation creates friction. Developers skip it when rushed. Automatic tracing (Cloudflare) and wrapper functions (observability package) ensure coverage without cognitive overhead.

2. Sampling Is Essential

100% tracing creates data overload and performance overhead. 10% sampling captures patterns while keeping costs manageable. Increase sampling when debugging specific issues.

3. Shared Vocabulary Enables Analysis

The Atlas vocabulary lets us ask questions like "What's the cost of all generate tasks?" across MCP servers, Workers, and LLM calls. Without shared terminology, each layer is an island.

4. Dashboards Over Logs

Raw logs are necessary but insufficient. Aggregated dashboards show patterns—cost spikes, latency regressions, error clusters. Start with dashboards; drill into logs when needed.

9. Conclusion

AI-native systems require observability designed for their unique characteristics: long-running sessions, multi-step workflows, cost sensitivity, and human oversight. The three-layer architecture—Cloudflare for infrastructure, Langfuse for LLM tracing, Loom for coordination—provides comprehensive visibility while adhering to the principle of Zuhandenheit: the tools recede into transparent use.

The AI Interaction Atlas vocabulary unifies analysis across layers, enabling queries that span from database operations to LLM generations to human approvals. This is the foundation for understanding, optimizing, and debugging AI agent operations at scale.

Status: ✅ Production deployed across all CREATE SOMETHING properties.

Related Research

Haiku Optimization — Intelligent model routing with cost tracking

AI Interaction Atlas — Shared vocabulary for AI interaction design

Langfuse Documentation — Open-source LLM observability platform