Experiment
Meeting Capture
Tools Recede, Understanding Remains
```
╔══════════════════════════════════════════════════════════════════╗
║                         MEETING CAPTURE                          ║
║                                                                  ║
║  ┌──────────┐     ┌──────────┐     ┌──────────┐     ┌─────────┐  ║
║  │  ((o))   │     │    ~     │     │   ___    │     │  [D1]   │  ║
║  │  AUDIO   │ ──► │ WHISPER  │ ──► │   TEXT   │ ──► │  STORE  │  ║
║  └──────────┘     └──────────┘     └──────────┘     └─────────┘  ║
║                                                                  ║
║     Swift         Workers AI      Transcript        Database     ║
║    (local)       (Cloudflare)                                    ║
╚══════════════════════════════════════════════════════════════════╝
```
Hypothesis
A meeting transcription tool with fewer features will achieve higher actual utility than feature-rich alternatives—because tools that recede into transparent use (Zuhandenheit) serve understanding better than tools that demand attention.
Success Criteria
- Tool requires no attention while recording — Pass: zero interactions during meetings
- Transcription completes promptly — Actual: 15-30s for typical meetings
- Meetings detected automatically — Tested with Zoom; Meet/Teams detection implemented
- Recording stops automatically — Partial: manual stop required; detection unreliable
- System audio captured — Fail: macOS Sonoma requires a kernel extension; microphone only
Measured Results
| Metric | Target | Actual | Status |
|---|---|---|---|
| Setup time | <5 min | ~3 min | Pass |
| Transcription latency | <60s | 15-30s | Pass |
| Audio file size (30min) | <10MB | ~500KB | Pass |
| Interactions during meeting | 0 | 0 | Pass |
| Post-meeting interactions | 1 (stop) | 1 | Partial |
| System audio capture | Yes | No | Fail |
Grounding: Heidegger on Tools
In Being and Time (1927), Heidegger distinguishes two modes of encountering equipment:
Zuhandenheit
Ready-to-hand
The tool disappears into use. When hammering, we think about the nail—not the hammer. The tool serves the work; attention flows through it.
Vorhandenheit
Present-at-hand
The tool becomes an object of attention. When the hammer breaks, we suddenly see the hammer. It intrudes into consciousness.
"The less we just stare at the hammer-thing, and the more we seize hold of it and use it, the more primordial does our relationship to it become."
— Heidegger, Being and Time §15
Feature-rich tools like Granola and Otter shift equipment from Zuhanden to Vorhanden through accumulated complexity: speaker identification, action item extraction, calendar sync, team sharing. Each individually justified; collectively attention-demanding.
Method: Subtractive Triad
Following Rams's tenth principle ("As little design as possible") and the Subtractive Triad, we asked three questions for each potential feature:
DRY (Implementation)
"Have I built this before?"
→ Cloudflare D1, R2, Workers AI already exist. Use them.
Rams (Artifact)
"Does this feature earn its existence?"
→ Speaker diarization? Action items? No—complexity without proportional value.
Heidegger (System)
"Does this serve the whole?"
→ Store in D1 to join the CREATE SOMETHING knowledge graph.
Features Removed
- Real-time transcription — Complexity without benefit; post-meeting suffices
- Speaker identification — Adds 3x complexity; rarely needed for personal use
- Action item extraction — AI hallucination risk exceeds value
- Calendar integration — Meeting detection handles this
- Team sharing — Personal tool; single user
- Web dashboard — `curl` suffices for queries
Features Retained
- Auto-detect meetings — Zoom, Google Meet, Teams via NSWorkspace + AppleScript
- Record microphone — AVFoundation; system audio requires kernel extension
- Upload on stop — Multipart form to Cloudflare Worker
- Transcribe via Whisper — @cf/openai/whisper-large-v3-turbo
- Store in D1 — Searchable via SQL
Implementation
macOS Menubar App
Swift + SwiftUI · ~400 LOC
- NSWorkspace for app monitoring
- AppleScript → System Events for call detection
- AVFoundation for microphone capture
- URLSession for multipart upload
Cloudflare Worker
TypeScript · ~200 LOC
- R2 for audio storage
- Workers AI (Whisper) for transcription
- D1 for transcript storage
- ctx.waitUntil() for background processing
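The Worker's flow can be sketched as follows. The binding names (`AUDIO`, `AI`, `DB`), the table, and the response shape are assumptions for illustration, not the experiment's actual code; the model id and the store-then-re-fetch pattern come from this writeup.

```typescript
// Minimal sketch of the Worker pipeline. Binding names (AUDIO, AI, DB),
// the table, and the response shape are assumptions, not the real code.

interface R2Like {
  put(key: string, value: ArrayBuffer): Promise<unknown>;
  get(key: string): Promise<{ arrayBuffer(): Promise<ArrayBuffer> } | null>;
}
interface AiLike {
  run(model: string, input: { audio: string }): Promise<{ text: string }>;
}
interface D1Like {
  prepare(sql: string): { bind(...v: unknown[]): { run(): Promise<unknown> } };
}
interface Env { AUDIO: R2Like; AI: AiLike; DB: D1Like; }
interface Ctx { waitUntil(p: Promise<unknown>): void; }

// The turbo model expects Base64-encoded audio (see Technical Learnings).
function toBase64(buf: ArrayBuffer): string {
  const bytes = new Uint8Array(buf);
  let binary = "";
  for (let i = 0; i < bytes.length; i++) binary += String.fromCharCode(bytes[i]);
  return btoa(binary);
}

const worker = {
  async fetch(req: Request, env: Env, ctx: Ctx): Promise<Response> {
    const form = await req.formData();
    const file = form.get("audio") as File;
    const key = `recordings/${Date.now()}.m4a`;

    // Store first: the upload's ArrayBuffer may be detached after we respond.
    await env.AUDIO.put(key, await file.arrayBuffer());

    // Transcribe in the background, re-fetching the bytes from R2.
    ctx.waitUntil((async () => {
      const obj = await env.AUDIO.get(key);
      if (!obj) return;
      const { text } = await env.AI.run("@cf/openai/whisper-large-v3-turbo", {
        audio: toBase64(await obj.arrayBuffer()),
      });
      await env.DB.prepare("INSERT INTO transcripts (key, text) VALUES (?, ?)")
        .bind(key, text)
        .run();
    })());

    return new Response(JSON.stringify({ key }), { status: 202 });
  },
};

export default worker;
```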
Technical Learnings
Whisper requires Base64, not ArrayBuffer
@cf/openai/whisper-large-v3-turbo expects Base64-encoded audio, while the standard Whisper model accepts an integer array. This distinction caused "Type mismatch of '/audio'" errors until diagnosed; the documentation does not make it clear.
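The difference can be captured in a small helper. The model ids are from this experiment; the function itself is illustrative, not part of the codebase.

```typescript
// Illustrative helper: the turbo model wants Base64, the standard model an
// integer array. Model ids are from the experiment; this function is not.
export function whisperInput(
  model: string,
  bytes: Uint8Array
): { audio: string | number[] } {
  if (model.endsWith("turbo")) {
    let s = "";
    for (let i = 0; i < bytes.length; i++) s += String.fromCharCode(bytes[i]);
    return { audio: btoa(s) }; // Base64 string for whisper-large-v3-turbo
  }
  return { audio: Array.from(bytes) }; // integer array for standard whisper
}
```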
Unsigned apps cannot run AppleScript
macOS blocks unsigned apps from querying System Events, even when Automation permission is granted. Ad-hoc signing with the com.apple.security.automation.apple-events entitlement resolves this without a Developer ID.
ArrayBuffer detaches in async context
An ArrayBuffer captured by a ctx.waitUntil() callback can be detached by the time the callback runs, making the audio data unusable. Solution: store the audio in R2 first, then re-fetch it within the async handler.
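The hazard and the workaround can be demonstrated outside the Workers runtime. The sketch below substitutes an in-memory `Map` for R2 and uses `structuredClone` with a transfer list to force the detach that the runtime performs implicitly; nothing here is Cloudflare API.

```typescript
// Demonstration of the detach hazard and the store-first workaround.
// The Map stands in for R2; structuredClone's transfer list simulates
// the runtime detaching the request body's buffer after the response.

const bucket = new Map<string, ArrayBuffer>();

async function put(key: string, buf: ArrayBuffer): Promise<void> {
  // Copy the bytes immediately, while the buffer is still attached.
  bucket.set(key, buf.slice(0));
}

async function get(key: string): Promise<ArrayBuffer | undefined> {
  return bucket.get(key);
}

async function demo(): Promise<number> {
  const audio = new Uint8Array([1, 2, 3, 4]).buffer;
  await put("rec", audio); // store first...

  // ...then the original buffer is detached (here: an explicit transfer).
  structuredClone(audio, { transfer: [audio] });
  // audio.byteLength is now 0: a background task holding it has nothing.

  const fresh = await get("rec"); // re-fetch inside the async handler
  return fresh ? fresh.byteLength : -1;
}
```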
Limitations & Failures
System audio capture not implemented
macOS Sonoma requires a kernel extension or audio driver (like BlackHole) for system audio capture. This experiment captures microphone only—participant voices are clear, but shared content audio is not captured. A production tool would need to integrate with an audio routing solution.
Auto-stop detection unreliable
Zoom's helper processes (CptHost, caphost) don't terminate predictably when a meeting ends. AppleScript window-count detection has false positives. Current implementation requires manual stop. Acceptable for personal use; would need improvement for broader adoption.
Transcription accuracy varies
Whisper performs well for clear speech but struggles with overlapping voices, technical jargon, and non-English content. No quantitative WER measurement performed in this experiment.
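A follow-up could quantify this with word error rate. A minimal WER calculator (illustrative, not part of the experiment) is word-level Levenshtein distance divided by reference length:

```typescript
// Word error rate: word-level edit distance (substitutions, insertions,
// deletions) over the reference length. Illustrative sketch, not from
// the experiment's codebase.
export function wer(reference: string, hypothesis: string): number {
  const ref = reference.toLowerCase().split(/\s+/).filter(Boolean);
  const hyp = hypothesis.toLowerCase().split(/\s+/).filter(Boolean);
  const d: number[][] = Array.from({ length: ref.length + 1 }, (_, i) =>
    Array.from({ length: hyp.length + 1 }, (_, j) =>
      i === 0 ? j : j === 0 ? i : 0
    )
  );
  for (let i = 1; i <= ref.length; i++) {
    for (let j = 1; j <= hyp.length; j++) {
      const cost = ref[i - 1] === hyp[j - 1] ? 0 : 1;
      d[i][j] = Math.min(
        d[i - 1][j] + 1,     // deletion
        d[i][j - 1] + 1,     // insertion
        d[i - 1][j - 1] + cost // substitution or match
      );
    }
  }
  return ref.length === 0 ? 0 : d[ref.length][hyp.length] / ref.length;
}
```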
Reproducibility
Prerequisites
- macOS 14+ (Sonoma) with Xcode 15+
- Cloudflare account with Workers AI access
- Microphone permission granted to app
- Automation permission for System Events
Expected Challenges
- Code signing: Without ad-hoc signing, AppleScript will fail silently
- Whisper input format: Use Base64, not ArrayBuffer for turbo model
- D1 FTS5: Full-text search virtual tables may cause database corruption
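Given the FTS5 caveat, a plain table with `LIKE` queries is the safer baseline. The schema and query below are a sketch; the table and column names are assumptions, not the experiment's actual schema.

```typescript
// Sketch: plain-table schema plus a LIKE-based search, avoiding the FTS5
// virtual tables that proved fragile on D1. Names are assumptions.
export const SCHEMA = `
CREATE TABLE IF NOT EXISTS transcripts (
  key        TEXT PRIMARY KEY,
  text       TEXT NOT NULL,
  created_at TEXT DEFAULT (datetime('now'))
);`;

// Build a parameterized search query (pass param as a bound value, never
// interpolated, to avoid SQL injection).
export function searchSql(term: string): { sql: string; param: string } {
  return {
    sql: "SELECT key, text FROM transcripts WHERE text LIKE ? ORDER BY created_at DESC",
    param: `%${term}%`,
  };
}
```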
Starting Prompt
```
Build a macOS menubar app that:
1. Detects when Zoom/Meet/Teams starts a call
2. Records microphone audio
3. Uploads to a Cloudflare Worker on stop
4. Transcribes via Whisper and stores in D1

Use: Swift/SwiftUI, AVFoundation, NSWorkspace
Constraints: No UI during recording, <500 LOC total
```
Conclusion
The hypothesis is partially validated. A minimal transcription tool achieves Zuhandenheit for its core function—zero attention during meetings, automatic transcription after. The tool recedes.
However, the manual stop requirement and microphone-only capture represent failures to fully disappear. A tool that demands even one interaction per meeting has not completely achieved ready-to-hand status.
"Weniger, aber besser."
— Dieter Rams
Less, but better. The discipline of removal produced a functional tool in ~600 LOC across two components. Whether "better" depends on use case: for personal meeting notes, this suffices. For team collaboration, the removed features become necessary.