Webflow Plagiarism Detection: Agent-Native Algorithms

╔══════════════════════════════════════════════════════════════════╗
║  WEBFLOW PLAGIARISM DETECTION                    v2.3.0          ║
║  ────────────────────────────────────────────────────────────    ║
║  9,593 templates │ 517,850 JS functions │ $2.20/month            ║
║                                                                  ║
║  ┌──────────┐   ┌──────────┐   ┌──────────┐   ┌──────────┐      ║
║  │ MinHash  │──▶│   LSH    │──▶│ PageRank │──▶│ Bayesian │      ║
║  │(1997)    │   │ (1998)   │   │ (1996)   │   │ (1763)   │      ║
║  └──────────┘   └──────────┘   └──────────┘   └──────────┘      ║
║       │              │              │              │              ║
║       └──────────────┴──────────────┴──────────────┘              ║
║                             │                                     ║
║                      ╔══════▼══════╗                              ║
║                      ║  MCP Tools  ║                              ║
║                      ║ (10 tools)  ║                              ║
║                      ╚═════════════╝                              ║
║                                                                  ║
║  "Classic CS algorithms wrapped as tools for AI agent use"       ║
╚══════════════════════════════════════════════════════════════════╝

Hypothesis

Agent-native design, meaning classic algorithms exposed as MCP tools, enables team
members' AI agents to perform sophisticated template analysis without custom
integrations. The algorithms do the heavy lifting; AI handles the edge cases that
require judgment.

The Problem

Webflow Marketplace receives plagiarism reports, each comparing two templates. Manual
review is expensive ($625/month for 50 cases, or $12.50 per case). We needed a system
that could:

Fingerprint 9,500+ templates efficiently
Detect similarity at multiple levels (code, structure, semantics)
Distinguish originals from derivatives
Flag edge cases for human review
Enable any team member's AI agent to invoke analysis

Architecture

The system uses a layered detection approach:

Template URL
    ↓
┌───────────────────┐
│  Bloom Filter     │ ─── Already indexed? Skip (O(1))
└───────────────────┘
    ↓
┌───────────────────┐
│  SuperMinHash     │ ─── 128-permutation fingerprint
│  + LSH Banding    │ ─── 16 bands for O(1) lookup
└───────────────────┘
    ↓
┌───────────────────┐
│  Vector Embed     │ ─── OpenAI text-embedding-3-small
│  (Semantic)       │ ─── 1536 dimensions
└───────────────────┘
    ↓
┌───────────────────┐
│  Bayesian Score   │ ─── Combine signals → probability
└───────────────────┘
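
The fingerprint and banding stage can be sketched as follows. This is a minimal illustration, not the deployed code: it uses plain MinHash rather than SuperMinHash, and the hash function, 5-character shingles, and the `LshIndex` class are assumptions. Only the 128-value signature and the 16 bands come from the diagram.

```typescript
// Sketch only: plain MinHash stands in for SuperMinHash, and the hash
// function, shingle size, and index shape are illustrative assumptions.
const NUM_HASHES = 128; // signature length, from the diagram
const NUM_BANDS = 16; // band count, from the diagram
const ROWS_PER_BAND = NUM_HASHES / NUM_BANDS; // 8 signature values per band

// Seeded 32-bit string hash: FNV-1a followed by a murmur3-style finalizer.
function hash32(text: string, seed: number): number {
  let h = (0x811c9dc5 ^ seed) >>> 0;
  for (let i = 0; i < text.length; i++) {
    h ^= text.charCodeAt(i);
    h = Math.imul(h, 0x01000193);
  }
  h ^= h >>> 16;
  h = Math.imul(h, 0x85ebca6b);
  h ^= h >>> 13;
  h = Math.imul(h, 0xc2b2ae35);
  h ^= h >>> 16;
  return h >>> 0;
}

// Overlapping k-character shingles of the input text.
function shingles(text: string, k: number = 5): Set<string> {
  const out = new Set<string>();
  for (let i = 0; i + k <= text.length; i++) out.add(text.slice(i, i + k));
  return out;
}

// For each of the 128 seeds, keep the smallest hash seen over all shingles.
function minhashSignature(items: Set<string>): number[] {
  const sig: number[] = [];
  for (let s = 0; s < NUM_HASHES; s++) sig.push(0xffffffff);
  items.forEach((item) => {
    for (let s = 0; s < NUM_HASHES; s++) {
      const h = hash32(item, s);
      if (h < sig[s]) sig[s] = h;
    }
  });
  return sig;
}

// The fraction of matching positions estimates the Jaccard similarity.
function estimateJaccard(a: number[], b: number[]): number {
  let same = 0;
  for (let i = 0; i < NUM_HASHES; i++) if (a[i] === b[i]) same++;
  return same / NUM_HASHES;
}

// One bucket key per band; two signatures become candidates if any band matches.
function bandKeys(sig: number[]): string[] {
  const keys: string[] = [];
  for (let b = 0; b < NUM_BANDS; b++) {
    const rows = sig.slice(b * ROWS_PER_BAND, (b + 1) * ROWS_PER_BAND);
    keys.push(b + ":" + rows.join(","));
  }
  return keys;
}

// In-memory LSH index: bucket key -> template ids (one Map lookup per band).
class LshIndex {
  private buckets = new Map<string, string[]>();

  add(id: string, sig: number[]): void {
    for (const key of bandKeys(sig)) {
      const ids = this.buckets.get(key) ?? [];
      ids.push(id);
      this.buckets.set(key, ids);
    }
  }

  candidates(sig: number[]): string[] {
    const found = new Set<string>();
    for (const key of bandKeys(sig)) {
      (this.buckets.get(key) ?? []).forEach((id) => found.add(id));
    }
    return Array.from(found);
  }
}
```

With 16 bands of 8 rows, two templates with true Jaccard similarity s share at least one band with probability 1 - (1 - s^8)^16. That is roughly 6% at s = 0.5 and roughly 95% at s = 0.8, so this banding favors near-duplicates over loose resemblance.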

Algorithms Implemented

Algorithm	Year	Purpose	Complexity
SuperMinHash	2017	Fingerprinting	O(n)
LSH Banding	1998	Approximate nearest neighbor	O(1) lookup
PageRank	1996	Authority ranking	O(V + E) per iteration
Bloom Filter	1970	Probabilistic membership	O(k)
HyperLogLog	2007	Cardinality estimation	O(1)
Bayesian	-	Multi-signal confidence	O(n)

Each algorithm is implemented in TypeScript and exposed via HTTP endpoints.
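
As an illustration of the "already indexed?" check, a Bloom filter can be sketched in a few lines. The bit-array size, the number of hash functions, the FNV-1a hash, and the class and method names here are all assumptions, not the values or code used in the system.

```typescript
// Sketch only: sizes, hash choice, and names are illustrative assumptions.
class BloomFilter {
  private bits: Uint8Array;
  private sizeBits: number;
  private numHashes: number;

  constructor(sizeBits: number, numHashes: number) {
    this.sizeBits = sizeBits;
    this.numHashes = numHashes;
    this.bits = new Uint8Array(Math.ceil(sizeBits / 8));
  }

  // 32-bit FNV-1a with a seed mixed into the offset basis.
  private hash(key: string, seed: number): number {
    let h = (0x811c9dc5 ^ seed) >>> 0;
    for (let i = 0; i < key.length; i++) {
      h ^= key.charCodeAt(i);
      h = Math.imul(h, 0x01000193);
    }
    return h >>> 0;
  }

  // Double hashing: position_i = (h1 + i * h2) mod sizeBits.
  private positions(key: string): number[] {
    const h1 = this.hash(key, 0);
    const h2 = this.hash(key, 0x9e3779b9) | 1; // force an odd step
    const out: number[] = [];
    for (let i = 0; i < this.numHashes; i++) {
      out.push(((h1 + Math.imul(i, h2)) >>> 0) % this.sizeBits);
    }
    return out;
  }

  add(key: string): void {
    for (const p of this.positions(key)) this.bits[p >> 3] |= 1 << (p & 7);
  }

  // false means definitely absent; true means possibly present.
  mightContain(key: string): boolean {
    return this.positions(key).every(
      (p) => (this.bits[p >> 3] & (1 << (p & 7))) !== 0
    );
  }
}

const seen = new BloomFilter(8192, 4);
seen.add("https://example.webflow.io/"); // hypothetical template URL
console.log(seen.mightContain("https://example.webflow.io/")); // true
```

A Bloom filter never returns a false negative, so a "not present" answer can safely trigger indexing. A "possibly present" answer may be a false positive, which in this design would mean an occasional skipped template unless a second check follows.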

MCP Integration

The webflow-mcp server exposes 10 tools for AI agent consumption. The diagram shows
the call path, and the table after it lists five of the tools:

┌─────────────────────────────────────────────────────────────┐
│  AI Agent (Claude, Cursor, etc.)                            │
│                         ↓                                   │
│                   MCP Protocol                              │
│                         ↓                                   │
│                  webflow-mcp                                │
│   plagiarism_scan, plagiarism_pagerank, etc.               │
│                         ↓                                   │
│              Plagiarism Agent Worker                        │
│   https://plagiarism-agent.createsomething.workers.dev     │
└─────────────────────────────────────────────────────────────┘

Tool	What It Does
`plagiarism_scan`	Check URL against indexed templates
`plagiarism_pagerank`	Compute authority rankings
`plagiarism_confidence`	Calculate plagiarism probability
`plagiarism_detect_frameworks`	Identify libraries used
`plagiarism_exclude`	Mark false positive pair

Framework Detection

The system detects 20+ frameworks including:

Animation: GSAP, Lenis, Locomotive, Barba, AOS
Carousel: Swiper, Splide
Design Systems: Client-First, Relume, Lumos
Webflow: Finsweet, Wized, Memberstack
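
One plausible way to detect these libraries is to match known names against a page's markup and script URLs. The rule list and patterns below are hypothetical and cover only a few of the frameworks named above; the system's real detection rules are not shown in this paper.

```typescript
// Sketch only: rules and patterns are hypothetical, not the system's own.
interface FrameworkRule {
  name: string;
  category: string;
  pattern: RegExp;
}

const RULES: FrameworkRule[] = [
  { name: "GSAP", category: "Animation", pattern: /gsap/i },
  { name: "Lenis", category: "Animation", pattern: /lenis/i },
  { name: "Swiper", category: "Carousel", pattern: /swiper/i },
  { name: "Splide", category: "Carousel", pattern: /splide/i },
  { name: "Finsweet", category: "Webflow", pattern: /finsweet/i },
];

// Return the names of all rules whose pattern occurs in the HTML source.
function detectFrameworks(html: string): string[] {
  return RULES.filter((rule) => rule.pattern.test(html)).map((rule) => rule.name);
}

const sampleHtml =
  '<script src="https://cdn.example.com/gsap.min.js"></script>' +
  '<script src="/js/swiper-bundle.min.js"></script>';
console.log(detectFrameworks(sampleHtml)); // [ 'GSAP', 'Swiper' ]
```

Detecting shared frameworks matters for plagiarism scoring because two templates that both load the same library will share code that neither author wrote.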

Three-Tier AI System

Reported Case
    ↓
┌────────────────────────────────┐
│  Tier 1: Vision Screening     │  FREE (Workers AI)
│  → Removes 30% obvious        │
└────────────────────────────────┘
    ↓
┌────────────────────────────────┐
│  Tier 2: Detailed Analysis    │  $0.02 (Claude Haiku)
│  → Removes 50% more           │
└────────────────────────────────┘
    ↓
┌────────────────────────────────┐
│  Tier 3: Edge Cases           │  $0.15 (Claude Sonnet)
│  → Handles 20% genuine        │
└────────────────────────────────┘
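
The escalation logic implied by the diagram can be sketched as a cascade in which each tier either returns a verdict or passes the case on. The `Verdict` values, the `Tier` shape, and the stub reviewers are assumptions for illustration; only the tier names and per-case costs come from the diagram.

```typescript
// Sketch only: types and stub reviewers are hypothetical.
type Verdict = "no_violation" | "violation" | "needs_human";

interface Tier {
  name: string;
  costCents: number; // per-case cost from the diagram
  review: (caseId: string) => Verdict | null; // null means "escalate"
}

interface CascadeResult {
  verdict: Verdict;
  costCents: number;
  decidedBy: string;
}

// Run tiers in order, accumulating cost until one returns a verdict.
function runCascade(caseId: string, tiers: Tier[]): CascadeResult {
  let costCents = 0;
  for (const tier of tiers) {
    costCents += tier.costCents;
    const verdict = tier.review(caseId);
    if (verdict !== null) return { verdict, costCents, decidedBy: tier.name };
  }
  return { verdict: "needs_human", costCents, decidedBy: "none" };
}

// Stub reviewers standing in for the real model calls.
const exampleTiers: Tier[] = [
  { name: "Tier 1", costCents: 0, review: () => null },
  { name: "Tier 2", costCents: 2, review: () => "no_violation" },
  { name: "Tier 3", costCents: 15, review: () => "needs_human" },
];

console.log(runCascade("case-001", exampleTiers));
```

Ordering the tiers from cheapest to most expensive means the costly model only sees cases the cheaper ones could not settle.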

Validation Results

Unit Tests: 41/41 passing

MinHash/SuperMinHash (10 tests)
LSH Banding (8 tests)
Bayesian Confidence (9 tests)
PageRank (14 tests)

Integration Tests:

Comparison	Vector Similarity	MinHash Similarity
Artifact vs Pathwise	95.2%	50.8%
Prospect vs Pathwise	94.7%	(not compared)
Artifact vs Prospect	96.7%	14.1%

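The discrepancy is expected and informative:

Vector (94.7-96.7%): Captures semantic/structural similarity
MinHash (14.1-50.8%): Captures character-level copying

High vector similarity paired with low MinHash similarity indicates templates that
share layout and meaning without sharing much verbatim code.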
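
To show how two such signals might be combined into a single probability, here is a naive Bayes sketch that works in log-odds. The prior, the thresholds, and every likelihood value are placeholders chosen for illustration, and the signals are assumed to be conditionally independent. None of these are the system's actual weights.

```typescript
// Sketch only: prior, thresholds, and likelihoods are placeholder assumptions.
interface Evidence {
  name: string;
  observed: boolean; // did the signal fire for this template pair?
  pIfCopied: number; // P(signal fires | plagiarism)
  pIfIndependent: number; // P(signal fires | independent work)
}

// Posterior odds = prior odds times the likelihood ratio of each signal.
function posteriorProbability(prior: number, evidence: Evidence[]): number {
  let logOdds = Math.log(prior / (1 - prior));
  for (const e of evidence) {
    const likelihoodCopied = e.observed ? e.pIfCopied : 1 - e.pIfCopied;
    const likelihoodIndependent = e.observed
      ? e.pIfIndependent
      : 1 - e.pIfIndependent;
    logOdds += Math.log(likelihoodCopied / likelihoodIndependent);
  }
  return 1 / (1 + Math.exp(-logOdds));
}

// A pair like Artifact vs Pathwise (95.2% vector, 50.8% MinHash) would
// satisfy both of these hypothetical thresholds.
const pairProbability = posteriorProbability(0.1, [
  { name: "vector >= 0.90", observed: true, pIfCopied: 0.9, pIfIndependent: 0.3 },
  { name: "MinHash >= 0.50", observed: true, pIfCopied: 0.7, pIfIndependent: 0.05 },
]);
console.log(pairProbability.toFixed(2));
```

With these placeholder numbers the prior odds of 1:9 are multiplied by likelihood ratios of 3 and 14, giving odds of 42:9 and a probability of about 0.82. The MinHash signal dominates because it rarely fires for independent work.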

Cost Analysis

Approach	Monthly Cost
Manual Review (50 cases)	$625
Automated System	$2.20
Savings	99.6%
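
The $2.20 figure can be reproduced from the tier percentages and per-case costs under one assumption: a case pays for every tier it passes through. The sketch below works in cents to avoid floating-point rounding. It is a reading of the published numbers, not the original cost model.

```typescript
// Sketch only: assumes each case pays for every tier it reaches.
const CASES_PER_MONTH = 50;
const MANUAL_COST_CENTS = 62500; // $625 per month

const COST_TIERS = [
  { name: "Tier 1", resolvedPercent: 30, costCents: 0 },
  { name: "Tier 2", resolvedPercent: 50, costCents: 2 },
  { name: "Tier 3", resolvedPercent: 20, costCents: 15 },
];

let remainingPercent = 100;
let automatedCents = 0;
for (const tier of COST_TIERS) {
  // Cases not resolved by earlier tiers all reach this one.
  const casesReachingTier = (CASES_PER_MONTH * remainingPercent) / 100;
  automatedCents += casesReachingTier * tier.costCents;
  remainingPercent -= tier.resolvedPercent;
}

const savingsPercent = ((1 - automatedCents / MANUAL_COST_CENTS) * 100).toFixed(1);
console.log(`$${(automatedCents / 100).toFixed(2)} per month`); // $2.20 per month
console.log(`${savingsPercent}% savings`); // 99.6% savings
```

All 50 cases reach Tier 1 at no cost, 35 reach Tier 2 at $0.02 each ($0.70), and 10 reach Tier 3 at $0.15 each ($1.50), for $2.20 in total.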

Key Insight

Agent-native ≠ AI-only.
Classic algorithms (1970-2017) do the heavy lifting.
AI handles edge cases requiring judgment.
MCP wraps deterministic tools for AI consumption.

What This Proves

✓ MinHash + Vector embeddings provide complementary signals
✓ LSH enables O(1) candidate lookup at scale
✓ PageRank identifies originals vs derivatives
✓ MCP enables any team member's AI to invoke analysis
✓ Three-tier AI optimizes cost/accuracy tradeoff
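
To illustrate the PageRank claim, the sketch below runs power iteration over a small similarity graph. The edge direction (each suspected derivative points to the template it resembles), the damping factor of 0.85, and the node names are assumptions; how the system actually builds its graph is not described here.

```typescript
// Sketch only: graph construction and parameters are assumptions.
function pageRank(
  nodes: string[],
  edges: [string, string][], // [from, to]: "from" resembles "to"
  damping: number = 0.85,
  iterations: number = 50
): Record<string, number> {
  const n = nodes.length;
  const outLinks: Record<string, string[]> = {};
  nodes.forEach((v) => (outLinks[v] = []));
  edges.forEach(([from, to]) => outLinks[from].push(to));

  let rank: Record<string, number> = {};
  nodes.forEach((v) => (rank[v] = 1 / n));

  for (let it = 0; it < iterations; it++) {
    const next: Record<string, number> = {};
    let danglingMass = 0;
    nodes.forEach((v) => {
      next[v] = (1 - damping) / n;
      if (outLinks[v].length === 0) danglingMass += rank[v];
    });
    // Each node splits its rank evenly among the nodes it points to.
    nodes.forEach((v) => {
      const targets = outLinks[v];
      targets.forEach((t) => (next[t] += (damping * rank[v]) / targets.length));
    });
    // Nodes without outgoing edges spread their rank over all nodes.
    nodes.forEach((v) => (next[v] += (damping * danglingMass) / n));
    rank = next;
  }
  return rank;
}

const ranks = pageRank(
  ["original", "copyA", "copyB"],
  [
    ["copyA", "original"],
    ["copyB", "original"],
  ]
);
console.log(ranks);
```

In this toy graph the template that others point to ends up with the highest rank, which is the sense in which authority could indicate the original. Each iteration touches every node and edge once, matching the per-iteration cost of O(V + E).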

What This Doesn't Prove

○ Visual similarity (screenshot comparison not yet implemented)
○ Optimal Bayesian weights (weight tuning script created, not validated)
○ Real-time ingestion (webhook integration pending)

Reproducibility

Requirements:

Cloudflare Workers account
D1 database
OpenAI API key (for embeddings)
Anthropic API key (for Tier 2/3)

Deployment:

cd packages/webflow-site-analyzer-mcp
# --local migrates only the local dev database; use --remote for the deployed D1 database
wrangler d1 migrations apply plagiarism-db --local
wrangler deploy

Canon Reflection

Zuhandenheit (ready-to-hand): When the system works correctly, the infrastructure
disappears. Marketplace administrators see decisions in Airtable—not queues, tiers,
or AI models.

Subtractive Architecture: The three-tier system removes work at each stage:

Tier 1 removes the obvious (30% of cases)
Tier 2 removes the analyzable (50% of cases)
Tier 3 handles only genuine edge cases (the remaining 20%)

Weniger, aber besser: Less human time, better consistency, same quality of decisions.

Conclusion

The hypothesis is validated. Classic CS algorithms (MinHash, LSH, PageRank, Bayesian
scoring) combined with AI tiers create an effective plagiarism detection system at a
99.6% cost reduction. Exposing these tools via MCP enables any team member's AI agent
to perform sophisticated template analysis.

The system embodies the CREATE SOMETHING principle that tools should be agent-native:
designed for AI consumption while keeping humans in control of judgment calls.

"The bridge is a thing that gathers."
— Heidegger, Building Dwelling Thinking

MCP gathers human intent, algorithmic capability, and AI judgment into a unified workflow.
The protocol recedes; the analysis emerges.