Agent Ecosystem Testing

Key Findings for Cursor’s Web Fetch Behavior, Cursor-interpreted


Test Workflow

  1. Run python web_fetch_testing_framework.py --test {test ID} --track interpreted
  2. Review terminal output
  3. Copy the provided prompt, which asks the agent to report @Web* fetch results: character count,
    token estimate, truncation status, content completeness, and Markdown formatting integrity
  4. Open a new Cursor session and paste the prompt into the chat window
  5. Capture the agent’s full text response and observations as the interpreted finding; any gap
    between the agent’s self-report and the actual fetch behavior is itself a finding
  6. Log structured metadata as described in framework-reference.md
  7. Ensure results are saved to /results/cursor-interpreted/results.csv
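
Steps 6 and 7 can be sketched as a small append-only logger. This is a minimal sketch, not the framework's actual logger: the `log_result` helper and the column names in `FIELDS` are hypothetical, and the authoritative field list is the one in framework-reference.md.

```python
import csv
from pathlib import Path

# Hypothetical columns; the real schema is defined in framework-reference.md.
FIELDS = ["test_id", "run", "track", "chars_returned", "truncated", "notes"]


def log_result(csv_path, row):
    """Append one result row, writing the header first if the file is new or empty."""
    path = Path(csv_path)
    path.parent.mkdir(parents=True, exist_ok=True)
    needs_header = not path.exists() or path.stat().st_size == 0
    with path.open("a", newline="", encoding="utf-8") as handle:
        writer = csv.DictWriter(handle, fieldnames=FIELDS)
        if needs_header:
            writer.writeheader()
        writer.writerow(row)
```

A call such as `log_result("results/cursor-interpreted/results.csv", {"test_id": "BL-1", "run": 1, "track": "interpreted", "chars_returned": 1953, "truncated": True, "notes": ""})` would append one row. Appending rather than rewriting keeps earlier runs intact, which matters when the same test ID is repeated across sessions.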

*Results logged as “Methods tested: @Web” reflect the user-facing prompt syntax. Post-analysis revealed that testing misused @Web as a fetch command rather than a context attachment. Cursor may autonomously call the backend mechanisms WebFetch and mcp_web_fetch regardless of @Web syntax; see the Friction Note for analysis.


Platform Limit Summary

| Limit | Observed |
| --- | --- |
| Hard Character Limit | None detected: tested up to 702 KB |
| Hard Token Limit | None detected: tested up to ~179K tokens, average 33,912 |
| Output Consistency, Small | High variance: 2-3x across sessions, 1.9 KB → 5.6 KB same URL |
| Output Consistency, Large | Highly stable: <1% variance across sessions, 245 KB identical across 3 runs |
| Content Selection Behavior | Non-deterministic for small files; size-dependent |
| Truncation Pattern | Respects content boundaries when occurs, no mid-sentence cuts |
| JavaScript-heavy SPAs | Truncation at ~6 KB, ~1.5K tokens; free tier times out, Pro tier truncates cleanly |
| Redirect Chains | Successfully follows, tested 5-level redirect chain |
| Self-reported Completeness | Unreliable: model claims “full content” when returning subset |

Results Details

Model: Auto
Total Tests: 26
Distinct URLs: 13
Input Size Range: 2 KB–256 KB
Truncation Detection: Model assertion, verbatim last-50-chars, Markdown integrity
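
The last two detection methods can be checked mechanically when the source document is available locally. The following is a minimal sketch, assuming both the source text and the agent's returned text are in hand as strings. The function names are illustrative and not part of the testing framework, and the fence count is only a rough proxy for Markdown integrity.

```python
FENCE = "`" * 3  # Markdown code-fence marker


def tail_present(source: str, returned: str, n: int = 50) -> bool:
    """True if the last n characters of the source appear verbatim in the output."""
    tail = source.rstrip()[-n:]
    return bool(tail) and tail in returned


def fences_balanced(returned: str) -> bool:
    """True if the output has an even number of fence lines (each open has a close)."""
    fence_lines = [
        line for line in returned.splitlines() if line.lstrip().startswith(FENCE)
    ]
    return len(fence_lines) % 2 == 0
```

A missing tail combined with balanced fences is the pattern to watch for: the output is structurally valid but ends before the source does.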

Cross-run Output Variance

| Test | Category | Run 1 chars | Run 2 chars | Run 3 chars | Variance |
| --- | --- | --- | --- | --- | --- |
| BL-1 | Small - 87 KB | 1,953 | 5,595 | 4,100 | 2.9x |
| BL-2 | Small - 20 KB | 1,953 | 4,200 | 4,350 | 2.2x |
| SC-2 | Large - 80 KB | 702,885 | 702,885 | 702,885 | 1.0x |
| OP-4 | Large - 256 KB | 245,000 | 245,465 | 245,466 | 1.0x |
| EC-1 | SPA - 100 KB | 0 - timeout | 5,857 | null | null |
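
The Variance column is consistent with dividing the largest run by the smallest. That formula is an inference from the numbers, since it is not stated here. A minimal sketch under that assumption, with a hypothetical `variance_ratio` helper that skips timeouts and missing runs:

```python
from typing import Optional, Sequence


def variance_ratio(run_chars: Sequence[Optional[int]]) -> Optional[float]:
    """Max/min ratio over successful runs; None if fewer than two runs returned content."""
    successful = [chars for chars in run_chars if chars]  # drops None and 0-byte timeouts
    if len(successful) < 2:
        return None
    return round(max(successful) / min(successful), 1)


print(variance_ratio([1953, 5595, 4100]))  # BL-1
print(variance_ratio([0, 5857, None]))     # EC-1: only one successful run
```

Treating a timeout as "no data" rather than as zero avoids a division by zero and matches the null variance recorded for EC-1.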

Truncation Analysis

| # | Finding | Tests | Observed | Spec |
| --- | --- | --- | --- | --- |
| 1 | JavaScript-heavy SPAs truncation ceiling | EC-1 r1 & r2, multiple sizes | Free tier: timeout - 0 bytes; Pro tier: truncated at 5,857 chars, ~1.5K tokens, clean ending at last link block; suggests ~6 KB or ~1.5K token ceiling specifically for SPA endpoints | SPAs truncated aggressively, not completely blocked; free tier timeouts mask Pro tier truncation behavior |
| 2 | Static HTML/Markdown pages have no detected ceiling | BL-1 through OP-4; SC-2 - 702 KB; OP-4 - 245 KB | Successfully returned 702,885 characters from SC-2; 245,465 characters from OP-4; no truncation observed on static content | No practical character ceiling detected for static docs; tested up to 700 KB |
| 3 | Output consistency size-dependent | BL-1, BL-2, SC-2, OP-4 | Small files, 1-20 KB: 2-3× variance across sessions, 1.9K→5.6K; large files, 80-256 KB: <1% variance, 702.8K identical, 245.5K identical | Fetch behavior reliability depends on size - small docs unreliable, large docs stable |
| 4 | Content selection is non-deterministic for small files, session-dependent | BL-1 r1-r4; BL-2 r1-r3 | Identical prompts in different chat sessions produced 1,953 → 5,595 → 4,100 → 5,500 chars on BL-1; new sessions returned larger content than original session | New chat sessions influence @Web output; conversation state affects fetch behavior |
| 5 | Same logical content, different formats, different sizes | BL-1 HTML vs BL-2 Markdown, both r1 | Both returned 1,953 chars despite different source format, HTML vs .md; later runs diverged - 5,595 vs 4,200 - suggesting format-dependent processing | Format affects fetch behavior; may process HTML and Markdown sources differently |
| 6 | Intelligent content filtering, not hard truncation | SC-4, EC-6 | SC-4, 30 KB page returned 28 KB excluding footer/nav/metadata; EC-6 returned full 71 KB including complex Markdown; always ends at section boundaries | For static content doesn’t truncate mid-content, filters non-essential structural elements, while preserving docs integrity |
| 7 | Agent’s self-reported completeness diverges from actual content | SC-3; BL-1 r3-r4; EC-1 r2 | SC-3: Agent reports “no truncation, complete reference” but content cuts mid-references section; EC-1 r2: Agent acknowledges truncation at ~6 KB despite 100 KB expected | Self-report of content completeness unreliable, agent perceives filtered excerpts as “complete” because internally valid |
| 8 | Redirect chains handled transparently | EC-3 | 5-level redirect chain successfully followed; returned final destination content - 850 chars JSON without truncation | Follows HTTP redirects without user awareness or latency penalty |
| 9 | H2 supported for static content | SC-2, OP-4, EC-6, BL-3 | Token counts range BL-1 488 to SC-2 175,721 with no observable limit; successfully returned 61K token document, OP-4, multiple times identically | For static pages: if token-based, ceiling is extremely high - 200K+; effectively no practical limit |
| 10 | H3 confirmed for static content | BL-1, BL-2, SC-2, SC-3, SC-4, r1-r3 | 8 tests matched H3: content selection respects Markdown section boundaries; truncation occurs at header boundaries, code fence closes, list endings | For static pages, uses intelligent, structure-aware content selection rather than char/token-based cutting |
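
The token figures above line up with a rule of thumb of roughly four characters per token: 1,953 chars against 488 tokens for BL-1, and 702,885 chars against 175,721 tokens for SC-2. That ratio is inferred from the reported numbers and is an assumption, not a documented framework formula or a real tokenizer. A minimal sketch:

```python
CHARS_PER_TOKEN = 4  # assumption inferred from the reported figures, not a tokenizer


def estimate_tokens(char_count: int) -> int:
    """Rough token estimate from a character count using a fixed chars-per-token ratio."""
    return char_count // CHARS_PER_TOKEN


print(estimate_tokens(1953))    # BL-1 run 1
print(estimate_tokens(702885))  # SC-2
```

Because the estimate is a fixed ratio, it cannot separate a character-based ceiling from a token-based one. Only a real tokenizer count could do that.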

Size-Dependent Behavior

Cursor’s fetch behavior diverges into two patterns, although the exact bifurcation point is unclear.
Variance may depend on content type and structure as well as size.

| Characteristic | High-Variance Cases | Stable Cases |
| --- | --- | --- |
| Examples | BL-1 87 KB; BL-2 20 KB; SC-1 40 KB | SC-2 80 KB - 702 KB; OP-4 256 KB - 245 KB; EC-6 61 KB |
| Consistency | 2-3× variance across sessions | <1% variance across sessions |
| Session Dependency | New chat, different results | Reproducible, same URL = same content |
| Reliability | Unreliable | Offer more consistency |
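
The split between the two columns can be expressed as a threshold on the relative spread between runs. The sketch below takes its 1% cutoff from the "<1% variance" figure above. The spread definition, (max - min) / max, and the function names are assumptions made for illustration.

```python
def relative_spread(run_chars):
    """Spread between the largest and smallest run, as a fraction of the largest."""
    largest = max(run_chars)
    return (largest - min(run_chars)) / largest


def classify(run_chars, threshold=0.01):
    """Label a test 'stable' when runs differ by less than the threshold, else 'high-variance'."""
    return "stable" if relative_spread(run_chars) < threshold else "high-variance"


print(classify([245000, 245465, 245466]))  # OP-4
print(classify([1953, 5595, 4100]))        # BL-1
```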

Perception Gap

Use char/token count comparisons to detect content subsetting, not agent self-report.

| Test | Size | Returned | Reported | Gap | Why “Complete” |
| --- | --- | --- | --- | --- | --- |
| SC-3 Wikipedia | 100 KB+ | 38 KB | “Complete reference” | 62% missing | Clean section boundary masks truncation |
| BL-1 MongoDB | 87 KB | 1.9 KB | “Internally valid” | 95% missing | No mid-sentence cutoff, valid Markdown |
| SC-4 Markdown Guide | 65 KB | 28 KB | “All syntax sections” | 57% missing | Footer intentionally filtered, excerpt coherent |
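
The size comparison recommended in this section reduces to one line of arithmetic. A minimal sketch, assuming the source size is known independently of the agent's report; `missing_percent` is a hypothetical helper name:

```python
def missing_percent(source_kb: float, returned_kb: float) -> int:
    """Share of the source absent from the returned content, as a whole percentage."""
    return round(100 * (1 - returned_kb / source_kb))


print(missing_percent(65, 28))   # SC-4
print(missing_percent(100, 38))  # SC-3, using 100 KB as a lower bound on source size
```

For SC-3 the source is listed as "100 KB+", so the result is a floor on the gap rather than an exact figure.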