Key Findings for Cursor’s Web Fetch Behavior, Raw
Test Workflow
- Run `python web_fetch_testing_framework.py --test {test ID} --track raw`
- Review terminal output
- Copy provided prompt requesting `@Web`* to fetch the URL and save verbatim output
- Open a new Cursor session in VS Code, paste prompt into the chat window
- Examine saved `raw_output{test ID}.txt` file
- Run `python3 web_fetch_verify_raw_results.py {test ID}` to calculate metrics
- Log structured metadata with metrics as described in `framework-reference.md`
- Ensure log results saved to `/results/raw/results.csv`
*Results logged as “Methods tested: `@Web`” reflect the user-facing prompt syntax. Post-analysis revealed that testing misused `@Web` as a fetch command rather than a context attachment. Cursor may autonomously call the backend mechanisms `WebFetch` and `mcp_web_fetch` regardless of `@Web` syntax; see the Friction Note for analysis.
Platform Limit Summary
| Limit | Observed |
|---|---|
| Hard Character Limit | Method-dependent: WebFetch MCP ~28 KB, urllib ~72 KB, none detected for other paths; tested up to 17 MB |
| Hard Token Limit | None detected; tested up to 6.68 M tokens with SC-2 raw HTML |
| Output Consistency, Same URL | Perfect reproducibility: BL-1/BL-2 identical across runs (same MD5); BL-3/OP-4 identical (same MD5) |
| Content Conversion Pattern | Non-deterministic: simple docs → Markdown (BL-1, SC-1, OP-3); complex/timeout → raw HTML (SC-2); raw Markdown → pass-through (EC-6) |
| Truncation Pattern | Method-specific: WebFetch MCP ~28 KB, urllib ~72 KB; respects structure, ends mid-word or at boundaries |
| Chars/Token Ratio Range | JSON 2.62 to clean Markdown 4.36; strong indicator of content type |
| Reference List Filtering | Deterministic selection: Wikipedia 252 refs → consistently selects ref #14, the first commercial source after govt sources |
| Redirect Chains | Successfully follows; tested 5-level redirect chain |
| Backend Routing | Multiple fetch paths: WebFetch (MCP-style), urllib, curl fallback; each with different size limits |
Results Details
| Model | Auto |
| Total Tests | 27 |
| Distinct URLs | 13 |
| Input Size Range | 2 KB–256 KB (expected raw source) |
| Output Size Range | 1 KB–17.6 MB (actual converted/fetched) |
| Truncation Detection | MD5 comparison, hexdump analysis, fence/brace counting, mid-word detection |
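
The MD5 comparison listed above can be sketched as follows. This is a minimal illustration, not the project's verification script: the function names `md5_hex` and `runs_identical` are hypothetical, and the sketch assumes each run's saved output has already been read into memory as bytes.

```python
import hashlib


def md5_hex(data: bytes) -> str:
    """Return the MD5 digest of the raw output as a hex string."""
    return hashlib.md5(data).hexdigest()


def runs_identical(outputs: list[bytes]) -> bool:
    """True when every run has the same byte count and the same MD5 digest."""
    fingerprints = {(len(o), md5_hex(o)) for o in outputs}
    return len(fingerprints) <= 1
```

Comparing byte count alongside the digest mirrors the "same MD5, same byte count" criterion used in the findings; the digest alone would already catch any difference in content.
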
Content Conversion Patterns
| Test | Input Type | Expected | Received | Format | Conversion |
|---|---|---|---|---|---|
| BL-1 | HTML | 85 KB | 4.8 KB | Markdown | 94% reduction |
| BL-2 | Markdown | 20 KB | 4.8 KB | Markdown | 76% reduction |
| SC-2 | HTML complex | 80 KB | 17.6 MB | Raw HTML/JS | 22,000% expansion, timeout → curl fallback |
| SC-3 | HTML | 100 KB | 38 KB | Markdown | 62% reduction, ref filtering |
| OP-4 | HTML | 250 KB | 245 KB | Markdown | 2% reduction |
| EC-6 | Raw .md | 60 KB | 73 KB | Markdown pass-through | 22% expansion, version drift |
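
The Conversion column is plain arithmetic on the Expected and Received sizes. A small sketch of that calculation, assuming both sizes are expressed in the same unit (KB here, with 17.6 MB taken as 17,600 KB); the function name `size_change_percent` is hypothetical:

```python
def size_change_percent(expected_kb: float, received_kb: float) -> float:
    """Signed size change relative to the expected source size.

    Negative values are reductions, positive values are expansions.
    """
    if expected_kb <= 0:
        raise ValueError("expected size must be positive")
    return (received_kb - expected_kb) / expected_kb * 100


# Rows from the table above, as (expected KB, received KB).
rows = {"BL-1": (85, 4.8), "SC-2": (80, 17_600), "EC-6": (60, 73)}
for test_id, (expected, received) in rows.items():
    print(test_id, f"{size_change_percent(expected, received):+.0f}%")
```

The SC-2 figure works out to roughly 21,900%, which the table rounds to 22,000%.
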
Chars/Token Ratio Analysis
| Content Type | Chars/Token | Tests | Interpretation |
|---|---|---|---|
| Clean Markdown Prose | 4.13–4.36 | BL-1, BL-2, SC-1, EC-1, EC-6 | Natural language, efficient encoding |
| Documentation with Code | 3.91–4.37 | SC-4, OP-4, BL-3 | Mixed content, moderate efficiency |
| Table-Heavy Data | 3.06 | SC-3 | Repetitive structure, less efficient |
| Raw HTML/JS | 2.65 | SC-2 | Heavy markup, symbols, very inefficient |
| JSON | 2.62 | EC-3 | Structural chars, lowest efficiency |
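
A rough classifier built on this ratio might look like the sketch below. It uses the cutoffs reported in the Truncation Analysis findings (below 3.0 for code/markup, above 4.0 for prose); the function names and the "mixed/structured" label for the middle band are assumptions made for illustration.

```python
def chars_per_token(char_count: int, token_count: int) -> float:
    """Average number of characters per token for one fetched output."""
    if token_count <= 0:
        raise ValueError("token_count must be positive")
    return char_count / token_count


def classify_by_ratio(ratio: float) -> str:
    """Coarse content-type guess from the chars/token ratio."""
    if ratio < 3.0:
        return "code/markup"
    if ratio > 4.0:
        return "prose"
    return "mixed/structured"  # hypothetical label for the 3.0-4.0 band


# SC-2 raw HTML: 17,691,628 chars over 6,680,678 tokens.
sc2_ratio = chars_per_token(17_691_628, 6_680_678)
print(round(sc2_ratio, 2), classify_by_ratio(sc2_ratio))
```

Because the documentation-with-code range (3.91–4.37) straddles the 4.0 cutoff, the ratio is best treated as a hint rather than a definitive label.
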
HTTP Content Negotiation
Cursor’s web fetch mechanisms request `text/markdown` via the `Accept` header, signaling a preference for Markdown over HTML when the server supports content negotiation. Cursor sends `Accept: text/markdown, text/html...` with Markdown listed first and no explicit `q` parameter, so it takes the default and highest weight of 1.0; HTML and other types follow as lower-weighted fallbacks.
Impact on results:
- Servers that ignore `Accept`, typical for normal websites, still return HTML
- Servers that support content negotiation, such as some “Markdown-first” or agent-oriented setups, may return `Content-Type: text/markdown`, which Cursor can use without HTML cleanup
- Raw track result artifacts show this header structure, such as `raw_output_EC-3.txt`: `"Accept": "text/markdown,text/html;q=0.9,application/xhtml+xml;q=0.8,application/xml;q=0.7"`
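
To make the preference order concrete, the sketch below parses the `Accept` value captured in `raw_output_EC-3.txt` into (media type, weight) pairs. It is a simplified illustration of how a server might rank the types, not Cursor's or any server's actual code: `parse_accept` is a hypothetical helper, and it ignores parameters other than `q` and does not handle malformed weights.

```python
def parse_accept(header: str) -> list[tuple[str, float]]:
    """Split an Accept header into (media_type, q) pairs, highest q first.

    A media type without an explicit q parameter gets the default weight 1.0.
    """
    prefs = []
    for part in header.split(","):
        fields = [f.strip() for f in part.split(";")]
        media_type = fields[0]
        if not media_type:
            continue
        q = 1.0
        for param in fields[1:]:
            name, _, value = param.partition("=")
            if name.strip().lower() == "q":
                q = float(value.strip())
        prefs.append((media_type, q))
    # sorted() is stable, so types with equal weights keep their header order.
    return sorted(prefs, key=lambda pref: pref[1], reverse=True)


observed = "text/markdown,text/html;q=0.9,application/xhtml+xml;q=0.8,application/xml;q=0.7"
for media_type, q in parse_accept(observed):
    print(f"{q:.1f}  {media_type}")
```

With the observed header, `text/markdown` comes out on top at 1.0, followed by `text/html` at 0.9, matching the Markdown-first preference described above.
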
| Test | Server Response | Cursor Behavior | Output |
|---|---|---|---|
| EC-6, GitHub raw .md | `Content-Type: text/plain; charset=utf-8` | Passed through as Markdown | 73 KB |
| BL-1, HTML docs | HTML | Converted to Markdown | 4.8 KB from 85 KB source |
| SC-2, timeout → curl fallback | HTML | No conversion | 17.6 MB raw HTML |
Truncation Analysis
| # | Finding | Tests | Observed | Spec |
|---|---|---|---|---|
| 1 | Truncation limits are fetch-method-dependent, not universal | SC-4, EC-6, OP-4 | SC-4 (WebFetch MCP) truncated at 27,890 chars; EC-6 (urllib) truncated at 72,600 chars; OP-4/BL-3 (different path) showed no truncation at 245 KB | `@Web` routes to multiple backends with different size constraints: WebFetch MCP ~28 KB ceiling, urllib ~72 KB ceiling, other paths 240 KB+ with no ceiling detected |
| 2 | Markdown conversion is format-agnostic | BL-1 (HTML), BL-2 (.md) | Both URLs return identical 4,817-byte output (same MD5) despite different source formats | `@Web` normalizes HTML and Markdown sources to identical output; the conversion pipeline is format-blind |
| 3 | Perfect reproducibility for same URL | BL-1, BL-2, BL-3, OP-4 | Identical MD5 checksums across multiple runs on same URL | Raw track has perfect run-to-run consistency: the same URL always produces identical output (same MD5, same byte count) |
| 4 | Intelligent reference filtering, not truncation | SC-3 | Wikipedia page with 252 references consistently returns reference #14, “Moody’s Analytics”, the first commercial source after 13 institutional sources | `@Web` applies deterministic content heuristics: preserves core content, filters govt/academic refs |
| 5 | Complex pages may trigger curl fallback | SC-2, EC-1 | WebFetch timeout → autonomous curl fallback; returns 16-17 MB raw HTML instead of filtered Markdown | On timeout, `@Web` may substitute curl, returning unfiltered HTML; output format and size are unpredictable on complex pages |
| 6 | Chars/token ratio reliably indicates content type | All tests | Strong correlation: JSON 2.62, raw HTML 2.65, tables 3.06, docs 3.91-4.37; <3.0 = code/markup, >4.0 = prose | Chars/token metric enables content-type classification without parsing; useful for automated analysis |
| 7 | Large docs, minimal conversion overhead | OP-4, BL-3 | Expected 250 KB, received 245 KB Markdown; only 2% reduction despite rich structure (241 code blocks, 237 headers) | `@Web` preserves large structured docs nearly verbatim, with no aggressive filtering on multi-section tutorials |
| 8 | Truncation respects structure | SC-4, EC-6 | SC-4 ends mid-word at “updated”, with an alphanumeric final char; EC-6 ends mid-sentence but at a clean UTF-8 boundary; both are incomplete but structurally valid | When truncation occurs, it may cut mid-content but preserves character boundaries |
| 9 | JS-heavy SPAs extract rendered content | EC-1 | Expected 100 KB raw HTML/JS, received 5.7 KB Markdown (94% reduction); successfully extracted doc content despite SPA architecture | `@Web` handles client-side rendering: extracts semantic content, strips JS overhead |
| 10 | Token-based ceiling not detected | SC-2 | Successfully returned 17.6 MB (6,680,678 tokens) of raw HTML | If token limits exist, the ceiling is extremely high (above ~6.7 M tokens); char/method limits dominate |
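
The fence/brace counting and mid-word checks used to spot truncation can be approximated with a few string heuristics. The sketch below is an assumption about how such checks might be written, not the logic of `web_fetch_verify_raw_results.py`; all function names are hypothetical, and a complete document that happens to end on a letter would also trip the mid-word signal.

```python
FENCE = "`" * 3  # Markdown code-fence marker


def has_unclosed_fence(text: str) -> bool:
    """An odd number of fence lines means a code block was opened but never closed."""
    fence_lines = [ln for ln in text.splitlines() if ln.lstrip().startswith(FENCE)]
    return len(fence_lines) % 2 == 1


def brace_imbalance(text: str) -> int:
    """Count of '{' minus '}'; positive values are typical of JSON cut off early."""
    return text.count("{") - text.count("}")


def ends_mid_word(text: str) -> bool:
    """True when the final character is alphanumeric (no closing punctuation or newline)."""
    return bool(text) and text[-1].isalnum()


def truncation_signals(text: str) -> dict[str, bool]:
    """Collect the three heuristic signals for one raw output."""
    return {
        "unclosed_fence": has_unclosed_fence(text),
        "unbalanced_braces": brace_imbalance(text) != 0,
        "ends_mid_word": ends_mid_word(text),
    }
```

These signals only suggest truncation; confirming it still requires comparing the output against the expected source size, as in findings 1 and 8.
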
Method-Specific Behavior
| Fetch Backend | Identified | Size Limit | Conversion | Reliability |
|---|---|---|---|---|
| WebFetch MCP | SC-4, SC-3, OP-3 | ~28 KB | Markdown | High - consistent results |
| urllib.request | EC-6 | ~72 KB | Pass-through .md | High - clean truncation boundary |
| curl | SC-2, EC-1 | None detected (17 MB+) | Raw HTML, no conversion | Low - only on timeout |
| Unknown Path | OP-4, BL-3 | None detected (245 KB+) | Markdown | High - perfect reproducibility |
Content Filtering Heuristics
Beyond basic truncation, Cursor applies intelligent content selection:
| Heuristic | Example | Behavior |
|---|---|---|
| Reference Deduplication | SC-3, Wikipedia | Deterministic: 252 refs → 1 commercial source |
| Footer/Nav Stripping | SC-4, Markdown Guide | Reduction from 30 KB page → 28 KB |
| Boilerplate Reduction | BL-1, MongoDB HTML | Reduction from 85 KB → 4.8 KB |
| Core Content Preservation | OP-4, Tutorial | 241 code blocks intact, 250 KB → 245 KB |
Perception Gap
Raw track measurements reveal that the Cursor-interpreted track under-reports the size of the fetched content in some tests:
| Test | Raw Track | Interpreted Track | Gap |
|---|---|---|---|
| BL-1 | 4,817 chars | Run 1: 1,953 chars | Interpreted shows subset; UI reformats Markdown |
| SC-2 | 17,691,628 chars | Run 2: 702,885 chars | Interpreted shows filtered; raw shows curl fallback |
| OP-4 | 245,465 chars | 245,453 chars | Near-perfect match on large docs |
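
One way to quantify the gap is to express the interpreted character count as a share of the raw count. This is an illustrative calculation on the figures in the table, with `interpreted_coverage` as a hypothetical helper name:

```python
def interpreted_coverage(raw_chars: int, interpreted_chars: int) -> float:
    """Interpreted-track size as a percentage of the raw-track size."""
    if raw_chars <= 0:
        raise ValueError("raw_chars must be positive")
    return interpreted_chars / raw_chars * 100


# (raw chars, interpreted chars) taken from the table above.
gaps = {
    "BL-1": (4_817, 1_953),
    "SC-2": (17_691_628, 702_885),
    "OP-4": (245_465, 245_453),
}
for test_id, (raw, interpreted) in gaps.items():
    print(test_id, f"{interpreted_coverage(raw, interpreted):.1f}%")
```

On these numbers the interpreted track covers about 40.5% of the raw output for BL-1 and about 4.0% for SC-2, while OP-4 is within a few characters of a full match.
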
Agent Ecosystem Testing