
Key Findings for Cursor’s Web Fetch Behavior, Raw


Raw Track Test Workflow:

1. Run `python web_fetch_testing_framework.py --test {test ID} --track raw`
2. Review the terminal output
3. Copy the provided prompt asking `@Web`* to fetch the URL and save the verbatim output
4. Open a new Cursor chat session and paste the prompt into the chat window
5. Examine the saved `raw_output{test ID}.txt` file
6. Compute exact metrics: run `python3 web_fetch_verify_raw_results.py {test ID}`
7. Log structured metadata with precise measurements: file size, MD5, token count via tiktoken
8. Confirm results are saved to `cursor-web-fetch/results/raw/results.csv`

*All results logged as “Methods tested: @Web” reflect the user-facing syntax used in prompts. However, post-analysis revealed that @Web was misused as a fetch command rather than a context-attachment mechanism. The actual backend mechanisms (WebFetch, mcp_web_fetch) may have been invoked autonomously by Cursor regardless of @Web syntax; see the Friction Note for the full impact analysis.
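Steps 6–8 of the workflow above boil down to hashing and measuring the saved file, then appending a CSV row. A minimal sketch, assuming nothing about the real script: the field names and CSV layout are my own, not the actual `web_fetch_verify_raw_results.py` schema, and token counting is stubbed to keep the sketch dependency-free.

```python
import csv
import hashlib
from pathlib import Path

def log_raw_result(test_id: str, output_path: str, results_csv: str) -> dict:
    """Measure a saved raw_output file and append one row to the results CSV."""
    data = Path(output_path).read_bytes()
    row = {
        "test_id": test_id,
        "bytes": len(data),
        "md5": hashlib.md5(data).hexdigest(),
        # The real workflow counts tokens via tiktoken; len/4 is a rough
        # stand-in so this sketch has no third-party dependency.
        "approx_tokens": len(data) // 4,
    }
    csv_path = Path(results_csv)
    write_header = not csv_path.exists()
    with csv_path.open("a", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=list(row))
        if write_header:
            writer.writeheader()
        writer.writerow(row)
    return row
```

Appending (rather than rewriting) the CSV mirrors step 8's accumulation of results across test IDs.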


Platform Limit Summary

| Limit | Observed |
| --- | --- |
| Hard Character Limit | Method-dependent: ~28KB (WebFetch MCP), ~72KB (urllib), none detected for other paths; tested up to 17 MB |
| Hard Token Limit | None detected; tested up to 6.68M tokens (SC-2 raw HTML) |
| Output Consistency (Same URL) | Perfect reproducibility: BL-1/BL-2 identical across runs (same MD5); BL-3/OP-4 identical (same MD5) |
| Content Conversion Pattern | Non-deterministic: simple docs → Markdown (BL-1, SC-1, OP-3); complex/timeout → raw HTML (SC-2); raw Markdown → pass-through (EC-6) |
| Truncation Pattern | Method-specific when it occurs: WebFetch MCP ~28KB, urllib ~72KB; may end mid-word but preserves clean character boundaries |
| Chars/Token Ratio Range | 2.62 (JSON) to 4.36 (clean Markdown); strong indicator of content type |
| Reference List Filtering | Deterministic selection: Wikipedia 252 refs → consistently selects ref #14, the first commercial source after govt sources |
| Redirect Chains | Successfully follows; tested a 5-level redirect chain |
| Backend Routing | Multiple fetch paths: WebFetch (MCP-style), urllib, curl fallback; each with different size limits |

Results Details

| Metric | Value |
| --- | --- |
| Model | Auto |
| Total Tests | 27 |
| Distinct URLs | 13 |
| Input Size Range | 2KB–256KB (expected raw source) |
| Output Size Range | 1KB–17.6 MB (actual converted/fetched) |
| Truncation Detection | MD5 comparison, hexdump analysis, fence/brace counting, mid-word detection |
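The detection heuristics in the last row can be sketched as simple string checks. The function name and exact thresholds here are my approximations, not the actual `web_fetch_verify_raw_results.py` logic.

```python
# Minimal sketch of the truncation-detection heuristics listed above
# (fence/brace counting, mid-word detection).

def truncation_signals(text: str) -> dict:
    return {
        # An odd number of ``` fences means an unterminated code block.
        "unclosed_fence": text.count("```") % 2 == 1,
        # Unbalanced braces suggest cut-off JSON or code.
        "brace_imbalance": text.count("{") - text.count("}"),
        # A final alphanumeric character hints at a mid-word cut
        # (SC-4's output ends on the word "updated").
        "mid_word_ending": bool(text) and text[-1].isalnum(),
    }
```

For example, `truncation_signals("...was last updated")["mid_word_ending"]` is `True`, while output ending in punctuation or a newline is not flagged.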

Content Conversion Patterns

| Test | Input Type | Expected | Received | Format | Conversion |
| --- | --- | --- | --- | --- | --- |
| BL-1 | HTML | 85KB | 4.8KB | Markdown | 94% reduction |
| BL-2 | Markdown | 20KB | 4.8KB | Markdown | 76% reduction |
| SC-2 | HTML (complex) | 80KB | 17.6 MB | Raw HTML/JS | 22,000% expansion (timeout → curl fallback) |
| SC-3 | HTML | 100KB | 38KB | Markdown | 62% reduction + ref filtering |
| OP-4 | HTML | 250KB | 245KB | Markdown | 2% reduction |
| EC-6 | Raw .md | 60KB | 73KB | Markdown (pass-through) | 22% expansion (version drift) |
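The Conversion column is just the signed percent change from expected source size to received output. A quick sketch (helper names are mine):

```python
# Derive the Conversion column from expected vs. received byte counts.

def conversion_pct(expected_bytes: int, received_bytes: int) -> float:
    """Negative = reduction, positive = expansion, as a percentage."""
    return (received_bytes - expected_bytes) / expected_bytes * 100

def describe(expected_bytes: int, received_bytes: int) -> str:
    pct = conversion_pct(expected_bytes, received_bytes)
    label = "expansion" if pct > 0 else "reduction"
    return f"{abs(round(pct))}% {label}"

# BL-1: 85KB HTML -> 4.8KB Markdown
print(describe(85_000, 4_800))        # 94% reduction
# SC-2: 80KB expected -> 17.6 MB raw HTML; the table's 22,000% figure
# comes from the exact byte counts rather than these rounded sizes.
print(describe(80_000, 17_600_000))
```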

Chars/Token Ratio Analysis

| Content Type | Chars/Token | Tests | Interpretation |
| --- | --- | --- | --- |
| Clean Markdown prose | 4.13–4.36 | BL-1, BL-2, SC-1, EC-1, EC-6 | Efficient encoding, natural language |
| Documentation + code | 3.91–4.37 | SC-4, OP-4, BL-3 | Mixed content, moderate efficiency |
| Table-heavy data | 3.06 | SC-3 | Repetitive structure, less efficient |
| Raw HTML/JS | 2.65 | SC-2 | Heavy markup and symbols, very inefficient |
| JSON | 2.62 | EC-3 | Structural chars, lowest efficiency |
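These bands suggest a simple classifier. A sketch, with caveats: the 3.0/4.0 boundaries are my simplification (the docs-plus-code band in the table straddles 4.0), and the ratio computation mirrors the workflow's tiktoken counting.

```python
# Classify fetched content by its chars/token ratio.

def chars_per_token(text: str, encoding_name: str = "cl100k_base") -> float:
    import tiktoken  # third-party; the test workflow uses it for token counts
    enc = tiktoken.get_encoding(encoding_name)
    return len(text) / max(1, len(enc.encode(text)))

def classify(ratio: float) -> str:
    # Thresholds are approximate band edges from the table above.
    if ratio < 3.0:
        return "code/markup (JSON, raw HTML/JS)"
    if ratio < 4.0:
        return "mixed/tabular (docs + code, tables)"
    return "prose (clean Markdown)"
```

Keeping `classify` separate from the tokenizer lets the ratios already logged in `results.csv` be re-classified without refetching anything.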

HTTP Content Negotiation

Cursor’s web fetch mechanisms request text/markdown via the Accept header, signaling a preference for Markdown over HTML when the server supports content negotiation. Cursor sends `Accept: text/markdown, text/html...` with Markdown listed first (highest implicit q value) and HTML and other types as fallback preferences. The header as sent:

  "Accept": "text/markdown,text/html;q=0.9,application/xhtml+xml;q=0.8,application/xml;q=0.7"

Observed behavior:

| Test | Server Response | Cursor Behavior | Output |
| --- | --- | --- | --- |
| EC-6: GitHub raw .md URL | Content-Type: text/plain; charset=utf-8 | Passed through as Markdown | 73KB |
| BL-1: HTML docs | HTML | Converted to Markdown | 4.8KB from 85KB source |
| SC-2: timeout → curl fallback | HTML | No conversion | 17.6 MB raw HTML |

Truncation Analysis

| # | Finding | Tests | Observed | Spec |
| --- | --- | --- | --- | --- |
| 1 | Truncation limits are fetch-method-dependent, not universal | SC-4 ~28KB, EC-6 ~72KB, OP-4 245KB | SC-4 (WebFetch MCP) truncated at 27,890 chars; EC-6 (urllib) truncated at 72,600 chars; OP-4/BL-3 (different path) not truncated at 245KB | @Web routes to multiple backends with different size constraints: WebFetch MCP ~28KB ceiling, urllib ~72KB ceiling, other paths 240KB+ with no ceiling detected |
| 2 | Markdown conversion is format-agnostic | BL-1 (HTML) vs BL-2 (.md) | Both URLs return identical 4,817-byte output (same MD5) despite different source formats | @Web normalizes HTML and Markdown sources to identical output; the conversion pipeline is format-blind |
| 3 | Perfect reproducibility for same URL | BL-1, BL-2, BL-3, OP-4 | Identical MD5 checksums across multiple runs of the same URL (incl. OP-4 & BL-3) | Raw track has perfect run-to-run consistency: the same URL always produces identical output (same MD5, same byte count) |
| 4 | Intelligent reference filtering, not truncation | SC-3 | Wikipedia page with 252 references consistently returns reference #14 (“Moody’s Analytics”), the first commercial/corporate source after 13 govt/institutional sources | @Web applies deterministic content heuristics: preserves core content, filters govt/academic refs, selects the first commercial source |
| 5 | Complex pages may trigger curl fallback | SC-2, EC-1 | WebFetch timeout → autonomous curl fallback; returns 16–17 MB raw HTML instead of filtered Markdown | On timeout, @Web may substitute curl, returning unfiltered HTML; output format/size is unpredictable on complex pages |
| 6 | Chars/token ratio reliably indicates content type | All tests | Strong correlation: JSON 2.62, raw HTML 2.65, tables 3.06, docs 3.91–4.37; <3.0 = code/markup, >4.0 = prose | Chars/token metric enables content-type classification without parsing; useful for automated analysis |
| 7 | Large documents have minimal conversion overhead | OP-4, BL-3 (245KB output) | Expected 250KB, received 245KB Markdown; only 2% reduction despite rich structure (241 code blocks, 237 headers) | @Web preserves large structured docs nearly verbatim; no aggressive filtering on multi-section tutorials |
| 8 | Truncation respects structure when it occurs | SC-4, EC-6 | SC-4 ends mid-word (“updated”, alphanumeric final char); EC-6 ends mid-sentence but at a clean UTF-8 boundary; both incomplete but structurally valid | Truncation may cut mid-content but preserves character boundaries; no corrupted UTF-8 |
| 9 | JS-heavy SPAs extract rendered content | EC-1 | Expected 100KB raw HTML/JS, received 5.7KB Markdown (94% reduction); successfully extracted doc content despite SPA architecture | @Web handles client-side rendering: extracts semantic content, strips JS overhead |
| 10 | No token-based ceiling detected | SC-2 (6.68M tokens) | Successfully returned 17.6 MB (6,680,678 tokens) of raw HTML; no evidence of token-based truncation | If token limits exist, the ceiling is extremely high (7M+); character/method limits dominate |

Method-Specific Behavior

| Fetch Backend | Identified In | Size Limit | Conversion | Reliability |
| --- | --- | --- | --- | --- |
| WebFetch MCP | SC-4, SC-3, OP-3 | ~28KB | Markdown | High (consistent results) |
| urllib.request | EC-6 | ~72KB | Pass-through .md | High (clean truncation boundary) |
| curl | SC-2, EC-1 | None detected (17 MB+) | Raw HTML (no conversion) | Low (invoked only on timeout) |
| Unknown path | OP-4, BL-3 | None detected (245KB+) | Markdown | High (perfect reproducibility) |
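The observed ceilings fold into a simple heuristic for flagging suspiciously sized outputs. This is my own utility under the measured limits above, not part of the test framework.

```python
# Approximate per-backend size ceilings measured in this report; None means
# no ceiling was detected within the tested range.
OBSERVED_CEILING_BYTES = {
    "webfetch_mcp": 28 * 1024,   # SC-4 truncated at 27,890 chars
    "urllib": 72 * 1024,         # EC-6 truncated at 72,600 chars
    "curl": None,                # 17 MB+ returned without truncation
    "unknown": None,             # 245KB+ returned without truncation
}

def likely_truncated(backend: str, output_bytes: int, slack: float = 0.05) -> bool:
    """Flag output landing within `slack` of a known backend ceiling."""
    ceiling = OBSERVED_CEILING_BYTES.get(backend)
    if ceiling is None:
        return False
    return output_bytes >= ceiling * (1 - slack)
```

For example, `likely_truncated("webfetch_mcp", 27_890)` is `True`, while the same byte count via the curl path is not flagged.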

Content Filtering Heuristics

Cursor applies intelligent content selection beyond basic truncation:

| Heuristic | Example | Behavior |
| --- | --- | --- |
| Reference deduplication | SC-3 | Wikipedia 252 refs → 1 commercial source (deterministic) |
| Footer/nav stripping | SC-4 | Markdown Guide 30KB page → 28KB, excluding footer/metadata |
| Boilerplate reduction | BL-1 | MongoDB HTML 85KB → 4.8KB; strips CSS, nav, sidebar |
| Core content preservation | OP-4 | Tutorial 250KB → 245KB; 241 code blocks intact |

Perception Gap

Raw track measurements reveal that the Cursor-interpreted track underreports:

| Test | Raw Track | Interpreted Track | Gap |
| --- | --- | --- | --- |
| BL-1 | 4,817 chars | 1,953 chars | Interpreted shows a subset; UI reformats Markdown |
| SC-2 | 17,691,628 chars | 702,885 chars | Interpreted shows filtered content; raw shows the curl fallback |
| OP-4 | 245,465 chars | 245,453 chars | Near-perfect match on large docs |

Implication for agents: the raw track provides ground-truth measurements; the interpreted track reflects a UI-rendered subset.