Key Findings for Cursor’s Web Fetch Behavior, Raw
1. Run `python web_fetch_testing_framework.py --test {test ID} --track raw`
2. Review the terminal output
3. Copy the provided prompt asking `@Web`* to fetch the URL and save verbatim output
4. Open a new Cursor chat session and paste the prompt into the chat window
5. Examine the saved `raw_output{test ID}.txt` file
6. Compute exact metrics - run `python3 web_fetch_verify_raw_results.py {test ID}`
7. Log structured metadata with precise measurements - file size, MD5, token count via tiktoken
8. Ensure log results are saved to `cursor-web-fetch/results/raw/results.csv`
*All results logged as “Methods tested:
@Web” reflect user-facing syntax used in prompts. However, post-analysis revealed@Webwas misused as a fetch command rather than a context attachment mechanism. The actual backend mechanisms:WebFetch,mcp_web_fetchmay have been invoked autonomously by Cursor regardless of@Web syntax- read Friction Note for full impact analysis.
Platform Limit Summary
| Limit | Observed |
|---|---|
| Hard Character Limit | Method-dependent: ~28KB - WebFetch MCP, ~72KB - urllib, none detected for other paths; tested up to 17 MB |
| Hard Token Limit | None detected - tested up to 6.68M tokens - SC-2 raw HTML |
| Output Consistency Same URL |
Perfect reproducibility: BL-1/BL-2 identical across runs,same MD5, BL-3/OP-4 identical, same MD5 |
| Content Conversion Pattern | Non-deterministic: Simple docs → Markdown - BL-1, SC-1, OP-3; Complex/timeout → raw HTML - SC-2;Raw Markdown → pass-through - EC-6 |
| Truncation Pattern | When occurs, method-specific: WebFetch MCP ~28KB, urllib ~72KB; respects structure - ends mid-word or at boundaries |
| Chars/Token Ratio Range |
2.62 - JSON to 4.36 - clean Markdown - strong indicator of content type |
| Reference List Filtering | Deterministic selection: Wikipedia 252 refs → consistently selects ref #14, which is the first commercial source after govt sources |
| Redirect Chains | Successfully follows, tested 5-level redirect chain |
| Backend Routing | Multiple fetch paths: WebFetch - MCP-style, urllib, curl fallback; each with different size limits |
Results Details
| Model | Auto |
| Total Tests | 27 |
| Distinct URLs | 13 |
| Input Size Range | 2KB–256KB - expected raw source |
| Output Size Range | 1KB–17.6 MB actual converted/fetched |
| Truncation Detection | MD5 comparison, hexdump analysis, fence/brace counting, mid-word detection |
Content Conversion Patterns
| Test | Input Type | Expected | Received | Format | Conversion |
|---|---|---|---|---|---|
| BL-1 | HTML | 85KB | 4.8KB | Markdown | 94% reduction |
| BL-2 | Markdown | 20KB | 4.8KB | Markdown | 76% reduction |
| SC-2 | HTML - complex | 80KB | 17.6 MB | Raw HTML/JS | 22,000% expansion - timeout→curl fallback |
| SC-3 | HTML | 100KB | 38KB | Markdown | 62% reduction + ref filtering |
| OP-4 | HTML | 250KB | 245KB | Markdown | 2% reduction |
| EC-6 | Raw .md |
60KB | 73KB | Markdown pass-through |
22% expansion version drift |
Chars/Token Ratio Analysis
| Content Type | Chars/Token | Tests | Interpretation |
|---|---|---|---|
| Clean Markdown Prose | 4.13–4.36 | BL-1, BL-2, SC-1, EC-1, EC-6 |
Efficient encoding, natural language |
| Documentation + Code | 3.91–4.37 | SC-4, OP-4, BL-3 |
Mixed content, moderate efficiency |
| Table-Heavy Data | 3.06 | SC-3 |
Repetitive structure, less efficient |
| Raw HTML/JS | 2.65 | SC-2 |
Heavy markup, symbols, very inefficient |
| JSON | 2.62 | EC-3 |
Structural chars, lowest efficiency |
HTTP Content Negotiation
Cursor’s web fetch mechanisms request text/markdown via the Accept header,
signaling a preference for Markdown over HTML when the server supports content
negotiation. Cursor sends Accept: text/markdown, text/html... with
Markdown listed first - highest implicit q value, with HTML and other types as
fallback preferences. Impact on results:
- Servers that ignore
Accept, typical for normal websites, still return HTML - Servers that support content negotiation, some “Markdown-first” or agent-oriented
setups may return
Content-Type: text/markdown; Cursor can use without HTML cleanup - Raw track result artifacts show this header structure, such as
raw_output_EC-3.txt:
"Accept": "text/markdown,text/html;q=0.9,application/xhtml+xml;q=0.8,application/xml;q=0.7"
Observed behavior:
| Test | Server Response | Cursor Behavior | Output |
|---|---|---|---|
| EC-6: GitHub raw .md URL |
Content-Type: text/plain; charset=utf-8 |
Passed through as Markdown |
73KB |
| BL-1: HTML docs | HTML | Converted to Markdown | 4.8KB from 85KB source |
SC-2: timeout → curl fallback |
HTML | No conversion | 17.6 MB raw HTML |
Truncation Analysis
| # | Finding | Tests | Observed | Spec |
|---|---|---|---|---|
| 1 | Truncation limits are fetch-method-dependent, not universal |
SC-4 ~28KB,EC-6 ~72KB,OP-4 245KB |
SC-4 WebFetch MCP truncated at 27,890 chars; EC-6 urllib truncated at 72,600 chars; OP-4/BL-3, different path, no truncation at 245KB |
@Web routes to multiple backends with different size constraints: WebFetch MCP ~28KB ceiling, urllib ~72KB ceiling, other paths 240KB+ no ceiling detected |
| 2 | Markdown conversion is format-agnostic |
BL-1 HTML vsBL-2 .md |
Both URLs return identical 4,817-byte output, same MD5 despite different source formats |
@Web normalizes HTML and Markdown sources to identical output - conversion pipelineis format-blind |
| 3 | Perfect reproducibility for same URL | BL-1 BL-2 BL-3 OP-4 |
Identical MD5 checksums across multiple runs on same URL; OP-4 & BL-3 |
Raw track has perfect run-to-run consistency - same URL always produces identical output - same MD5, same byte count |
| 4 | Intelligent reference filtering, not truncation |
SC-3 |
Wikipedia page with 252 references consistently returns reference #14 “Moody’s Analytics” - first commercial/corporate source after 13 govt/institutional sources | @Web applies deterministic content heuristics: preserves core content, filters govt/academic refs, selects first commercial source |
| 5 | Complex pages may trigger curl fallback |
SC-2 EC-1 |
WebFetch timeout → autonomous curl fallback; returns 16-17 MB raw HTML instead of filtered Markdown |
On timeout, @Web may substitute curl, returning unfiltered HTML - output format/size unpredictable on complex pages |
| 6 | Chars/token ratio reliably indicates content type | All tests | Strong correlation: JSON 2.62, Raw HTML 2.65, Tables 3.06, Docs 3.91-4.37; <3.0 = code/markup, >4.0 = prose | Chars/token metric enables content-type classification without parsing - useful for automated analysis |
| 7 | Large documents have minimal conversion overhead | OP-4, BL-3245KB output |
Expected 250KB, received 245KB Markdown - only 2% reduction despite rich structure - 241 code blocks, 237 headers | @Web preserves large structured docs nearly verbatim - no aggressive filtering on multi-section tutorials |
| 8 | Truncation respects structure when it occurs | SC-4, EC-6 |
SC-4 ends mid-word “updated” - alphanumeric final char; EC-6 ends mid-sentence but clean UTF-8 boundary; both incomplete but structurally valid |
When truncation occurs, may cut mid-content but preserves character boundaries - no corrupted UTF-8 |
| 9 | JS-heavy SPAs extract rendered content | EC-1 |
Expected 100KB raw HTML/JS, received 5.7KB Markdown - 94% reduction; successfully extracted doc content despite SPA architecture | @Web handles client-side rendering - extracts semantic content, strips JS overhead |
| 10 | No token-based ceiling detected | SC-26.68M tokens |
Successfully returned 17.6 MB - 6,680,678 tokens raw HTML - no evidence of token-based truncation | If token limits exist, ceiling is extremely high - 7M+; character/method limits dominate |
Method-Specific Behavior
| Fetch Backend | Identified In | Size Limit | Conversion | Reliability |
|---|---|---|---|---|
WebFetch MCP |
SC-4SC-3,OP-3 |
~28KB | Markdown | High - consistent results |
urllib.request |
EC-6 |
~72KB | Pass-through .md |
High - clean truncation boundary |
curl |
SC-2,EC-1 |
None detected - 17 MB+ | Raw HTML - no conversion | Low - only on timeout |
| Unknown path | OP-4,BL-3 |
None detected 245KB+ | Markdown | High - perfect reproducibility |
Content Filtering Heuristics
Cursor applies intelligent content selection beyond basic truncation:
| Heuristic | Example | Behavior |
|---|---|---|
| Reference Deduplication | SC-3 Wikipedia |
252 refs → 1 commercial source - deterministic |
| Footer/Nav Stripping | SC-4 Markdown Guide |
30KB page → 28KB excluding footer/metadata |
| Boilerplate Reduction | BL-1 MongoDB HTML |
85KB → 4.8KB, strips CSS, nav, sidebar |
| Core Content Preservation | OP-4 Tutorial |
250KB → 245KB, 241 code blocks intact |
Perception Gap
Raw track measurements reveal that the Cursor-interpreted track underreports:
| Test | Raw Track | Interpreted Track | Gap |
|---|---|---|---|
BL-1 |
4,817 chars | 1,953 chars r1 |
Interpreted shows subset; UI reformats Markdown |
SC-2 |
17,691,628 chars | 702,885 chars r2 |
Interpreted shows filtered; raw shows curl fallback |
OP-4 |
245,465 chars | 245,453 chars | Near-perfect match on large docs |
Implication for agents: Raw track provides ground truth measurements;
interpreted track reflects UI-rendered subset