Cursor-Interpreted vs Raw
Track Design
Interpreted track captures what the agent believes it retrieved: how much content it saw, whether the fetch was complete, how it characterizes truncation. This is the agent’s self-report.
Raw track captures what Cursor actually saved to disk: exact byte counts, hexdump analysis, MD5 checksums, and token counts. These are filesystem measurements and cryptographic hashes, not agent estimates.
The gap between the two tracks is a finding. If Cursor reports “content complete” in prose, but the raw data shows truncation, that discrepancy belongs in the spec.
| | Interpreted | Raw |
|---|---|---|
| Measures | Agentic retrieval interpretation | Filesystem measurements of saved output |
| Character Counts | Agent estimates, vary 2-3× across sessions on small files | wc -c on disk: exact, reproducible |
| Completeness | Agentic truncation assessment in prose | MD5 comparison, hexdump analysis, fence counting |
| Token Counts | Agent estimates, ~4 chars/token assumption | OpenAI encoding with tiktoken |
| Reproducibility | High variance on small docs, 1.9 KB → 5.6 KB same URL | Perfect reproducibility, same URL = same MD5 |
| Output Format | Chat UI Markdown rendering | Raw file on disk, raw_output_{test_id}.txt |
| Best For | Understanding agent perception gaps | Citable measurements for the spec |
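The raw-track measurements in the table can be approximated with standard-library Python. The sketch below is a minimal illustration, not the project's actual verifier: the function name `measure_raw_output` and the returned keys are hypothetical, the odd-fence heuristic is an assumption about what "fence counting" checks, and token counting (which the raw track does with tiktoken) is left out to keep the example dependency-free.

```python
import hashlib
from pathlib import Path

# Markdown code-fence marker, built this way so the literal does not appear in the source.
FENCE = "`" * 3


def measure_raw_output(path):
    """Return basic raw-track measurements for a saved output file."""
    data = Path(path).read_bytes()
    text = data.decode("utf-8", errors="replace")
    fence_count = text.count(FENCE)
    return {
        "bytes": len(data),                    # same number `wc -c` reports
        "chars": len(text),                    # differs from bytes for non-ASCII content
        "md5": hashlib.md5(data).hexdigest(),  # compare this digest across runs
        "fence_count": fence_count,
        # Assumption: an odd number of fence markers suggests the file
        # was cut off inside a code block.
        "unbalanced_fences": fence_count % 2 == 1,
    }
```

Running this on the same saved file after each fetch and comparing the `md5` values gives a quick byte-identity check that does not depend on anything the agent says in chat.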
Key Observations
- Reproducibility in Raw vs High Variance in Interpreted
Raw: Same URL produces identical output
- BL-1: MD5 d6ad8451d3778bf3544574431203a3a7 across 2 runs
- OP-4/BL-3: MD5 554eb56e8416d86d12af17a2dfe6f815 across 3 runs
- Character-for-character identical output on disk
Interpreted: Same URL produces 2-3× variance on small files
- BL-1 r1: 1,953 chars → r2: 5,595 chars → r3: 4,100 chars, 2.9× variance
- BL-2 r1: 1,953 chars → r2: 4,200 chars → r3: 4,350 chars, 2.2× variance
Conclusion: the variance is in how Cursor displays content in the chat UI, not in what it fetches. Raw measurements show the underlying fetch is deterministic, while the interpreted track shows that UI rendering isn’t.
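Both checks in this observation reduce to a few lines of Python. The sketch below assumes that the "variance" figures are the largest reported count divided by the smallest, which matches the numbers quoted above; the function names are hypothetical.

```python
def is_reproducible(md5_digests):
    """True when every run produced the same digest (raw track)."""
    return len(md5_digests) > 0 and len(set(md5_digests)) == 1


def variance_ratio(char_counts):
    """Largest reported count divided by the smallest (interpreted track)."""
    return max(char_counts) / min(char_counts)


# Values quoted in the observation above.
bl1_raw = ["d6ad8451d3778bf3544574431203a3a7"] * 2
bl1_interpreted = [1953, 5595, 4100]
bl2_interpreted = [1953, 4200, 4350]

print(is_reproducible(bl1_raw))                    # True
print(round(variance_ratio(bl1_interpreted), 1))   # 2.9
print(round(variance_ratio(bl2_interpreted), 1))   # 2.2
```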
- Size-Dependent Consistency
Rendering of small files appears unreliable, while rendering of larger files seems stable:
Interpreted:
- Small files, 20-87 KB: high session-to-session variance
- Large files, 245 KB: <1% variance, nearly identical across runs
Raw:
- Small files, 4.8 KB output: identical MD5s despite variance in interpreted display
- Large files, 245 KB output: identical MD5s, consistent across runs
Conclusion: Cursor fetches consistently at all sizes. The interpreted variance on small files is possibly a UI rendering artifact and does not fully reflect fetch behavior.
- Perception Gap: Model Self-Report is Unreliable
Agent claims “complete” or “no truncation” when content is a filtered subset:
| Test | Raw | Interpreted | Gap |
|---|---|---|---|
| SC-3 | 38 KB, truncated at ref #14/252 | “Complete reference section” | Agent interprets filtered list as complete |
| BL-1 | 4,817 B calculated | 1,953 chars displayed | UI shows subset, agent reports what it sees |
| SC-4 | Truncated mid-word at “updated” | “All syntax sections present” | Clean structure masks incompleteness |
Conclusion: trust character counts, not prose assertions; agent perceives filtered excerpts as complete because they’re internally coherent.
- Method-Specific Truncation Limits - Raw
- WebFetch, MCP-style: ~28 KB ceiling, SC-4 truncated at 27,890 chars
- urllib.request: ~72 KB ceiling, EC-6 truncated at 72,600 chars
- curl fallback: No ceiling detected, SC-2 returned 17.6 MB
- Unknown Path: No ceiling detected, OP-4/BL-3 returned 245 KB
Conclusion: Cursor routes requests to several fetch mechanisms, each with a different limit. The interpreted track didn’t identify this because the agent’s self-report didn’t consistently name the tools it used.
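Because the backend is not reported, one rough way to flag a likely ceiling hit is to compare the saved output size against the ceilings observed above. This is a hedged sketch: the names `OBSERVED_CEILINGS` and `guess_ceiling` are hypothetical, the 5% tolerance is an arbitrary illustrative choice, and a size near a ceiling is only a hint about the route taken, not proof of it.

```python
# Approximate ceilings from the raw-track observations, in characters.
OBSERVED_CEILINGS = {
    "WebFetch (MCP-style)": 28_000,
    "urllib.request": 72_000,
}


def guess_ceiling(char_count, tolerance=0.05):
    """Return the name of a ceiling the output size falls close to, or None."""
    for name, ceiling in OBSERVED_CEILINGS.items():
        # The tolerance is an assumption, not a measured value.
        if abs(char_count - ceiling) <= ceiling * tolerance:
            return name
    return None


print(guess_ceiling(27_890))   # WebFetch (MCP-style)
print(guess_ceiling(72_600))   # urllib.request
print(guess_ceiling(245_000))  # None
```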
- Intelligent Content Filtering
Both tracks indicate that Cursor performs structure-aware filtering, but only the raw track provides measurements:
Interpreted: agent reports receiving “main content” but missing footer/navigation
Raw: proves it via byte counts
- BL-1: 85 KB HTML → 4.8 KB Markdown, 94% reduction, CSS/navigation stripped
- SC-3: 252 references → deterministically selects #14, the first commercial source
- SC-4: 30 KB page → 28 KB, footer/metadata filtered
Conclusion: Cursor applies content heuristics, not blind truncation. Raw track quantifies what interpreted track observes qualitatively.
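The reduction figures follow directly from the source and output sizes. A small worked example, using a hypothetical helper name and the sizes quoted above:

```python
def reduction_pct(source_size, output_size):
    """Percentage of the source that was removed before saving."""
    return (1 - output_size / source_size) * 100


print(round(reduction_pct(85, 4.8)))  # 94  (BL-1: 85 KB HTML to 4.8 KB Markdown)
print(round(reduction_pct(30, 28)))   # 7   (SC-4: 30 KB page to 28 KB)
```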
- Chars/Token Ratio as Content-Type Classifier - Raw
- EC-3: JSON: 2.62 chars/token
- SC-2: Raw HTML/JS: 2.65 chars/token
- SC-3: Tables: 3.06 chars/token
- SC-4, OP-4: Docs + code: 3.91-4.37 chars/token
- BL-1, BL-2, SC-1: Clean Markdown: 4.13-4.36 chars/token
Conclusion: Chars/token ratio enables content-type classification without parsing. <3.0 = code/markup, >4.0 = prose. Useful for automated analysis pipelines. The interpreted track had no visibility into this pattern.
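The thresholds above translate into a very small classifier. In this sketch the function name and the "mixed" label for ratios between 3.0 and 4.0 are assumptions; the source only defines the two outer bands. The token count is taken as an input so the example stays dependency-free (the raw track obtains it with tiktoken).

```python
def classify_by_ratio(char_count, token_count):
    """Classify content from its chars/token ratio without parsing it."""
    if token_count <= 0:
        raise ValueError("token_count must be positive")
    ratio = char_count / token_count
    if ratio < 3.0:
        label = "code/markup"
    elif ratio > 4.0:
        label = "prose"
    else:
        # Assumption: the source does not name the band between 3.0 and 4.0.
        label = "mixed"
    return ratio, label


print(classify_by_ratio(2620, 1000)[1])  # code/markup (JSON-like ratio of 2.62)
print(classify_by_ratio(4130, 1000)[1])  # prose (clean Markdown ratio of 4.13)
print(classify_by_ratio(3060, 1000)[1])  # mixed (table-like ratio of 3.06)
```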
- Cross-Track Agreement on Redirect Handling
Interpreted: agent received final destination JSON content
Raw: confirmed that the 5-level redirect chain was traversed; 1,021 bytes of JSON saved
Conclusion: redirect handling is robust across both measurement approaches
Implications for Agent Developers and Docs Teams
When evaluating or designing testing frameworks or workflows that include agentic web fetch behavior, consider what each approach can and can’t confirm:
| Use Case | Interpreted | Raw |
|---|---|---|
| Size Limits per Backend | ✗ Model estimates only; backend not identified | ✓ Character ceilings per backend: WebFetch ~28 KB, urllib ~72 KB, unknown path 245 KB+ |
| Content-type Detection | ✗ No access to raw file | ✓ Chars/token ratio classifies content type: <3.0 = code/markup, >4.0 = prose |
| Reproducibility Verification | ✗ 2–3× variance on small files across sessions | ✓ MD5 checksums confirm byte-identical output for regression testing |
| Ground Truth Baselines | ✗ Self-report only | ✓ Agent claims vs actually fetched |
| Model Perception Gaps | ✓ Reveals when agents misreport completeness or characterize filtered excerpts as complete | ✗ Verifier confirms file integrity but not agent’s interpretation |
| UI Rendering Behavior | ✓ Reflects how Cursor displays content in chat | ✗ Saved file diverges from chat display |
| Session-dependent Variance | ✓ Captures whether new chat sessions affect output | ✗ File output is deterministic; session effects not visible |
| UX | ✓ What end users see vs what agents retrieve | ✗ Raw file isn’t what the user sees |
Agentic self-reports are unreliable for detecting truncation or content subsetting. When building workflows, include a verification step modeled on the raw track.
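One possible shape for such a verification step is a cross-check between what the agent reports and what is on disk. This is a sketch under stated assumptions: the function name `cross_check`, the 10% tolerance, and the returned keys are all hypothetical, and the completeness claim passed in below is an illustrative input rather than a recorded result.

```python
def cross_check(agent_says_complete, displayed_chars, raw_size, tolerance=0.10):
    """Flag cases where a completeness claim disagrees with the raw file size.

    Sizes are treated as comparable even though one is a character count and
    the other may be bytes; that is only reasonable for mostly-ASCII content.
    """
    if raw_size <= 0:
        raise ValueError("raw_size must be positive")
    shortfall = 1 - displayed_chars / raw_size
    size_mismatch = abs(shortfall) > tolerance
    return {
        "shortfall_pct": round(shortfall * 100, 1),
        "size_mismatch": size_mismatch,
        "needs_review": bool(agent_says_complete and size_mismatch),
    }


# Sizes are the BL-1 figures quoted earlier (1,953 chars displayed, 4,817 B on disk).
result = cross_check(True, displayed_chars=1953, raw_size=4817)
print(result["needs_review"])  # True
```

Keeping the tolerance as a parameter lets a team tighten or loosen the check per content type instead of hard-coding one threshold.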
Architecture Comparison
| Step | Cursor mid-generation | Claude API mid-generation | Gemini API pre-generation injection |
|---|---|---|---|
| Invocation | User asks agent via chat, agent decides which agent/tool to call | Claude decides when to fetch based on prompts and/or URL availability | Gemini API attempts to fetch each URL from internal index cache |
| Routing | Cursor routes to one of many backends: WebFetch MCP, urllib, curl | Claude API retrieves content | If not cached, falls back to live fetch |
| Content Negotiation | Sends Accept: text/markdown,... header; prefers Markdown if server supports it | Unknown; not publicly documented | Unknown; not publicly documented |
| Content Return | Markdown usually or raw HTML on timeout | Content comes back as a tool result in the response | URL context tool injects retrieved content into context window |
| Generation | Model generates response from fetched content | Claude continues generation, interpreting the tool result | gemini-2.5-flash generates response from pre-loaded content |
| Key Observation | Backend selection opaque; different paths have different limits | Tool result is visible in API response; truncation via max_content_tokens | url_context_metadata separates retrieval status from generation; token accounting split between text (prompt_token_count) and URLs (tool_use_prompt_token_count) |
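The content-negotiation step in the Cursor column can be illustrated with `urllib.request`. This is only a sketch of the general technique: the table elides the full header Cursor sends, so the `Accept` value below is a hypothetical example, and `build_markdown_request` is an illustrative name. The request is constructed but not sent.

```python
import urllib.request


def build_markdown_request(url):
    """Build a request that asks for Markdown first and HTML as a fallback."""
    # Hypothetical header value: the full header Cursor sends is not shown in the source.
    headers = {"Accept": "text/markdown, text/html;q=0.8"}
    return urllib.request.Request(url, headers=headers)


req = build_markdown_request("https://example.com/docs")
print(req.get_header("Accept"))
```

A server that supports Markdown responses can honor the first media type; otherwise it would typically fall back to HTML.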
Precision Comparison
Claude API’s web fetch has the cleanest measurement story because tool results are first-class, fully observable response fields. Gemini cleanly separates retrieval metadata from generation. Cursor requires filesystem inspection for precise measurements because the agent reports estimates by default.
| Platform | Character Counts | Token Counts | Reproducibility | Metadata Visibility |
|---|---|---|---|---|
| Cursor Raw | Exact | Exact | Perfect, same MD5 | Opaque backend routing |
| Cursor-interpreted | Agent estimation | Agent estimation | 2-3× variance on small files | No metadata |
| Claude web fetch | Exact | Exact | Perfect, deterministic | Full tool result in API response |
| Gemini URL Context | No direct access | Exact | <1% variance | First-class |