Agent Ecosystem Testing

Cursor-Interpreted vs Raw


Track Design

Interpreted track captures what the agent believes it retrieved: how much content it saw, whether the fetch was complete, how it characterizes truncation. This is the agent’s self-report.

Raw track captures what Cursor actually saved to disk: exact byte counts, hexdump analysis, MD5 checksums, and token counts. These are filesystem measurements, cryptographic hashes, and not agent estimates.

The gap between the two tracks is a finding. If Cursor reports “content complete” in prose, but the raw data shows truncation, that discrepancy belongs in the spec.

  Interpreted Raw
Measures Agentic retrieval interpretation Filesystem measurements
of saved output
Character Counts Agent estimates, vary 2-3× across sessions on small files wc -c on disk - exact,
reproducible
Completeness Agentic truncation assessment
in prose
MD5 comparison, hexdump analysis, fence counting
Token
Counts
Agent estimates,
~4 chars/token assumption
OpenAI encoding with tiktoken
Reproducibility High variance on small docs, 1.9KB→5.6KB same URL Perfect reproducibility,
same URL = same MD5
Output
Format
Chat UI Markdown rendering Raw file on disk, raw_output_{test_id}.txt
Best For Understanding agent
perception gaps
Citable measurements
for the spec

Key Observations

  1. Reproducibility in Raw vs High Variance in Interpreted

    Raw: Same URL produces identical output

    • BL-1: MD5 d6ad8451d3778bf3544574431203a3a7 across 2 runs
    • OP-4/BL-3: MD5 554eb56e8416d86d12af17a2dfe6f815 across 3 runs
    • Character-for-character identical output on disk

    Interpreted: Same URL produces 2-3× variance on small files

    • BL-1 r1: 1,953 chars → r2: 5,595 chars → r3: 4,100 chars, 2.9× variance
    • BL-2 r1: 1,953 chars → r2: 4,200 chars → r3: 4,350 chars, 2.2× variance

    Conclusion: the variance is in how Cursor displays content in chat UI, not what it fetches. Raw measurements prove the underlying fetch is deterministic, but that interpreted track shows UI rendering isn’t.

  2. Size-Dependent Consistency

    Small file rendering appears unreliable, while larger ones seem stable:

    Interpreted:

    • Small files, 20-87 KB: high session-to-session variance
    • Large files, 245 KB: <1% variance, nearly identical across runs

    Raw:

    • Small files, 4.8 KB output: identical MD5s despite variance in interpreted display
    • Large files, 245 KB output: identical MD5s, consistent across runs

    Conclusion: Cursor fetches consistently at all sizes. The interpreted variance on small files is possibly a UI rendering artifact, not entirely reflective of fetch behavior.

  3. Perception Gap: Model Self-Report is Unreliable

    Agent claims “complete” or “no truncation” when content is a filtered subset:

    Test Raw Interpreted Gap
    SC 3 38 KB
    truncated
    at ref #14/252
    “Complete
    reference
    section”
    Agent interprets filtered
    list as complete
    BL
    1
    4,817 B
    calculated
    1,953 chars displayed UI shows subset, agent
    reports what it sees
    SC 4 Truncated mid-word at “updated” “All syntax
    sections
    present”
    Clean structure masks incompleteness

    Conclusion: trust character counts, not prose assertions; agent perceives filtered excerpts as complete because they’re internally coherent.

  4. Method-Specific Truncation Limits - Raw

    • WebFetch, MCP-style: ~28 KB ceiling, SC-4 truncated at 27,890 chars
    • urllib.request: ~72 KB ceiling, EC-6 truncated at 72,600 chars
    • curl fallback: No ceiling detected, SC-2 returned 17.6 MB
    • Unknown Path: No ceiling detected, OP-4/BL-3 returned 245 KB

    Conclusion: Cursor routes to many mechanisms with different limits. The interpreted track didn’t identify this because the agent’s self-report didn’t consistently include its toolchain.

  5. Intelligent Content Filtering

    Cursor performs structure-aware filtering, but the raw track provided the measurements:

    Interpreted: agent reports receiving “main content” but missing footer/navigation

    Raw: proves it via byte counts

    • BL-1: 85 KB HTML → 4.8KB Markdown, 94% reduction, CSS/navigation stripped
    • SC-3: 252 references → deterministically selects #14, the first commercial source
    • SC-4: 30 KB page → 28KB, footer/metadata filtered

    Conclusion: Cursor applies content heuristics, not blind truncation. Raw track quantifies what interpreted track observes qualitatively.

  6. Chars/Token Ratio as Content-Type Classifier - Raw

    • EC-3: JSON: 2.62 chars/token
    • SC-2: Raw HTML/JS: 2.65 chars/token
    • SC-3: Tables: 3.06 chars/token
    • SC-4, OP-4: Docs + code: 3.91-4.37 chars/token
    • BL-1, BL-2, SC-1: Clean Markdown: 4.13-4.36 chars/token

    Conclusion: Chars/token ratio enables content-type classification without parsing. <3.0 = code/markup, >4.0 = prose. Useful for automated analysis pipelines. The interpreted track had no visibility into this pattern.

  7. Cross-Track Agreement on Redirect Handling

    Interpreted: agent received final destination JSON content

    Raw: confirmed 5-level redirect chain traversed - 1,021 bytes JSON saved

    Conclusion: redirect handling is robust across both measurement approaches


Implications for Agent Developers, Docs Teams

When evaluating or designing testing frameworks or workflows that include agentic web fetch behavior, consider what each approach can and can’t confirm:

Use Case Interpreted Raw
Size Limits
per Backend
✗ Model estimates only;
backend not identified
✓ Character ceilings per backend: WebFetch ~28 KB, urllib ~72 KB, unknown path 245 KB+
Content-type Detection ✗ No access to raw file ✓ Chars/token ratio classifies content type: <3.0 = code/markup, >4.0 = prose
Reproducibility Verification ✗ 2–3× variance on small files across sessions ✓ MD5 checksums confirm byte-identical output for regression testing
Ground Truth Baselines ✗ Self-report only ✓ Agent claims vs actually fetched
Model Perception Gaps ✓ Reveals when agents misreport completeness or characterize filtered excerpts as complete ✗ Verifier confirms file integrity
but not agent’s interpretation
UI Rendering Behavior ✓ Reflects how Cursor displays
content in chat
✗ Saved file diverges from
chat display
Session-dependent Variance ✓ Captures whether new
chat sessions affect output
✗ File output is deterministic;
session effects not visible
UX ✓ What end users see vs
what agents retrieve
✗ Raw file isn’t what
the user sees

Agentic self-reports are unreliable for detecting truncation or content subsetting, when building workflows include a raw track-like verification.


Architecture Comparison

Step Cursor
mid-generation
Claude API
mid-generation
Gemini API
pre-generation injection
Invocation User asks agent via chat, agent decides which agent/tool
to call
Claude decides when to fetch based on prompts and/or URL availability Gemini API attempts to
fetch each URL from internal
index cache
Routing Cursor routes to one of many backends: WebFetch MCP, urllib, curl Claude API retrieves
content
If not cached, falls back
to live fetch
Content Negotiation Sends Accept: text/markdown,... header; prefers Markdown if server
supports it
Unknown; not publicly documented Unknown; not
publicly documented
Content Return Markdown usually or
raw HTML
on timeout
Content comes back as a tool result in the response URL context tool injects
retrieved content into
context window
Generation Model generates response
from fetched content
Claude continues generation, interpreting the tool result gemini-2.5-flash generates
response from pre-loaded
content
Key Observation Backend selection opaque; different paths have different limits Tool result is visible in API response; truncation via max_content_tokens url_context_metadata separates retrieval status from generation; token accounting split between text, prompt_token_count and URLs, tool_use_prompt_token_count

Precision Comparison

Claude API’s web fetch has the cleanest measurement story as the tool results are first-class response fields, fully observable. Gemini neatly separates retrieval metadata. Cursor requires filesystem inspection for precise measurements, because agents deliver estimations by default.

Platform Character Counts Token Counts Reproducibility Metadata Visibility
Cursor
Raw
Exact Exact Perfect,
same MD5
Opaque
backend routing
Cursor-interpreted Agent
estimation
Agent
estimation
2-3× variance
on small files
No metadata
Claude
web fetch
Exact Exact Perfect, deterministic Full tool result
in API response
Gemini
URL Context
No direct access Exact <1% variance First-class