Methodology

Turn-by-turn

Chat-based measurement through interaction, without direct code instrumentation

Software instrumentation is the process of adding code to a system to collect data about how it works. The Cursor chat is public and accessible, but testing it is different from calling an API to extract measurements programmatically.

Approach Comparison

Testing a closed consumer application vs an open API

Rather than targeting specific endpoints with documented interfaces, Cursor testing targets a consumer application with proprietary chat behavior and multiple fetch mechanisms. Cursor’s chat web fetch and MCP implementations don’t have a public API. MCP servers are user-configured and their implementations vary (`mcp-server-fetch`, `fetch-browser-mcp`, third-party); they are observable through Cursor’s agent behavior, but not instrumentable. Compare this to the collection’s Claude API Web Fetch testing:

Aspect	Claude API	Cursor
Interface	Python API call, response object available	Chat UI: observable only through output
Layers	Single: URL → fetch → return	Two: URL → fetch → `@Web`* output, then agent interprets
Instrumentation Access	Full: can inspect `ToolResult.content` directly	Partial: can only read agent output or manually copy `@Web` result
Repeatability	High: same URL yields identical API response	Medium: LLM interpretation varies, but `@Web` raw content should be stable
Fetch Mechanisms	One web fetch tool	Multiple: `@Web`, `mcp-server-fetch`, `fetch-browser-mcp`, third party
Best Findings	Hard limits: Claude API truncates at ~100 KB	Comparative limits: does MCP override `@Web`? Does the agent auto-chunk?

*Results logged as “Methods tested: @Web” reflect the user-facing syntax used in the prompt. However, post-analysis revealed that testing misused @Web as a fetch command rather than as a context attachment mechanism. The backend mechanisms WebFetch and mcp_web_fetch were possibly invoked autonomously by Cursor regardless of the @Web syntax; see the Friction Note for analysis.

Track Design

	Interpreted	Raw
Question	What does Cursor report back? Does it accurately perceive truncation? Are there systematic estimation errors?*	What actually came through the `@Web` command? Where exactly does truncation occur? Is the boundary consistent?
Method	Chat prompt asks `@Web` to fetch URL and report measurements	Chat prompt asks `@Web` to fetch URL and return output verbatim, human manually extracts measurements
Captures	Cursor and underlying LLM’s interpretation of truncation, completeness	Actual response content from `@Web` command, post-processing, exact character boundaries
Measurements	LLM estimates: “appears truncated,” “approximately X KB,” “markdown seems complete”	Manual: character count via `len()`, token count via `tiktoken`, exact truncation point, last 50 characters
Repeatability	Varies between runs	Reproducible: same URL fetched multiple times yields consistent content
Best For	Understanding DX, identifying perception gaps	Citable baseline measurements for Agent-Friendly Docs Spec

Approach limitations: general variation between runs; can’t programmatically inspect a surfaced API field; variation expected between MCP server versions, IDE versions, and LLM selection; some URLs possibly gated
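The manual measurements on the Raw track can be scripted once the `@Web` output has been copied out of the chat. The sketch below is a minimal illustration, not the procedure actually used: the `RawMeasurement` and `measure` names are hypothetical, the `cl100k_base` encoding is an assumption (the source names `tiktoken` but no encoding), and the optional `reference` argument assumes a complete copy of the same converted content is available from some independent source.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class RawMeasurement:
    char_count: int                  # len() of the pasted output
    token_count: Optional[int]       # None unless token counting was requested
    tail: str                        # last 50 characters of the output
    truncation_index: Optional[int]  # first offset where output stops matching the reference


def common_prefix_len(a: str, b: str) -> int:
    """Return the number of leading characters shared by a and b."""
    limit = min(len(a), len(b))
    for i in range(limit):
        if a[i] != b[i]:
            return i
    return limit


def measure(raw: str, reference: Optional[str] = None,
            count_tokens: bool = False) -> RawMeasurement:
    """Measure a manually copied fetch result (hypothetical helper)."""
    token_count = None
    if count_tokens:
        # Third-party dependency; the encoding name is an assumption.
        import tiktoken
        encoding = tiktoken.get_encoding("cl100k_base")
        token_count = len(encoding.encode(raw))

    truncation_index = None
    if reference is not None:
        shared = common_prefix_len(raw, reference)
        # Only report a boundary when the reference continues past the output.
        if shared < len(reference):
            truncation_index = shared

    return RawMeasurement(
        char_count=len(raw),
        token_count=token_count,
        tail=raw[-50:],
        truncation_index=truncation_index,
    )
```

Running `measure` on the output of repeated fetches of the same URL and comparing `char_count` and `tail` across runs is one way to check whether the boundary is consistent.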

Cursor-Specific Unknowns

Question	Details	Approach	Value
Multiple Fetch Mechanisms	`@Web`: native, proprietary; `mcp-server-fetch`: configurable; `fetch-browser-mcp`: headless browser; third-party servers	Compare side-by-side on identical URLs	Determines if one mechanism has different limits; unique to Cursor, addresses ecosystem testing gap
HTML-to-Markdown Conversion Timing	*Does Cursor truncate before or after HTML→markdown conversion?*	`SC-1`-`SC-4` measure truncation relative to content structure	Pre-conversion: lose 40-50% of characters to HTML/CSS overhead; post-conversion: Markdown is smaller, but structure may break at the boundary
Agent Auto-chunking	*After truncation, does `@Web` automatically request next chunk or require manual request?*	`OP-4` agent retry pattern: observe unprompted follow-up fetches	Not well-explored in Claude API testing; key gap in ecosystem methodology, shapes DX with large docs
Model Variability	Cursor’s chat defaults to `Auto`; it additionally supports Claude `Opus` and `Sonnet`, `Gemini`, and `GPT-5`	Run tests with one LLM, tracked per run	Isolates fetch behavior from LLM inference variance; differences documented separately
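The side-by-side comparison and per-run tracking described in the table could be organized along the following lines. This is a sketch under stated assumptions, not the logging format used in the tests: the `Run` record, its fields, and both helper functions are hypothetical names introduced for illustration.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Run:
    """One logged test run (hypothetical record layout)."""
    url: str
    mechanism: str   # e.g. "@Web" or "mcp-server-fetch"
    model: str       # LLM selected for the run, tracked per run
    char_count: int  # measured length of the returned content


def side_by_side(runs):
    """Group runs as url -> mechanism -> sorted list of character counts."""
    table = defaultdict(lambda: defaultdict(list))
    for run in runs:
        table[run.url][run.mechanism].append(run.char_count)
    return {
        url: {mech: sorted(counts) for mech, counts in mechs.items()}
        for url, mechs in table.items()
    }


def differing_urls(runs):
    """Return URLs where mechanisms disagree on the largest returned length."""
    flagged = []
    for url, mechs in side_by_side(runs).items():
        maxima = {max(counts) for counts in mechs.values()}
        if len(mechs) > 1 and len(maxima) > 1:
            flagged.append(url)
    return sorted(flagged)
```

Keeping `model` on every record makes it possible to filter to a single LLM before comparing mechanisms, so that fetch behavior is not confused with inference variance.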