Methodology

Turn-by-turn

Chat-based measurement through interaction, without direct code instrumentation

The Copilot testing framework shares its foundational approach with the Cursor testing framework: intentionally not automated, prompt-based inference through a chat interface, with no programmatic access to the underlying mechanisms. Unlike Cursor, Copilot exposes no user-facing fetch syntax - web content retrieval happens entirely at the agent’s discretion, via undocumented backend tools.

Approach Comparison

Testing a closed consumer application vs an open API

Rather than targeting specific API endpoints with documented interfaces, Copilot testing targets a consumer application with proprietary chat behavior and undocumented structure. Copilot’s web fetch implementation has no public API; the backend tool, observed in runs as `fetch_webpage`, is agent-selected, not user-invocable, and surfaces only through tool logs. Compare this with Claude API Web Fetch testing:

| Aspect | Claude API | Copilot |
| --- | --- | --- |
| Interface | Python API call, response object available | Chat interface, observable only through output |
| Layers | Single: URL → fetch → return | Two: URL → `fetch_webpage` output, then model interprets |
| Instrumental Access | Full: can inspect `ToolResult.content` directly | Partial: can only read model’s output; raw tool response not surfaced |
| Repeatability | High: same URL yields identical API response | Low: model routing varies per run; character counts inconsistent across identical prompts |
| Fetch Mechanisms | One web fetch tool | `fetch_webpage` and/or `curl`; invocation is agent-decided and undocumented |
| Best Findings | Hard limits: Claude API truncates at ~100KB | Comparative limits: does `fetch_webpage` have a response size ceiling? Does it vary by model? |

Results logged as `method: vscode-chat` describe the user-facing interface only. The backend mechanism `fetch_webpage` isn’t guaranteed to be invoked on any given run; see the Friction Note for analysis.
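
Because `method: vscode-chat` names the interface rather than the backend tool, it can help to record the two separately. The following is a minimal sketch of a per-run log record, not the framework’s actual schema: `RunRecord`, `backend_confirmed`, and every field name other than `method` are hypothetical, and the URL is a placeholder.

```python
from dataclasses import dataclass, asdict
from typing import Optional
import json


@dataclass
class RunRecord:
    """One chat run. Every field name except `method` is a hypothetical choice."""
    url: str
    method: str = "vscode-chat"          # user-facing interface, as logged in the text
    model: Optional[str] = None          # model name shown for the run, if visible
    tool_observed: Optional[str] = None  # tool name seen in the tool log; None if absent


def backend_confirmed(record: RunRecord) -> bool:
    """True only when the tool log for this run actually showed fetch_webpage."""
    return record.tool_observed == "fetch_webpage"


# Placeholder run: the interface is known, the backend tool was not observed.
record = RunRecord(url="https://example.com/docs", model="Claude Haiku 4.5")
line = json.dumps(asdict(record))  # one JSON line per run
```

Keeping `tool_observed` as a separate nullable field means a run is never counted as a `fetch_webpage` run on the strength of the interface label alone.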

Track Design

| | Copilot-interpreted | Raw |
| --- | --- | --- |
| Question | What does Copilot report back? Does it accurately perceive truncation? Are there systematic estimation errors?* | What does `fetch_webpage` actually return? Where exactly does truncation occur? Is the boundary consistent? |
| Method | Prompt asks Copilot to fetch URL and report measurements | Prompt asks Copilot to fetch URL and return output verbatim; verification script extracts measurements |
| Captures | Copilot agent’s interpretation of truncation and completeness | Response content from `fetch_webpage`, post-processing, exact character boundaries |
| Measurements | Agent estimates: “appears truncated,” “approximately X chars,” “Markdown seems complete” | Character count via `len()`, token count via `tiktoken`, exact truncation point, last 50 characters |
| Repeatability | Low: varies between runs due to model routing variance and `Auto` model switching | Medium: same URL should yield consistent `fetch_webpage` output, pending model routing stability |
| Best For | Understanding DX, surfacing perception gaps, documenting `Auto` routing behavior | Citable baseline measurements for Agent-Friendly Docs Spec |
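
The raw-track measurements listed above can be sketched as follows. This is an illustration under stated assumptions, not the framework’s verification script: `measure` and `count_tokens` are hypothetical names, the `cl100k_base` encoding is an assumed choice the source does not specify, and the token count falls back to `None` when `tiktoken` is not installed or its encoding data cannot be loaded.

```python
from typing import Optional


def count_tokens(text: str, encoding_name: str = "cl100k_base") -> Optional[int]:
    """Token count via tiktoken, or None if tiktoken is unavailable.

    The encoding name is an assumption; the source does not say which one is used.
    """
    try:
        import tiktoken
        encoding = tiktoken.get_encoding(encoding_name)
        # disallowed_special=() treats special-token text as ordinary text.
        return len(encoding.encode(text, disallowed_special=()))
    except Exception:  # not installed, or encoding data could not be loaded
        return None


def measure(raw_output: str, tail_len: int = 50) -> dict:
    """Measurements for one verbatim output pasted from the chat."""
    return {
        "char_count": len(raw_output),
        "token_count": count_tokens(raw_output),
        "tail": raw_output[-tail_len:],  # last characters, to locate the cut-off
    }


sample = "a" * 120 + "END"
result = measure(sample)
```

Comparing the `tail` values from repeated runs on the same URL is one way to check whether the output ends at the same place each time.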

Limitations: results vary between runs due to `Auto` model routing (Claude Haiku 4.5 and Raptor mini (Preview) produced significantly different character counts with identical prompts); there is no API field to inspect programmatically; and `fetch_webpage` has no documented size limit, invocation conditions, or output format specification.
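
One way to quantify the perception gap between the two tracks is to compare the agent’s stated estimate with the measured character count for the same run. The sketch below assumes the estimate appears in free text as a number followed by “chars” or “characters”. `parse_reported_chars` and `estimation_error` are hypothetical helpers, and the sample figures are placeholders rather than observed results.

```python
import re
from typing import Optional


def parse_reported_chars(agent_text: str) -> Optional[int]:
    """Return the first 'N chars' / 'N characters' figure in the text, if any."""
    match = re.search(r"(\d[\d,]*)\s*(?:chars|characters)\b", agent_text, re.IGNORECASE)
    if match is None:
        return None
    return int(match.group(1).replace(",", ""))


def estimation_error(reported: int, measured: int) -> float:
    """Signed relative error; positive means the agent over-estimated."""
    if measured <= 0:
        raise ValueError("measured must be positive")
    return (reported - measured) / measured


# Placeholder values for illustration only.
reported = parse_reported_chars("The page is approximately 12,000 chars long.")
error = estimation_error(reported, measured=10_000)
```

A signed error keeps over- and under-estimates distinguishable, which matters when looking for systematic bias rather than noise.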

Copilot-Specific Unknowns

| Question | Details | Approach | Value |
| --- | --- | --- | --- |
| Undocumented Fetch Mechanism | `fetch_webpage` surfaces in tool logs but has no public docs; Copilot docs don’t describe any web fetch tool by name | Observe tool name reported per run; compare raw vs interpreted outputs | Establishes whether `fetch_webpage` is stable enough to treat as a consistent mechanism across runs |
| `Auto` Model Routing | Copilot’s `Auto` setting has selected both `Claude Haiku 4.5` and `Raptor mini (Preview)` on identical prompts in the same session | Log model per run; analyze character count variance by model | Isolates whether truncation ceiling is a property of `fetch_webpage` or of the model processing its output |
| Response Size Ceiling | Observed outputs range from ~3,200 to ~24,500 characters on the same URL across runs; no documented limit exists | Compare both track measurements across test suite | Determines whether there is a consistent ceiling and whether it varies by model or by fetch attempt |
| Local Code Execution Substitution | Copilot autonomously uses `pylanceRunCodeSnippet` via Pylance MCP server when workspace scripts are present, even with an explicit prompt | Log all runs where substitution was attempted; test with framework scripts removed from workspace | Determines whether workspace context is the trigger; shapes environment requirements for clean test runs |
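
The “log model per run; analyze character count variance by model” approach could be sketched as a simple grouping step. This assumes each logged run is a dictionary with `model` and `char_count` keys, which is a hypothetical layout. The model names and counts below are synthetic placeholders, not measurements from the test suite.

```python
from collections import defaultdict
from statistics import mean


def summarize_by_model(runs):
    """Group character counts by model and report their range per model."""
    grouped = defaultdict(list)
    for run in runs:
        grouped[run["model"]].append(run["char_count"])
    return {
        model: {
            "n": len(counts),
            "min": min(counts),
            "max": max(counts),
            "mean": mean(counts),
            "spread": max(counts) - min(counts),
        }
        for model, counts in grouped.items()
    }


# Synthetic runs for the same URL; values are placeholders.
runs = [
    {"model": "model-a", "char_count": 1000},
    {"model": "model-a", "char_count": 1200},
    {"model": "model-b", "char_count": 5000},
]
summary = summarize_by_model(runs)
```

If the per-model spread is small while the gap between models is large, that would point toward the model rather than `fetch_webpage` as the source of the variance; a large spread within a single model would point the other way.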