Agent Ecosystem Testing

Methodology


Turn-by-turn

Chat-based measurement through interaction, without direct code instrumentation

The Copilot testing framework shares its foundational approach with the Cursor testing framework: both are intentionally not automated and rely on prompt-based inference through a chat interface, with no programmatic access to the underlying mechanisms. Unlike Cursor, Copilot exposes no user-facing fetch syntax; web content retrieval happens entirely at the agent’s discretion, via undocumented backend tools.

Approach Comparison

Testing a closed consumer application vs an open API

Rather than targeting specific API endpoints with documented interfaces, Copilot testing targets a consumer application with proprietary chat behavior and undocumented structure. Copilot’s web fetch implementation has no public API; the backend tool, observed in runs as fetch_webpage, is agent-selected, not user-invocable, and surfaces only through tool logs. The table below compares this with Claude API Web Fetch testing:

| Aspect | Claude API | Copilot |
| --- | --- | --- |
| Interface | Python API call, response object available | Chat interface, observable only through output |
| Layers | Single: URL → fetch → return | Two: URL → fetch_webpage output, then model interprets |
| Instrumental Access | Full: can inspect ToolResult.content directly | Partial: can only read model’s output; raw tool response not surfaced |
| Repeatability | High: same URL yields identical API response | Low: model routing varies per run; character counts inconsistent across identical prompts |
| Fetch Mechanisms | One web fetch tool | fetch_webpage and/or curl, but invocation is agent-decided, undocumented |
| Best Findings | Hard limits: Claude API truncates at ~100KB | Comparative limits: does fetch_webpage have a response size ceiling? Does it vary by model? |

Results logged as method: vscode-chat describe the user-facing interface only. The backend mechanism fetch_webpage isn’t guaranteed to be invoked on any given run; see the Friction Note for analysis.
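As a minimal sketch of what one logged run might look like, the record below keeps the user-facing interface (method) separate from whatever backend tool surfaced in the tool log. Only the method value vscode-chat and the tool name fetch_webpage come from the runs described here; the RunRecord class, its other field names, and the JSON-lines layout are assumptions for illustration.

```python
import json
from dataclasses import dataclass, asdict
from typing import Optional


@dataclass
class RunRecord:
    """One manually logged chat run (hypothetical schema)."""
    url: str
    model: str                           # model name shown in the chat UI for this run
    method: str = "vscode-chat"          # user-facing interface, not the backend tool
    tool_observed: Optional[str] = None  # e.g. "fetch_webpage"; None if no tool surfaced
    char_count: Optional[int] = None     # measured length of the returned content


def to_json_line(record: RunRecord) -> str:
    """Serialize a record as one JSON line so runs can be appended to a log file."""
    return json.dumps(asdict(record), sort_keys=True)
```

Leaving tool_observed as None when nothing appears in the tool log keeps "tool not invoked" distinguishable from "tool invoked but unnamed", which matters because invocation is agent-decided.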


Track Design

| | Copilot-interpreted | Raw |
| --- | --- | --- |
| Question | What does Copilot report back? Does it accurately perceive truncation? Are there systematic estimation errors? | What does fetch_webpage actually return? Where exactly does truncation occur? Is the boundary consistent? |
| Method | Prompt asks Copilot to fetch URL, report measurements | Prompt asks Copilot to fetch URL and return output verbatim; verification script extracts measurements |
| Captures | Copilot and agent’s interpretation of truncation, completeness | Response content from fetch_webpage, post-processing, exact character boundaries |
| Measurements | Agent estimates: “appears truncated,” “approximately X chars,” “Markdown seems complete” | Character count via len(), token count via tiktoken, exact truncation point, last 50 characters |
| Repeatability | Low - varies between runs due to model routing variance and Auto model switching | Medium - same URL should yield consistent fetch_webpage output, pending model routing stability |
| Best For | Understanding DX, surfacing perception gaps, documenting Auto routing behavior | Citable baseline measurements for Agent-Friendly Docs Spec |

Limitations: results vary between runs due to Auto model routing; Claude Haiku 4.5 and Raptor mini (Preview) produced significantly different character counts with identical prompts; there is no API field to inspect programmatically; fetch_webpage has no documented size limit, invocation conditions, or output format specification.
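The raw-track measurements listed above (character count via len(), token count via tiktoken, truncation point, last 50 characters) could be extracted roughly as follows. This is a sketch, not the framework's actual verification script: the function names, the cl100k_base encoding choice, and the idea of locating the truncation point by comparing against a separately obtained reference copy of the page are all assumptions.

```python
from typing import Optional


def count_tokens(text: str, encoding_name: str = "cl100k_base") -> Optional[int]:
    """Token count via tiktoken; None if tiktoken is missing or encoding fails."""
    try:
        import tiktoken  # third-party; optional in this sketch
        return len(tiktoken.get_encoding(encoding_name).encode(text))
    except Exception:
        return None


def common_prefix_length(a: str, b: str) -> int:
    """Number of leading characters that a and b share."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n


def measure(output: str, reference: Optional[str] = None) -> dict:
    """Measure verbatim output pasted from the chat (hypothetical helper)."""
    result = {
        "char_count": len(output),
        "token_count": count_tokens(output),
        "last_50_chars": output[-50:],
        "truncation_point": None,
    }
    # Assumption: a truncation point is only reported when the output is a
    # strict prefix of the reference, i.e. it was cut off rather than altered.
    if reference is not None:
        shared = common_prefix_length(output, reference)
        if shared == len(output) and len(output) < len(reference):
            result["truncation_point"] = shared
    return result
```

If the output diverges from the reference before its end (for example, because post-processing rewrote the content), truncation_point stays None, so a reported value always means a clean cut at that character offset.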


Copilot-Specific Unknowns

| Question | Details | Approach | Value |
| --- | --- | --- | --- |
| Undocumented Fetch Mechanism | fetch_webpage surfaces in tool logs but has no public docs; Copilot docs don’t describe any web fetch tool by name | Observe tool name reported per run; compare raw vs interpreted outputs | Establishes whether fetch_webpage is stable enough to treat as a consistent mechanism across runs |
| Auto Model Routing | Copilot’s Auto setting has selected both Claude Haiku 4.5 and Raptor mini (Preview) on identical prompts in the same session | Log model per run; analyze character count variance by model | Isolates whether truncation ceiling is a property of fetch_webpage or of the model processing its output |
| Response Size Ceiling | Observed outputs range from ~3,200 to ~24,500 characters on the same URL across runs; no documented limit exists | Compare both track measurements across test suite | Determines if there is a consistent ceiling, whether it varies by model or by fetch attempt |
| Local Code Execution Substitution | Copilot autonomously uses pylanceRunCodeSnippet via Pylance MCP server when workspace scripts are present, even with explicit prompt | Log all runs where substitution attempted; test with framework scripts removed from workspace | Determines whether workspace context is the trigger; shapes environment requirements for clean test runs |
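The "log model per run; analyze character count variance by model" approach can be sketched as a simple grouping step. The function name and the (model, char_count) input shape are assumptions; the numbers in the usage example are invented placeholders, not measurements from any run.

```python
from collections import defaultdict
from statistics import mean


def char_count_spread_by_model(runs):
    """Group (model, char_count) pairs by model and summarize each group."""
    by_model = defaultdict(list)
    for model, char_count in runs:
        by_model[model].append(char_count)
    return {
        model: {
            "runs": len(counts),
            "min": min(counts),
            "max": max(counts),
            "mean": mean(counts),
            "spread": max(counts) - min(counts),
        }
        for model, counts in by_model.items()
    }


# Placeholder values for illustration only.
example_runs = [("model-a", 1000), ("model-a", 1200), ("model-b", 5000)]
summary = char_count_spread_by_model(example_runs)
```

A small spread within each model but a large gap between models would point toward the model's handling of the output; a large spread within a single model would point toward the fetch step itself or per-attempt variation.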