Key Findings for Copilot’s Web Fetch Behavior, Copilot-interpreted
Test Workflow
- Run python web_content_retrieval_testing_framework.py --test {test ID} --track interpreted
- Review terminal output
- Copy the provided prompt asking the model to report on fetch results: character count, token estimate, truncation status, content completeness, Markdown formatting integrity, and tool visibility
- Open a new Copilot session in VS Code and paste the prompt into the chat window
- Skip any tool call prompts for local scripts or code execution
- Capture the model’s full text response and observations as the interpreted finding; any gap between the model’s self-report and actual fetch behavior is itself a finding
- Log structured metadata as described in framework-reference.md
- Ensure results are saved to /results/copilot-interpreted/results.csv*

*Results logged as “Methods tested: vscode-chat” reflect a manual process in which prompts are copy-pasted into the Copilot chat window. Copilot has no publicly documented backend web content retrieval mechanism; tool logs identified fetch_webpage; read Friction Note for analysis.
Platform Limit Summary
| Limit | Observed |
|---|---|
| Hard Character Limit | None detected: fetch_webpage returns relevance-ranked excerpts by design, not raw pages with a byte ceiling |
| Hard Token Limit | None detected as a fixed ceiling: avg 7,313 tokens across 55 runs; output varies by relevance ranking |
| Output Consistency | High variance; EC-3 JSON payload 651 chars to SC-3 Wikipedia ~150,000 chars; same URL and model can produce 2x difference |
| Content Selection Behavior | Relevance-ranked excerpting: tool returns semantically filtered chunks keyed to a query parameter, not sequential page content |
| Truncation Pattern | ... markers throughout output are retrieval-layer elision indicators, not byte-boundary cutoffs |
| Redirect Chains | Successfully follows: tested 5-level redirect chain in EC-3; User-Agent value internally truncated in returned JSON |
| Self-reported Completeness | Unreliable: model flags ... markers as truncation evidence but may misattribute the cause; the markers are a structural property of fetch_webpage, likely not the result of hitting a size limit |
| Model Routing | Unstable: Auto dispatches to at least 5 distinct models with no documented routing logic and no UI indication when switching occurs |
| Tool Substitution | Agent autonomously attempts local code execution (pylanceRunCodeSnippet, zsh) despite prompt guardrails |
Results Details
| Metric | Value |
|---|---|
| Model Selector | Auto |
| Models Observed | Claude Haiku 4.5, Claude Sonnet 4.6, GPT-5.3-Codex, GPT-5.4, Grok Code Fast 1, Raptor mini (Preview) |
| Total Tests | 55 |
| Distinct URLs | 11 |
| Input Size Range | ~2 KB–256 KB |
| Truncation Events | 54 / 55 |
| Average Output Size | 29,239 chars |
| Average Token Count | 7,313 tokens |
| Truncation Detection | Model assertion, verbatim last-50-chars, Markdown integrity, ... elision marker count |
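The two averages in the table imply roughly four characters per token (29,239 chars / 7,313 tokens ≈ 4.0). The sketch below shows a token estimate built on that ratio. The helper name `estimate_tokens` and the use of a single flat ratio are assumptions for illustration, not the framework's actual estimator.

```python
# Chars-per-token ratio implied by the reported averages (29,239 / 7,313).
# Assumption: one flat ratio; real tokenizers vary by model and content.
AVG_OUTPUT_CHARS = 29_239
AVG_TOKEN_COUNT = 7_313
CHARS_PER_TOKEN = AVG_OUTPUT_CHARS / AVG_TOKEN_COUNT


def estimate_tokens(char_count: int, chars_per_token: float = CHARS_PER_TOKEN) -> int:
    """Rough token estimate from a character count (hypothetical helper)."""
    return round(char_count / chars_per_token)


if __name__ == "__main__":
    print(f"implied chars per token: {CHARS_PER_TOKEN:.2f}")
    print(f"estimate for 150,000 chars: {estimate_tokens(150_000)} tokens")
```

An estimate like this is only useful for order-of-magnitude comparison across runs; it says nothing about how any particular model tokenizes the excerpted text.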
Cross-run Output Variance
The chart below plots output_chars for each run on a logarithmic y-axis, colored by
model_observed. Points are jittered slightly on the x-axis so overlapping runs remain visible.
Hover over any point to see test ID, model, and exact char count.
SC-3 returned 115K–150K chars, the highest output of any test by a wide margin. BL-3 shows the highest within-test variance: Claude Haiku 4.5 0.3x returned 87K and 42K while GPT-family runs on the same URL clustered at 15K–22K.
| Test | Category | r1 chars | r2 chars | r3 chars | r4 chars | r5 chars | Variance |
|---|---|---|---|---|---|---|---|
| BL-1 | Baseline 87 KB | 24,500 | timeout | 4,500 | 3,200 | 8,750 | * |
| BL-2 | Baseline 20 KB | 4,200 | 2,950 | 8,472 | 4,300 | 4,850 | 2.9x |
| BL-3 | Baseline 256 KB | 22,500 | 15,000 | 87,000 | 21,000 | 42,850 | 5.8x |
| SC-2 | Code blocks 82 KB | 13,847 | 8,000 | 12,500 | 13,250 | 16,900 | 2.1x |
| SC-3 | Wikipedia 102 KB | 130,000 | 150,000 | 125,000 | 120,000 | 115,000 | 1.3x |
| SC-4 | Markdown Guide 31 KB | 32,500 | 48,500 | 33,000 | 30,000 | 16,250 | 3.0x |
| EC-1 | Landing Page 102 KB | 14,000 | 14,000 | 14,800 | 7,200 | 6,400 | 2.3x |
| EC-3 | Redirect Chain 2 KB | 651 | 651 | 890 | 874 | 1,090 | 1.7x |
| EC-6 | Raw Markdown 61 KB | 60,000 | 60,000 | 40,000 | 40,000 | 40,000 | 1.5x |
| OP-4 | Auto-chunking 256 KB | 33,000 | 25,000 | 12,500 | 25,000 | 12,000 | 2.6x |
*Excluding the timeout gives BL-1 a variance of 7.7x, the highest of any test. Including a 0 or a timeout in the calculation would be meaningless, because it represents a failed run rather than a real retrieval result.
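The Variance column appears to be the ratio of the largest to the smallest output across a test's runs (for example, BL-3: 87,000 / 15,000 ≈ 5.8x). The sketch below computes that ratio while skipping failed runs. Reading the column as a max/min ratio is an assumption, and `variance_ratio` is a hypothetical helper, not part of the framework.

```python
from typing import Optional

# Per-run output sizes copied from the table above; None marks the BL-1 timeout.
RUNS: dict[str, list[Optional[int]]] = {
    "BL-1": [24_500, None, 4_500, 3_200, 8_750],
    "BL-3": [22_500, 15_000, 87_000, 21_000, 42_850],
    "SC-3": [130_000, 150_000, 125_000, 120_000, 115_000],
}


def variance_ratio(chars: list[Optional[int]]) -> Optional[float]:
    """Max/min ratio over successful runs; None if fewer than two succeeded."""
    ok = [c for c in chars if c]  # drops None and 0, i.e. failed runs
    if len(ok) < 2:
        return None
    return max(ok) / min(ok)


if __name__ == "__main__":
    for test_id, chars in RUNS.items():
        ratio = variance_ratio(chars)
        print(test_id, "n/a" if ratio is None else f"{ratio:.1f}x")
```

Dropping failed runs before taking the ratio avoids a division by zero and keeps a timeout from being counted as a real retrieval result.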
Truncation Analysis
| # | Finding | Tests | Observed | Conclusion |
|---|---|---|---|---|
| 1 | fetch_webpage performs relevance-ranked excerpting, not raw HTTP retrieval | All tests | Tool preamble visible across runs: “Here is some relevant context from the web page [URL]:” - output is semantically filtered chunks separated by ... markers, not a sequential page dump with a byte cutoff | fetch_webpage is an excerpt retrieval tool by design; character count variance across runs reflects relevance-ranking variance, not a size ceiling being hit at different points |
| 2 | No fixed character or token ceiling detected | SC-3, BL-3, EC-6 | SC-3 Wikipedia runs returned 115k–150k chars; BL-3 Claude Haiku run returned 87k chars; no run hit a clean hard cutoff boundary | If a ceiling exists, it’s high enough that no test has reached it; the practical constraint is the relevance model’s excerpt selection, not a byte limit |
| 3 | Output variance is high and model-dependent | BL-3, SC-4, OP-4 | BL-3 shows 5.8x variance across 5 runs; Claude Haiku 4.5 returned 87k chars in a single fetch with no self-diagnosis; GPT-family models returned 15,000–22,500 chars with 2 fetches and self-diagnosis | Model routing is an uncontrolled variable; runs of the same test with different model_observed values aren’t comparable |
| 4 | GPT-family and Claude-family models exhibit distinct fetch behaviors | BL-3, SC-3, SC-4, OP-4 | GPT-family: 2–4 fetch invocations per run, self-diagnoses first result as insufficient and re-fetches; Claude-family: 1 fetch invocation per run, no self-diagnosis or re-fetch, higher output size | Behavioral split is at the model-family level, not run-level noise; fetch invocation count and output size are confounded with model routing |
| 5 | Agent misidentifies fetch_webpage’s architectural excerpting as truncation | All tests | Models consistently flag ... markers and repeated sections as truncation evidence, but these are the tool’s own elision indicators from its relevance-ranking layer, not byte-boundary artifacts | H1-yes results confirm the full page wasn’t returned but can’t confirm a fixed character ceiling; the tool may not be capable of sequential full-page retrieval by design |
| 6 | Redirect chains followed transparently; structured JSON payloads partially truncated | EC-3 | 5-level redirect chain followed silently to /get; returned JSON structurally complete (args, headers, origin, URL present), but User-Agent value internally truncated with ... markers; trailing “Pretty-print” UI element confirms HTML DOM extraction, not raw HTTP response | fetch_webpage follows redirects without user awareness; even small structured payloads are subject to internal value truncation; tool retrieves rendered HTML, not the raw API response body |
| 7 | Landing and navigation pages return substantially less content than docs pages | EC-1 | Gemini API landing page consistently returned 6,400–14,800 chars against ~100 KB expected; agent noted page body is largely collapsed to navigation links with little dense prose for the relevance model to extract | Low retrieval rates reflect URL type, not a lower size ceiling; relevance-based extraction returns less content from nav pages because there’s less extractable prose |
| 8 | Tool substitution attempts persist despite explicit prompt guardrails | BL-1, BL-2, EC-3 | Agent attempted pylanceRunCodeSnippet and zsh shell commands across multiple tests despite prompts explicitly prohibiting local scripts; in one case agent asserted compliance while triggering the tool prompt | Prompt guardrails alone can’t prevent autonomous tool substitution; skipped attempts should be flagged as methodology deviations; the agent’s compliance evaluation doesn’t classify shell commands as “scripts” |
| 9 | fetch_webpage undocumented; tool parameters not consistently surfaced | All tests | Tool has no public docs; asking Copilot directly returns deflection; query parameter and urls array only surfaced in one SC-4 run with Claude Sonnet 4.6; most runs expose only tool name and preamble string | Tool behavior, size limits, and invocation conditions are opaque; results reflect observed tool output, not an API contract |
| 10 | H5 auto-chunking hypothesis not applicable to fetch_webpage | OP-4 | fetch_webpage returns relevance-ranked semantic excerpts; no sequential chunk boundary exists to paginate from; agent re-fetches are diagnostic retries on the same excerpted payload, not continuation requests | OP-4 hypothesis assumes sequential retrieval that fetch_webpage can’t perform; a different retrieval tool would be required to test H5 meaningfully |
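The excerpting signals described in findings 1 and 5 (the tool preamble and the ... elision markers) can be counted mechanically once a response has been pasted into a file or string. The sketch below is one way to do that under stated assumptions: `inspect_output` is a hypothetical helper, the marker pattern (three or more dots, or a single ellipsis character) is a guess at how markers appear, and the sample text is illustrative rather than real tool output.

```python
import re

# Preamble text as quoted in finding 1; the URL portion varies per fetch.
PREAMBLE_PREFIX = "Here is some relevant context from the web page"
# Assumption: elision markers are runs of three or more literal dots,
# or the single-character ellipsis.
ELISION = re.compile(r"\.{3,}|…")


def inspect_output(text: str) -> dict:
    """Summarize excerpting signals in pasted tool output (hypothetical helper)."""
    excerpts = [part.strip() for part in ELISION.split(text) if part.strip()]
    return {
        "has_preamble": PREAMBLE_PREFIX in text,
        "elision_markers": len(ELISION.findall(text)),
        "excerpt_count": len(excerpts),
        "last_50_chars": text[-50:],
    }


if __name__ == "__main__":
    # Illustrative text only, not captured from a real run.
    sample = (
        "Here is some relevant context from the web page [URL]:\n"
        "Intro text ... middle section ... closing text"
    )
    print(inspect_output(sample))
```

A marker count above zero indicates elided content between excerpts. On its own it does not indicate that a byte or token ceiling was reached.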
Model Routing Distribution
| Model | Runs Observed | Fetch Pattern | Avg Output, chars |
|---|---|---|---|
| Claude Haiku 4.5 | 4 | 1 invocation; no self-diagnosis; highest output ceiling | ~42,000 |
| Claude Sonnet 4.6 | 10 | 1–2 invocations; no self-diagnosis | ~15,000 |
| GPT-5.3-Codex | 30 | 2–4 invocations; self-diagnoses, re-fetches | ~25,000 |
| GPT-5.4 | 3 | 2 invocations; self-diagnoses | ~19,000 |
| Grok Code Fast 1 | 1 | 1 invocation | ~8,500 |
| Raptor mini | 6 | 1 invocation; lowest output of any model | ~4,500 |
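Because routing is uncontrolled, per-model averages like those above have to be derived after the fact by grouping logged runs on model_observed. The sketch below shows that grouping on the five BL-3 runs. The `test_id` column name and the `average_by_model` helper are assumptions, and the three non-Haiku runs are labeled only as "GPT-family" because the exact model per run is not stated in this section.

```python
import csv
import io
from collections import defaultdict
from statistics import mean

# BL-3 runs from the variance table. Per the prose, the 87,000 and 42,850
# runs were Claude Haiku 4.5; the others are labeled "GPT-family" because
# the specific model for each run is not given here.
SAMPLE_CSV = """\
test_id,model_observed,output_chars
BL-3,GPT-family,22500
BL-3,GPT-family,15000
BL-3,Claude Haiku 4.5,87000
BL-3,GPT-family,21000
BL-3,Claude Haiku 4.5,42850
"""


def average_by_model(csv_text: str) -> dict[str, float]:
    """Mean output_chars per model_observed value (hypothetical helper)."""
    sizes = defaultdict(list)
    for row in csv.DictReader(io.StringIO(csv_text)):
        sizes[row["model_observed"]].append(int(row["output_chars"]))
    return {model: mean(values) for model, values in sizes.items()}


if __name__ == "__main__":
    for model, avg in average_by_model(SAMPLE_CSV).items():
        print(f"{model}: {avg:,.0f} chars")
```

Grouping within a single test ID, as here, keeps page size constant so that differences between groups reflect the model rather than the URL.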
Perception Gap
Can’t validate fetch_webpage output against expected page size alone; the tool’s relevance-ranked excerpting means character count reflects content selection, not a size ceiling. The model’s truncation self-report is consistently correct in identifying incomplete content, but wrong about the cause.
| Test | Expected | Returned | Retrieval Rate | Agent’s Characterization |
|---|---|---|---|---|
| SC-3 Wikipedia | ~102 KB | 115,000–150,000 chars | ~113–147% of chars* | “Truncated - repeated ... markers and section stitching” |
| BL-3 Tutorial | ~256 KB | 15,000–87,000 chars | 6–34% | “Truncated - condensed/excerpted extraction” |
| EC-1 Gemini Landing | ~100 KB | 6,400–14,800 chars | 6–15% | “Truncated - curated retrieval summary” |
| EC-6 SPEC.md | ~61 KB | 40,000–60,000 chars | 65–98% | “Truncated - structurally transformed, not raw file” |
| EC-3 Redirect/JSON | ~2 KB | 651–1,090 chars | 32–53% | “Truncated - User-Agent value internally cut” |
*SC-3 apparent over-retrieval reflects Wikipedia’s actual page size exceeding the ~102 KB input_est_chars estimate, not a measurement error.
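The Retrieval Rate column can be reproduced as returned characters divided by the expected size. The sketch below does this for the BL-3 and SC-3 rows. It assumes 1 KB is counted as 1,000 characters, and `retrieval_rate` is a hypothetical helper rather than the framework's own calculation.

```python
# Assumption: expected sizes in KB are converted at 1 KB = 1,000 characters.
CHARS_PER_KB = 1_000


def retrieval_rate(returned_chars: int, expected_kb: float) -> float:
    """Returned characters as a percentage of the expected page size."""
    return 100 * returned_chars / (expected_kb * CHARS_PER_KB)


if __name__ == "__main__":
    # (label, expected KB, min returned chars, max returned chars) from the table
    rows = [
        ("BL-3 Tutorial", 256, 15_000, 87_000),
        ("SC-3 Wikipedia", 102, 115_000, 150_000),
    ]
    for label, expected_kb, low, high in rows:
        low_pct = retrieval_rate(low, expected_kb)
        high_pct = retrieval_rate(high, expected_kb)
        print(f"{label}: {low_pct:.0f}-{high_pct:.0f}%")
```

A rate above 100%, as with SC-3, signals that the expected-size estimate was too low, which is why the rate is only as reliable as that estimate.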
Agent Ecosystem Testing