Key Findings for Copilot’s Web Fetch Behavior, Copilot-interpreted

Test Workflow

1. Run `python web_content_retrieval_testing_framework.py --test {test ID} --track interpreted`
2. Review the terminal output
3. Copy the provided prompt, which asks the model to report on fetch results: character count, token estimate, truncation status, content completeness, Markdown formatting integrity, and tool visibility
4. Open a new Copilot session in VS Code and paste the prompt into the chat window
5. Skip any tool call prompts for local scripts or code execution
6. Capture the model’s full text response and observations as the interpreted finding; any gap between the model’s self-report and actual fetch behavior is itself a finding
7. Log structured metadata as described in framework-reference.md
8. Ensure results are saved to /results/copilot-interpreted/results.csv*

*Results logged as “Methods tested: vscode-chat” reflect a manual process in which prompts are copy-pasted into the Copilot chat window. Copilot has no publicly documented backend web content retrieval mechanism; tool logs identified fetch_webpage. See the Friction Note for analysis.

Platform Limit Summary

Limit	Observed
Hard Character Limit	None detected: `fetch_webpage` returns relevance-ranked excerpts by design, not raw pages with a byte ceiling
Hard Token Limit	None detected as a fixed ceiling: avg 7,313 tokens across 55 runs; output varies by relevance ranking
Output Consistency	High variance: output ranges from 651 chars (`EC-3` JSON payload) to ~150,000 chars (`SC-3` Wikipedia); the same URL and model can produce a 2x difference
Content Selection Behavior	Relevance-ranked excerpting: tool returns semantically filtered chunks keyed to a query parameter, not sequential page content
Truncation Pattern	`...` markers throughout output are retrieval-layer elision indicators, not byte-boundary cutoffs
Redirect Chains	Successfully follows: tested 5-level redirect chain in `EC-3`; User-Agent value internally truncated in returned JSON
Self-reported Completeness	Unreliable: model flags `...` markers as truncation evidence but may misattribute the cause; the markers are a structural property of `fetch_webpage` output and likely don’t indicate that a size limit was hit
Model Routing	Unstable: `Auto` dispatches to at least 5 distinct models with no documented routing logic and no UI indication when switching occurs
Tool Substitution	Agent autonomously attempts local code execution (`pylanceRunCodeSnippet`, `zsh`) despite prompt guardrails

Results Details

Model Selector	`Auto`
Models Observed	`Claude Haiku 4.5`, `Claude Sonnet 4.6`, `GPT-5.3-Codex`, `GPT-5.4`, `Grok Code Fast 1`, `Raptor mini (Preview)`
Total Tests	55
Distinct URLs	11
Input Size Range	~2 KB–256 KB
Truncation Events	54 / 55
Average Output Size	29,239 chars
Average Token Count	7,313 tokens
Truncation Detection	Model assertion, verbatim last-50-chars, Markdown integrity, `...` elision marker count
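
The detection signals listed in the Truncation Detection row can be sketched roughly as follows. This is a minimal illustration, not the framework’s actual implementation: the function name `truncation_signals`, the `source_tail` argument (the end of the page as obtained independently), and the 4-chars-per-token heuristic are all assumptions. The heuristic is inferred only from the reported averages (29,239 chars against 7,313 tokens).

```python
import re

# Matches a run of three dots or a single ellipsis character.
ELISION = re.compile(r"\.{3}|\u2026")
FENCE = "`" * 3  # Markdown code fence, built this way to keep the sketch fence-safe


def truncation_signals(output: str, source_tail: str) -> dict:
    """Collect simple truncation signals for one fetch result (hypothetical helper)."""
    tail = source_tail[-50:]  # last 50 chars of the independently obtained page
    return {
        "output_chars": len(output),
        # Assumed heuristic: roughly 4 characters per token.
        "token_estimate": len(output) // 4,
        "elision_markers": len(ELISION.findall(output)),
        # True only if the page's real ending appears verbatim in the output.
        "tail_present": bool(tail) and tail in output,
        # An odd number of fences suggests a code block was cut or stitched.
        "unbalanced_code_fence": output.count(FENCE) % 2 == 1,
    }


sample = "Here is some relevant context from the web page: intro ... middle ... end of page"
signals = truncation_signals(sample, "the real end of page")
```

A high `elision_markers` count combined with `tail_present` being true would be consistent with excerpting rather than a byte cutoff, since the end of the page was still reached.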

Cross-run Output Variance

The chart below plots output_chars for each run on a logarithmic y-axis, colored by model_observed. Points are jittered slightly on the x-axis so overlapping runs remain visible. Hover over any point to see test ID, model, and exact char count.

SC-3 returned 115K–150K chars, the highest output of any test by a wide margin. Among tests with five completed runs, BL-3 shows the highest within-test variance: Claude Haiku 4.5 (0.3x) runs returned 87K and 42K chars, while GPT-family runs on the same URL clustered at 15K–22K.

Test	Category	r1 chars	r2 chars	r3 chars	r4 chars	r5 chars	Variance (max/min)
BL-1	Baseline 87 KB	24,500	timeout	4,500	3,200	8,750	*
BL-2	Baseline 20 KB	4,200	2,950	8,472	4,300	4,850	2.9x
BL-3	Baseline 256 KB	22,500	15,000	87,000	21,000	42,850	5.8x
SC-2	Code blocks 82 KB	13,847	8,000	12,500	13,250	16,900	2.1x
SC-3	Wikipedia 102 KB	130,000	150,000	125,000	120,000	115,000	1.3x
SC-4	Markdown Guide 31 KB	32,500	48,500	33,000	30,000	16,250	3.0x
EC-1	Landing Page 102 KB	14,000	14,000	14,800	7,200	6,400	2.3x
EC-3	Redirect Chain 2 KB	651	651	890	874	1,090	1.7x
EC-6	Raw Markdown 61 KB	60,000	60,000	40,000	40,000	40,000	1.5x
OP-4	Auto-chunking 256 KB	33,000	25,000	12,500	25,000	12,000	2.8x

*Excluding the timeout, BL-1 has a variance of 7.7x, the highest of any test. No value is reported in the table because the timeout is a failed run rather than a retrieval result, and counting it as 0 chars would make the variance calculation meaningless.
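
The Variance column is consistent with a max/min ratio over completed runs. A minimal sketch of that calculation is shown below, using run sizes from the table. The function name `spread_ratio` and the use of `None` to represent a timeout are assumptions for illustration.

```python
from typing import Optional


def spread_ratio(run_chars: list[Optional[int]]) -> Optional[float]:
    """Max/min ratio of output sizes across completed runs.

    None marks a failed run (for example a timeout) and is excluded
    rather than treated as 0 chars.
    """
    completed = [c for c in run_chars if c is not None]
    if len(completed) < 2 or min(completed) <= 0:
        return None  # not enough valid runs to compare
    return max(completed) / min(completed)


runs = {
    "BL-1": [24_500, None, 4_500, 3_200, 8_750],  # r2 timed out
    "BL-3": [22_500, 15_000, 87_000, 21_000, 42_850],
    "SC-3": [130_000, 150_000, 125_000, 120_000, 115_000],
}

for test_id, chars in runs.items():
    print(test_id, f"{spread_ratio(chars):.1f}x")
# prints:
# BL-1 7.7x
# BL-3 5.8x
# SC-3 1.3x
```

Excluding failed runs instead of coercing them to 0 avoids a division by zero and keeps the ratio a statement about real retrieval results only.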

Truncation Analysis

#	Finding	Tests	Observed	Conclusion
1	`fetch_webpage` performs relevance-ranked excerpting, not raw HTTP retrieval	All tests	Tool preamble visible across runs: “Here is some relevant context from the web page [URL]:”; output is semantically filtered chunks separated by `...` markers, not a sequential page dump with a byte cutoff	`fetch_webpage` is an excerpt retrieval tool by design; character count variance across runs reflects relevance-ranking variance, not a size ceiling being hit at different points
2	No fixed character or token ceiling detected	`SC-3` `BL-3` `EC-6`	`SC-3` Wikipedia runs returned 115k-150k chars; `BL-3` `Claude Haiku` run returned 87k chars; no run hit a clean hard cutoff boundary	If a ceiling exists, it’s high enough that no test has reached it; the practical constraint is the relevance model’s excerpt selection, not a byte limit
3	Output variance is high and model-dependent	`BL-3` `SC-4` `OP-4`	`BL-3` shows 5.8x variance across 5 runs; `Claude Haiku 4.5` returned 87k chars in a single fetch with no self-diagnosis; GPT-family models returned 15k–22,500 chars with 2 fetches and self-diagnosis	Model routing is an uncontrolled variable; runs of the same test with different `model_observed` values aren’t comparable
4	GPT-family and Claude-family models exhibit distinct fetch behaviors	`BL-3` `SC-3` `SC-4` `OP-4`	GPT-family: 2–4 fetch invocations per run, self-diagnoses first result as insufficient and re-fetches; Claude-family: 1 fetch invocation per run, no self-diagnosis or re-fetch, higher output size	Behavioral split is at the model-family level, not run-level noise; fetch invocation count and output size are confounded with model routing
5	Agent misidentifies `fetch_webpage`’s architectural excerpting as truncation	All tests	Models consistently flag `...` markers and repeated sections as truncation evidence, but these are the tool’s own elision indicators from its relevance-ranking layer, not byte-boundary artifacts	`H1-yes` results confirm the full page wasn’t returned but can’t confirm a fixed character ceiling; the tool may not be capable of sequential full-page retrieval by design
6	Redirect chains followed transparently; structured JSON payloads partially truncated	`EC-3`	5-level redirect chain followed silently to `/get`; returned JSON structurally complete (args, headers, origin, URL present), but User-Agent value internally truncated with `...` markers; trailing “Pretty-print” UI element confirms HTML DOM extraction, not raw HTTP response	`fetch_webpage` follows redirects without user awareness; even small structured payloads are subject to internal value truncation; tool retrieves rendered HTML, not the raw API response body
7	Landing and navigation pages return substantially less content than docs pages	`EC-1`	Gemini API landing page consistently returned 6,400–14,800 chars against ~100KB expected; agent noted page body is largely collapsed to navigation links with little dense prose for relevance model to extract	Low retrieval rates reflect URL type, not a lower size ceiling; relevance-based extraction returns less content from nav pages because there’s less extractable prose
8	Tool substitution attempts persist despite explicit prompt guardrails	`BL-1` `BL-2` `EC-3`	Agent attempted `pylanceRunCodeSnippet` and `zsh` shell commands across multiple tests despite prompts explicitly prohibiting local scripts; in one case agent asserted compliance while triggering the tool prompt	Prompt guardrails alone can’t prevent autonomous tool substitution; skipped attempts should be flagged as methodology deviations; don’t rely on the agent’s own compliance evaluation to decide whether shell commands count as “scripts”
9	`fetch_webpage` undocumented; tool parameters not consistently surfaced	All tests	Tool has no public docs; asking Copilot directly returns deflection; `query` parameter and `urls` array only surfaced in one `SC-4` run with `Claude Sonnet 4.6`; most runs expose only tool name and preamble string	Tool behavior, size limits, and invocation conditions are opaque; results reflect observed tool output, not an API contract
10	`H5` auto-chunking hypothesis not applicable to `fetch_webpage`	`OP-4`	`fetch_webpage` returns relevance-ranked semantic excerpts; no sequential chunk boundary exists to paginate from; agent re-fetches are diagnostic retries on the same excerpted payload, not continuation requests	`OP-4` hypothesis assumes sequential retrieval that `fetch_webpage` can’t perform; a different retrieval tool would be required to test `H5` meaningfully

Model Routing Distribution

Model	Runs Observed	Fetch Pattern	Avg Output, chars
`Claude Haiku 4.5`	4	1 invocation; no self-diagnosis; highest output ceiling	~42,000
`Claude Sonnet 4.6`	10	1–2 invocations; no self-diagnosis	~15,000
`GPT-5.3-Codex`	30	2–4 invocations; self-diagnoses, re-fetches	~25,000
`GPT-5.4`	3	2 invocations; self-diagnoses	~19,000
`Grok Code Fast 1`	1	1 invocation	~8,500
`Raptor mini`	6	1 invocation; lowest output of any model	~4,500
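
Because model routing is an uncontrolled variable, one way to compare like with like is to group runs by model family before computing spread. The sketch below does this for the five BL-3 runs, using the family attribution given in the cross-run variance discussion (the 87K and 42K runs from Claude Haiku 4.5, the rest from GPT-family models). The helper name `spread_by_group` is hypothetical, and the grouping is an illustration rather than part of the framework.

```python
from collections import defaultdict

# BL-3 run sizes (r1..r5) labeled by model family.
bl3_runs = [
    ("GPT-family", 22_500),
    ("GPT-family", 15_000),
    ("Claude-family", 87_000),
    ("GPT-family", 21_000),
    ("Claude-family", 42_850),
]


def spread_by_group(runs):
    """Return the max/min ratio of output sizes within each group."""
    groups = defaultdict(list)
    for group, chars in runs:
        groups[group].append(chars)
    return {group: max(sizes) / min(sizes) for group, sizes in groups.items()}


per_family = spread_by_group(bl3_runs)
for family, ratio in per_family.items():
    print(family, f"{ratio:.1f}x")
# prints:
# GPT-family 1.5x
# Claude-family 2.0x
```

Within each family the spread is far smaller than the 5.8x seen across all five runs, which fits the observation that much of the BL-3 variance tracks model routing.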

Perception Gap

Output from fetch_webpage can’t be validated against expected page size alone: the tool’s relevance-ranked excerpting means character count reflects content selection, not a size ceiling. The models’ truncation self-reports are consistently correct in identifying incomplete content, but wrong about the cause.

Test	Expected	Returned	Retrieval Rate	Agent’s Characterization
`SC-3` Wikipedia	~102 KB	115,000–150,000 chars	~113–147% of chars*	“Truncated - repeated `...` markers and section stitching”
`BL-3` Tutorial	~256 KB	15,000–87,000 chars	6–34%	“Truncated - condensed/excerpted extraction”
`EC-1` Gemini Landing	~100 KB	6,400–14,800 chars	6–15%	“Truncated - curated retrieval summary”
`EC-6` `SPEC.md`	~61 KB	40,000–60,000 chars	65–98%	“Truncated - structurally transformed, not raw file”
`EC-3` Redirect/JSON	~2 KB	651–1,090 chars	32–53%	“Truncated - User-Agent value internally cut”

*SC-3 apparent over-retrieval reflects Wikipedia’s actual page size exceeding the ~102 KB input_est_chars estimate, not a measurement error
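
The Retrieval Rate column can be approximated as returned characters divided by the expected size. The sketch below assumes 1 KB is 1,000 chars and one character per byte; both are simplifications, and the function name `retrieval_rate` is hypothetical. A different KB convention shifts some rows by a point or two.

```python
def retrieval_rate(returned_chars: int, expected_kb: float, chars_per_kb: int = 1000) -> float:
    """Percentage of the expected page size that was returned.

    Assumes one character per byte; chars_per_kb is an assumed convention.
    """
    return 100.0 * returned_chars / (expected_kb * chars_per_kb)


# EC-1 Gemini landing page: ~100 KB expected, 6,400-14,800 chars returned.
ec1_low = retrieval_rate(6_400, 100)
ec1_high = retrieval_rate(14_800, 100)

# SC-3 Wikipedia: ~102 KB expected, 115,000-150,000 chars returned.
sc3_low = retrieval_rate(115_000, 102)
sc3_high = retrieval_rate(150_000, 102)

print(f"EC-1: {ec1_low:.0f}-{ec1_high:.0f}%")
print(f"SC-3: {sc3_low:.0f}-{sc3_high:.0f}%")
# prints:
# EC-1: 6-15%
# SC-3: 113-147%
```

A rate above 100% signals that the expected-size estimate is too low, not that the tool returned more than the page contains.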