Key Findings for Copilot’s Web Fetch Behavior, Copilot-interpreted
Copilot-interpreted Test Workflow:
1. Run `python web_content_retrieval_testing_framework.py --test {test ID} --track interpreted`
2. Review the terminal output
3. Copy the provided prompt asking the model to report on fetch results:
character count, token estimate, truncation status, content completeness,
Markdown formatting integrity, and tool visibility
4. Open a new Copilot chat session in VS Code and paste the prompt into the chat window
5. Skip any tool call prompts for local scripts or code execution
6. Capture the model's full text response and observations as the interpreted finding; any gap between the model's self-report and actual fetch behavior is itself a finding
7. Log structured metadata as described in `framework-reference.md`
8. Ensure log results are saved to `/results/copilot-interpreted/results.csv`*
*Results logged as “Methods tested: vscode-chat” reflect a manually operated testing process in which prompts are copy-pasted into the Copilot chat window. Copilot has no publicly documented backend web-content retrieval mechanism; `fetch_webpage` was identified through tool logs. Read the Friction Note for analysis.
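Steps 7 and 8 of the workflow can be sketched as a small logging helper. The field names below are illustrative assumptions; the authoritative schema is defined in `framework-reference.md`.

```python
import csv
from pathlib import Path

# Hypothetical column names modeled on the metrics this report tracks;
# the real schema is defined in framework-reference.md.
FIELDS = [
    "test_id", "track", "model_observed", "output_chars",
    "token_estimate", "truncation_detected", "methods_tested",
]

def log_run(row: dict, results_path: str) -> None:
    """Append one interpreted-track run to the results CSV,
    writing a header row the first time the file is created."""
    path = Path(results_path)
    path.parent.mkdir(parents=True, exist_ok=True)
    is_new = not path.exists()
    with path.open("a", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=FIELDS)
        if is_new:
            writer.writeheader()
        # Missing fields are written as empty strings rather than dropped.
        writer.writerow({k: row.get(k, "") for k in FIELDS})
```

Appending rather than overwriting keeps all 55 runs in one file, which the aggregation steps later in this report assume.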
Platform Limit Summary
| Limit | Observed |
|---|---|
| Hard Character Limit | None detected: fetch_webpage returns relevance-ranked excerpts by design, not raw pages with a byte ceiling |
| Hard Token Limit | None detected as a fixed ceiling: avg 7,313 tokens across 55 runs; output varies by relevance ranking |
| Output Consistency | High variance; EC-3 JSON payload 651 chars to SC-3 Wikipedia ~150,000 chars; same URL and model can produce 2x difference |
| Content Selection Behavior | Relevance-ranked excerpting: tool returns semantically filtered chunks keyed to a query parameter, not sequential page content |
| Truncation Pattern | ... markers throughout output are retrieval-layer elision indicators, not byte-boundary cutoffs |
| Redirect Chains | Successfully follows: tested a 5-level redirect chain in EC-3; the User-Agent value was internally truncated in the returned JSON |
| Self-reported Completeness | Unreliable: the model flags ... markers as truncation evidence but may misattribute the cause; elision is a structural property of fetch_webpage, likely not a size limit being hit |
| Model Routing | Unstable: Auto dispatches to at least 5 distinct models with no documented routing logic and no UI indication when switching occurs |
| Tool Substitution | Agent autonomously attempts local code execution (pylanceRunCodeSnippet, zsh) despite prompt guardrails |
Results Details
| Parameter | Value |
|---|---|
| Model Selector | Auto |
| Models Observed | Claude Haiku 4.5, Claude Sonnet 4.6, GPT-5.3-Codex, GPT-5.4, Grok Code Fast 1, Raptor mini (Preview) |
| Total Tests | 55 |
| Distinct URLs | 11 |
| Input Size Range | ~2KB–256KB |
| Truncation Events | 54 / 55 |
| Average Output Size | 29,239 chars |
| Average Token Count | 7,313 tokens |
| Truncation Detection | Model assertion, verbatim last-50-chars, Markdown integrity, ... elision marker count |
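The truncation-detection row above lists four mechanical signals; three of them can be sketched as a small helper. This is a reconstruction for illustration, not the framework's actual code.

```python
import re

def truncation_signals(output: str) -> dict:
    """Compute mechanical truncation heuristics on a fetched payload:
    elision-marker count, Markdown fence balance, and the verbatim tail."""
    # Count ... / Unicode ellipsis markers (the tool's retrieval-layer elision indicators).
    elisions = len(re.findall(r"(?:\.\.\.|\u2026)", output))
    # An odd number of ``` fences suggests a mid-block cutoff rather than clean excerpting.
    fences_balanced = output.count("```") % 2 == 0
    return {
        "output_chars": len(output),
        "elision_markers": elisions,
        "markdown_fences_balanced": fences_balanced,
        "last_50_chars": output[-50:],
    }
```

The fourth signal, model assertion, is the model's own self-report and cannot be computed mechanically; comparing it against these signals is what surfaces the perception gap discussed later.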
Cross-run Output Variance
The chart below plots output_chars for each run on a logarithmic y-axis, colored by
model_observed. Points are jittered slightly on the x-axis so overlapping runs remain visible.
SC-3 returned 115k–150k chars, the highest output of any test by a wide margin. BL-3 shows the highest within-test variance: Claude Haiku 4.5 returned 87k and 42k chars while GPT-family runs on the same URL clustered at 15k–22k.
| Test | Category | r1 chars | r2 chars | r3 chars | r4 chars | r5 chars | Variance |
|---|---|---|---|---|---|---|---|
| BL-1 | Baseline - 87KB | 24,500 | timeout | 4,500 | 3,200 | 8,750 | * |
| BL-2 | Baseline - 20KB | 4,200 | 2,950 | 8,472 | 4,300 | 4,850 | 2.9x |
| BL-3 | Baseline - 256KB | 22,500 | 15,000 | 87,000 | 21,000 | 42,850 | 5.8x |
| SC-2 | Code blocks - 82KB | 13,847 | 8,000 | 12,500 | 13,250 | 16,900 | 2.1x |
| SC-3 | Wikipedia - 102KB | 130,000 | 150,000 | 125,000 | 120,000 | 115,000 | 1.3x |
| SC-4 | Markdown Guide - 31KB | 32,500 | 48,500 | 33,000 | 30,000 | 16,250 | 3.0x |
| EC-1 | Landing page - 102KB | 14,000 | 14,000 | 14,800 | 7,200 | 6,400 | 2.3x |
| EC-3 | Redirect chain - 2KB | 651 | 651 | 890 | 874 | 1,090 | 1.7x |
| EC-6 | Raw Markdown - 61KB | 60,000 | 60,000 | 40,000 | 40,000 | 40,000 | 1.5x |
| OP-4 | Auto-chunking - 256KB | 33,000 | 25,000 | 12,500 | 25,000 | 12,000 | 2.6x |
*Excluding the timeout gives a variance of 7.7x, the highest of any test, but computing variance against a 0 or a timeout is meaningless: those values represent failed runs, not real retrieval results.
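The Variance column is the max/min ratio over successful runs. A minimal sketch, excluding timeouts and zero-char runs per the footnote above:

```python
def variance_ratio(run_chars):
    """Max/min ratio over successful runs. Timeouts and zero-char
    entries are excluded so a failed run can't inflate the spread."""
    ok = [c for c in run_chars if isinstance(c, (int, float)) and c > 0]
    if len(ok) < 2:
        return None  # not enough successful runs to compare
    return round(max(ok) / min(ok), 1)
```

Applied to the BL-1 row this reproduces the 7.7x figure from the footnote, and to BL-3 the 5.8x in the table.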
Truncation Analysis
| # | Finding | Tests | Observed | Conclusion |
|---|---|---|---|---|
| 1 | fetch_webpage performs relevance-ranked excerpting, not raw HTTP retrieval | All tests | Tool preamble visible across runs: “Here is some relevant context from the web page [URL]:”; output is semantically filtered chunks separated by ... markers, not a sequential page dump with a byte cutoff | fetch_webpage is an excerpt-retrieval tool by design; character-count variance across runs reflects relevance-ranking variance, not a size ceiling hit differently |
| 2 | No fixed character or token ceiling detected | SC-3, BL-3, EC-6 | SC-3 Wikipedia runs returned 115k–150k chars; the BL-3 Claude Haiku run returned 87k chars; no run hit a clean hard cutoff boundary | If a ceiling exists, it's high enough that no test has reached it; the practical constraint is the relevance model's excerpt selection, not a byte limit |
| 3 | Output variance is high and model-dependent | BL-3, SC-4, OP-4 | BL-3 shows 5.8x variance across 5 runs; Claude Haiku 4.5 returned 87k chars in a single fetch with no self-diagnosis; GPT-family models returned 15,000–22,500 chars with 2 fetches and self-diagnosis | Model routing is an uncontrolled variable; runs of the same test with different model_observed values aren't comparable |
| 4 | GPT-family and Claude-family models exhibit distinct fetch behaviors | BL-3, SC-3, SC-4, OP-4 | GPT-family: 2–4 fetch invocations per run, self-diagnoses the first result as insufficient and re-fetches; Claude-family: 1 fetch invocation per run, no self-diagnosis or re-fetch, higher output size | The behavioral split is at the model-family level, not run-level noise; fetch invocation count and output size are confounded with model routing |
| 5 | Agent misidentifies fetch_webpage's architectural excerpting as truncation | All interpreted runs | Models consistently flag ... markers and repeated sections as truncation evidence, but these are the tool's own elision indicators from its relevance-ranking layer, not byte-boundary artifacts | H1-yes results confirm the full page wasn't returned but can't confirm a fixed character ceiling; the tool may not be capable of sequential full-page retrieval by design |
| 6 | Redirect chains followed transparently; structured JSON payloads partially truncated | EC-3, all runs | 5-level redirect chain followed silently to /get; returned JSON structurally complete (args, headers, origin, URL present) but the User-Agent value is internally truncated with ... markers; a trailing “Pretty-print” UI element confirms HTML DOM extraction, not a raw HTTP response | fetch_webpage follows redirects without user awareness; even small structured payloads are subject to internal value truncation; the tool retrieves rendered HTML, not the raw API response body |
| 7 | Landing and navigation pages return substantially less content than docs pages | EC-1 | The Gemini API landing page consistently returned 6,400–14,800 chars against ~100KB expected; the agent noted the page body largely collapses to navigation links, with little dense prose for the relevance model to extract | Low retrieval rates reflect URL type, not a lower size ceiling; relevance-based extraction returns less content from nav pages because there is less extractable prose |
| 8 | Tool substitution attempts persist despite explicit prompt guardrails | BL-1, BL-2, EC-3 | The agent attempted pylanceRunCodeSnippet and zsh shell commands across multiple tests despite prompts explicitly prohibiting local scripts; in one case the agent asserted compliance while triggering the tool prompt | Prompt guardrails alone can't prevent autonomous tool substitution; skipped attempts should be flagged as methodology deviations; the agent's compliance evaluation did not treat shell commands as “scripts” |
| 9 | fetch_webpage is undocumented; tool parameters not consistently surfaced | All tests | The tool has no public docs; asking Copilot directly returns deflection; the query parameter and urls array surfaced in only one SC-4 run with Claude Sonnet 4.6; most runs expose only the tool name and preamble string | Tool behavior, size limits, and invocation conditions are opaque; results reflect observed tool output, not an API contract |
| 10 | H5 auto-chunking hypothesis not applicable to fetch_webpage | OP-4, all runs | fetch_webpage returns relevance-ranked semantic excerpts; no sequential chunk boundary exists to paginate from; agent re-fetches are diagnostic retries on the same excerpted payload, not continuation requests | The OP-4 hypothesis assumes sequential retrieval that fetch_webpage can't perform; a different retrieval tool would be required to test H5 meaningfully |
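The `/get` endpoint, the `args`/`headers`/`origin` keys, and the “Pretty-print” element in finding 6 suggest EC-3 targets an httpbin-style service. Assuming that, the baseline redirect behavior `fetch_webpage` is compared against can be sketched with the standard library; `fetch-probe/0.1` is a made-up User-Agent, and the second helper mirrors the EC-3 User-Agent completeness check.

```python
import json
import urllib.request

def fetch_redirect_chain(base="https://httpbin.org", hops=5, timeout=10):
    """Follow an httpbin-style N-hop redirect chain (/redirect/N resolves to
    /get) and return the final URL plus the parsed JSON body. urllib follows
    redirects transparently, matching the behavior observed in EC-3."""
    req = urllib.request.Request(
        f"{base}/redirect/{hops}",
        headers={"User-Agent": "fetch-probe/0.1"},  # hypothetical probe UA
    )
    with urllib.request.urlopen(req, timeout=timeout) as resp:
        return resp.geturl(), json.loads(resp.read().decode())

def user_agent_intact(payload: dict, expected: str) -> bool:
    """EC-3 check: is the echoed User-Agent complete, or elided with '...'?"""
    ua = payload.get("headers", {}).get("User-Agent", "")
    return ua == expected and "..." not in ua
```

A raw HTTP client returns the User-Agent value whole; the internal `...` elision in EC-3 is therefore attributable to the tool's retrieval layer, not the server.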
Model Routing Distribution
| Model | Runs Observed | Fetch Pattern | Avg Output, chars |
|---|---|---|---|
| GPT-5.3-Codex | 30 | 2–4 invocations; self-diagnoses, re-fetches | ~25,000 |
| Claude Sonnet 4.6 | 10 | 1–2 invocations; no self-diagnosis | ~15,000 |
| Claude Haiku 4.5 | 4 | 1 invocation; no self-diagnosis; highest output ceiling | ~42,000 |
| Raptor mini (Preview) | 6 | 1 invocation; lowest output of any model | ~4,500 |
| GPT-5.4 | 3 | 2 invocations; self-diagnoses | ~19,000 |
| Grok Code Fast 1 | 1 | 1 invocation | ~8,500 |
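Under the column names assumed earlier in this report (`model_observed`, `output_chars`), the distribution above can be regenerated from `results.csv`; a sketch:

```python
import csv
from collections import defaultdict

def routing_distribution(csv_path: str):
    """Aggregate run count and average output size per observed model.
    Column names are assumptions based on this report's logging schema."""
    runs = defaultdict(list)
    with open(csv_path, newline="") as f:
        for row in csv.DictReader(f):
            chars = row.get("output_chars", "")
            if chars.isdigit():  # skip timeouts and other failed runs
                runs[row["model_observed"]].append(int(chars))
    return {
        model: {"runs": len(sizes),
                "avg_output_chars": round(sum(sizes) / len(sizes))}
        for model, sizes in runs.items()
    }
```

Because routing is uncontrolled, regenerating this table after each batch is the only way to know which models actually served the runs.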
Perception Gap
| Test | Expected | Returned | Retrieval Rate | Agent’s Characterization |
|---|---|---|---|---|
| SC-3 - Wikipedia | ~102KB | 115,000–150,000 chars | ~113–147% of chars* | “Truncated - repeated ... markers and section stitching” |
| BL-3 - Atlas Search | ~256KB | 15,000–87,000 chars | 6–34% | “Truncated - condensed/excerpted extraction” |
| EC-1 - Gemini Landing | ~100KB | 6,400–14,800 chars | 6–15% | “Truncated - curated retrieval summary” |
| EC-6 - SPEC.md | ~61KB | 40,000–60,000 chars | 65–98% | “Truncated - structurally transformed, not raw file” |
| EC-3 - Redirect/JSON | ~2KB | 651–1,090 chars | 32–53% | “Truncated - User-Agent value internally cut” |
*SC-3's apparent over-retrieval reflects Wikipedia's actual page size exceeding the ~102KB `input_est_chars` estimate, not a measurement error
Implication for agents: `fetch_webpage` output can't be validated against expected page size alone; the tool's relevance-ranked excerpting means character count reflects content selection, not a size ceiling. Model truncation self-reports are consistently correct in identifying incomplete content, but may be wrong about the cause.
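The Retrieval Rate column and the over-100% caveat can be expressed as a small helper. The classification thresholds below are illustrative, not part of the framework:

```python
def retrieval_rate(returned_chars: int, input_est_chars: int) -> float:
    """Returned chars as a fraction of the estimated input size."""
    return round(returned_chars / input_est_chars, 2)

def classify(rate: float) -> str:
    """Coarse interpretation aid. A rate alone can't separate a size
    ceiling from relevance-ranked content selection, so low rates are
    labeled 'selective extraction' rather than 'truncated'."""
    if rate > 1.0:
        # The SC-3 case: the input size estimate was too low,
        # not evidence of over-retrieval.
        return "input size estimate too low"
    return "selective extraction" if rate < 0.5 else "mostly complete"
```

Applied to the table above, SC-3's best run classifies as an estimate problem while BL-3 and EC-1 classify as selective extraction, matching the agent characterizations.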