Copilot-interpreted vs Raw

Two Mechanisms, Two Failure Modes

The central finding from both tracks has two parts. First, fetch_webpage’s relevance-ranked excerpting is architectural and cannot be suppressed by prompt. Second, curl substitution, when it occurs, produces complete files that the verification script confirms as intact, but in raw server format with no transformation layer. These two behaviors produce outputs so different in character that neither track alone is sufficient to characterize Copilot’s web fetch behavior. Cross-referencing both is the only way to separate retrieval mechanism from retrieval quality.

Track Design

The Copilot-interpreted track captures what the model believes it retrieved: how much content it saw, whether the fetch was complete, and how it characterizes truncation. This is the model’s self-report.

The raw track captures what Copilot actually saved to disk: exact byte counts, hexdump analysis, MD5 checksums, and token counts measured by the verification script against the saved file rather than taken from the model’s estimate.
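As a rough illustration of the raw-track measurements, the sketch below computes a byte count, a decoded character count, an MD5 checksum, and a hex dump of the file tail using only the Python standard library. The function name `measure_saved_file` and the 32-byte tail length are assumptions made for this example; this is not the verification script itself, and token counting is omitted because it depends on a third-party tokenizer.

```python
import hashlib
from pathlib import Path


def measure_saved_file(path, tail_bytes=32):
    """Return exact, reproducible measurements of a saved output file.

    Hypothetical helper: it mirrors the kinds of checks the raw track
    describes (byte count, MD5, tail inspection), not the real script.
    """
    data = Path(path).read_bytes()
    return {
        # Size in bytes, the same figure `wc -c` reports
        "byte_count": len(data),
        # Characters after UTF-8 decoding; differs from bytes on non-ASCII text
        "char_count": len(data.decode("utf-8", errors="replace")),
        # Checksum for detecting whether two runs saved identical files
        "md5": hashlib.md5(data).hexdigest(),
        # Last bytes as hex, useful for spotting a mid-tag or mid-character cutoff
        "tail_hex": data[-tail_bytes:].hex(" "),
    }
```

Calling `measure_saved_file` on a `raw_output_{test_id}.txt` file from two runs and comparing the `md5` values is one way to check the byte-identical claim for a given URL.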

The gap between these two tracks is itself a finding. If Copilot reports “no truncation” in chat but the raw data shows a relevance-ranked excerpt, that discrepancy belongs in the spec. If Copilot reports “no truncation” and the raw data confirms a complete file, but the file is raw HTML with no readable content, that’s a different kind of discrepancy.

	Interpreted	Raw
Measures	Model’s interpretation of what it fetched	Filesystem measurements of saved output
Character Counts	Model estimates; range-reported or heuristic	`wc -c` on disk (a byte count) - exact, reproducible
Completeness	Model’s prose assessment of truncation	MD5 comparison, hexdump analysis, verification script
Token Counts	Model estimates; chars/4 heuristic, word count substitution, or Node regex	`cl100k_base` - exact tokenizer
Reproducibility	High variance; same URL and model can produce 2x difference	Byte-identical within same file version; different versions detectable by MD5
Output Format	Chat UI rendering	Raw file on disk, `raw_output_{test_id}.txt`
Best For	Understanding model perception gaps and `fetch_webpage` behavior	Citable measurements; retrieval mechanism identification

Key Observations

Tracks disagree on truncation: the analyzer script flagged divergence across the following tests.

Test ID	Interpreted: Truncated	Raw: Truncated	Output size difference
`BL-1`	yes	no	484,186 chars
`BL-2`	yes	yes	2,073 chars
`BL-3`	yes	yes	4,178 chars
`EC-1`	yes	no	3,897 chars
`EC-3`	yes	no	217 chars
`EC-6`	yes	no	13,187 chars
`OP-4`	yes	no	438,506 chars
`SC-1`	yes	no	116,783 chars
`SC-2`	yes	no	17,203,733 chars
`SC-3`	yes	no	663,987 chars
`SC-4`	yes	no	33,122 chars

The interpreted track reported truncation in 54 of 55 runs. The raw track reported truncation in 16 of 55 runs. The divergence isn’t a measurement error; it reflects the truncation taxonomy. The interpreted track is detecting fetch_webpage’s relevance-ranked excerpting and correctly identifying that the full page wasn’t returned. The raw track is measuring the saved file, which on curl substitution runs is complete and byte-perfect. Both are right about what they’re measuring, but they’re measuring different things.
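The per-test comparison can be sketched as a small tally over the two tracks’ truncation flags. The function name `compare_tracks` and the dictionary layout are assumptions for illustration, not the analyzer script’s actual interface; the flags are copied from four rows of the divergence table above.

```python
from collections import Counter


def compare_tracks(interpreted, raw):
    """Tally agreement between the two tracks' truncation flags.

    Both arguments map a test ID to True (truncated) or False.
    Only test IDs present in both tracks are compared.
    """
    outcomes = Counter()
    disagreements = []
    for test_id in sorted(interpreted.keys() & raw.keys()):
        if interpreted[test_id] == raw[test_id]:
            outcomes["agree"] += 1
        else:
            outcomes["disagree"] += 1
            disagreements.append(test_id)
    return outcomes, disagreements


# Flags taken from the BL-1, BL-2, EC-1, and SC-2 rows of the table
interpreted = {"BL-1": True, "BL-2": True, "EC-1": True, "SC-2": True}
raw = {"BL-1": False, "BL-2": True, "EC-1": False, "SC-2": False}

outcomes, disagreements = compare_tracks(interpreted, raw)
print(outcomes["disagree"], disagreements)  # 3 ['BL-1', 'EC-1', 'SC-2']
```

A disagreement here is not an error flag. It marks a test where the two tracks measured different things and the retrieval mechanism should be checked.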

Output size difference reflects mechanism, not content

The SC-2 divergence, ~17,203,733 chars, is the most extreme in the dataset and illustrates why averaging across mechanisms is misleading. The interpreted track received relevance-ranked excerpts from fetch_webpage averaging ~13,000 chars. The raw track received complete files via curl averaging ~17 million chars across 5 runs. These aren’t two measurements of the same thing; they’re two different retrieval mechanisms producing fundamentally different outputs on the same URL.

fetch_webpage and curl produce non-comparable results

The raw track’s average output size, ~787,084 chars, is dramatically higher than the interpreted track’s ~29,239 chars because curl substitution runs deliver complete HTML files while fetch_webpage runs return relevance-ranked excerpts. Within the raw track, fetch_webpage runs and curl runs are themselves not comparable. The tools_used field is currently the only mechanism for separating these populations after the fact.

Copilot’s truncation self-report is unreliable

On the interpreted track, the model correctly identifies that content is incomplete in 54 of 55 runs, but misattributes the cause, flagging fetch_webpage’s architectural excerpting as truncation rather than a design property of the tool. On the raw track, the model reports no truncation in most curl runs, which is accurate for file completeness, but misses that the delivered format is unusable. “No truncation reported” appears in all truncation categories documented in the Friction Note.

Token estimation accuracy differs by track and model

The interpreted track doesn’t produce a raw output file, so Copilot derives its token estimate from whatever relevance-ranked subset fetch_webpage returned. The raw track saves the raw output file, but token estimation methodology varies: the chars/4 heuristic undercounts HTML-dense content, word count substitution produces a ~5x undercount on some runs, and Node regex tokenization differs from cl100k_base by ~27 tokens on Markdown content. The verification script’s cl100k_base count is the only consistently reliable figure across both tracks.
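To show why these estimation methods diverge, the sketch below implements three stand-in estimators in Python. The function names and the regex pattern are assumptions chosen for illustration; they are not the estimators Copilot actually ran, and the regex is only a Python approximation of a word-and-punctuation split, not the Node tokenizer observed in the runs.

```python
import re


def chars_div_4(text):
    """chars/4 heuristic: one token per four characters."""
    return len(text) // 4


def word_count(text):
    """Whitespace-delimited word count used as a token stand-in."""
    return len(text.split())


def regex_tokens(text):
    """Count word runs and individual punctuation marks (illustrative pattern)."""
    return len(re.findall(r"\w+|[^\w\s]", text))


def estimate_all(text):
    """Run every estimator so the spread between them is visible."""
    return {
        "chars_div_4": chars_div_4(text),
        "word_count": word_count(text),
        "regex_tokens": regex_tokens(text),
    }


# Markup-heavy input has few spaces, so the word count collapses
sample = '<div class="nav"><a href="/docs">Docs</a></div>'
print(estimate_all(sample))
```

None of these is a substitute for the exact count. Assuming the `tiktoken` package is installed, `tiktoken.get_encoding("cl100k_base").encode(text)` returns the token list whose length is the reference figure.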

Hypotheses largely not testable as designed

The hypothesis framework assumes a conventional retrieval pipeline: an agent fetches a URL, content passes through a size ceiling, and the model receives a truncated-but-sequential result. Copilot’s behavior doesn’t fit this model. fetch_webpage performs relevance-ranked excerpt assembly with no detectable fixed ceiling, which rules out H1 and H2 as the primary mechanism. curl substitution delivers complete files with no truncation at all, making H1 vacuously false for those runs. H3 (structure-aware truncation) isn’t testable when the retrieval mechanism is unknown at run time. H4 (MCP servers override native limits) and H5 (agent auto-chunks after truncation) weren’t observable because the substitution behavior itself was the finding. The analyzer script returned H1-yes in 60 of 110 combined runs, but those results reflect three different underlying phenomena (architectural excerpting, complete retrieval, and chat rendering cutoff) that the hypothesis framework wasn’t designed to distinguish. The hypotheses remain in the dataset as logged fields, but the primary finding is that Copilot’s retrieval behavior requires a different analytical frame than the one the hypotheses assumed.

Implications for Agent Developers, Docs Teams

The tool used matters more than the prompt, model, or URL. fetch_webpage and curl produce outputs so different in character that runs using different mechanisms aren’t replicates of the same condition - even when the URL, model, and prompt are identical. The raw track can confirm which mechanism ran by checking the saved file and tools_used field. The interpreted track can’t; the model’s self-report is the only signal, and it isn’t reliable. Cross-referencing both tracks is the only way to separate what Copilot retrieved from what it understood about what it retrieved.
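One way to keep the two populations apart during analysis is to group runs by mechanism before computing any average. The sketch below assumes each run is a dictionary whose `tools_used` value is a list of tool names and whose `output_chars` value is the saved output size; only the `tools_used` field name comes from the dataset, while the record shape, the `output_chars` name, and the rule that any run invoking curl counts as a curl run are assumptions.

```python
from collections import defaultdict
from statistics import mean


def mean_size_by_mechanism(runs):
    """Average output size per retrieval mechanism instead of across all runs."""
    groups = defaultdict(list)
    for run in runs:
        tools = run.get("tools_used") or ["unknown"]
        # Assumption: a run that invoked curl at all is classified as a curl run
        mechanism = "curl" if "curl" in tools else tools[0]
        groups[mechanism].append(run["output_chars"])
    return {mechanism: mean(sizes) for mechanism, sizes in groups.items()}


# Made-up values for illustration only, not measurements from the dataset
runs = [
    {"tools_used": ["fetch_webpage"], "output_chars": 12_000},
    {"tools_used": ["fetch_webpage"], "output_chars": 14_000},
    {"tools_used": ["fetch_webpage", "curl"], "output_chars": 500_000},
]
print(mean_size_by_mechanism(runs))
```

Reporting one figure per mechanism avoids the misleading blended average described above, where a handful of curl runs dominate the mean.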

When evaluating or designing testing frameworks or workflows that include agentic web fetch behavior, consider what each approach can and can’t confirm:

Use Case	Interpreted	Raw
Retrieval Mechanism Identification	✗ Mechanism not reliably surfaced	✓ `tools_used` field and headers files identify `fetch_webpage` vs `curl`
File Integrity Verification	✗ No saved file; model estimates only	✓ MD5 checksums, byte counts, hexdump tail analysis
Format Classification	Partial - model describes output format in prose	✓ Verification script detects HTML vs Markdown vs JSON from saved file
Ground Truth Baselines	✗ Self-report only	✓ What Copilot actually saved vs what the model claims
Model Perception Gaps	✓ Reveals misreporting of completeness and cause	Partial - verifier confirms file integrity but not model’s interpretation
`fetch_webpage` Behavior Characterization	✓ Relevance-ranking, elision patterns, non-linear reassembly visible in chat	Partial - file reflects tool output but internal query parameters not surfaced
Tool Substitution Detection	✓ Model reasoning sometimes reveals `curl` preference explicitly	✓ `tools_used` field confirms mechanism; headers files corroborate
User-facing Experience	✓ Reflects what a developer interacting with Copilot actually sees	✗ Saved file diverges from chat display on over-delivery runs