Copilot-interpreted vs Raw
The central finding from both tracks is that fetch_webpage’s relevance-ranked
excerpting is architectural and not suppressible by prompt, and that curl
substitution, when it occurs, produces complete files the verification script confirms as
intact but in raw server format with no transformation layer. These two behaviors produce
outputs so different in character that neither track alone is sufficient to characterize
Copilot’s web fetch behavior. Cross-referencing both is the only way to separate retrieval
mechanism from retrieval quality.
Track Design
Two test tracks measure the same Copilot web fetch behaviors:
- **Copilot-interpreted track** captures what the model believes it retrieved: how much content it saw, whether the fetch was complete, and how it characterizes truncation. This is the model’s self-report.
- **Raw track** captures what Copilot actually saved to disk: exact byte counts, hexdump analysis, MD5 checksums, and token counts measured by the verification script against the saved file, not the model’s estimate.
The gap between these two tracks is itself a finding. If Copilot reports “no truncation” in chat but the raw data shows a relevance-ranked excerpt, that discrepancy belongs in the spec. If Copilot reports “no truncation” and the raw data confirms a complete file, but the file is raw HTML with no readable content, that’s a different kind of discrepancy.
| | Interpreted Track | Raw Track |
|---|---|---|
| Measures | Model’s interpretation of what it fetched | Filesystem measurements of saved output |
| Character Counts | Model estimates; range-reported or heuristic | `wc -c` on disk: exact, reproducible |
| Completeness | Model’s prose assessment of truncation | MD5 comparison, hexdump analysis, verification script |
| Token Counts | Model estimates: chars/4 heuristic, word-count substitution, or Node regex | `cl100k_base`: exact tokenizer count |
| Reproducibility | High variance; same URL and model can produce a 2x difference | Byte-identical within the same file version; different versions detectable by MD5 |
| Output Format | Chat UI rendering | Raw file on disk (`raw_output_{test_id}.txt`) |
| Best For | Understanding model perception gaps and `fetch_webpage` behavior | Citable measurements; retrieval mechanism identification |
Key Observations
- **Tracks disagree on truncation.** The analyzer script flagged divergence across tests:

  | Test ID | Interpreted: Truncated | Raw: Truncated | Output size difference |
  |---|---|---|---|
  | BL-1 | yes | no | 484,186 chars |
  | BL-2 | yes | yes | 2,073 chars |
  | BL-3 | yes | yes | 4,178 chars |
  | EC-1 | yes | no | 3,897 chars |
  | EC-3 | yes | no | 217 chars |
  | EC-6 | yes | no | 13,187 chars |
  | OP-4 | yes | no | 438,506 chars |
  | SC-1 | yes | no | 116,783 chars |
  | SC-2 | yes | no | 17,203,733 chars |
  | SC-3 | yes | no | 663,987 chars |
  | SC-4 | yes | no | 33,122 chars |

  The interpreted track reported truncation in 54 of 55 runs; the raw track reported truncation in 16 of 55. The divergence isn’t a measurement error; it reflects the truncation taxonomy. The interpreted track is detecting `fetch_webpage`’s relevance-ranked excerpting and correctly identifying that the full page wasn’t returned. The raw track is measuring the saved file, which on `curl` substitution runs is complete and byte-perfect. Both are right about what they’re measuring, but they’re measuring different things.
- **Output size difference reflects mechanism, not content.** The SC-2 divergence (~17,203,733 chars) is the most extreme in the dataset and illustrates why averaging across mechanisms is misleading. The interpreted track received relevance-ranked excerpts from `fetch_webpage` averaging ~13,000 chars. The raw track received complete files via `curl` averaging ~17 million chars across 5 runs. These aren’t two measurements of the same thing; they’re two different retrieval mechanisms producing fundamentally different outputs on the same URL.
- **`fetch_webpage` and `curl` produce non-comparable results.** The raw track’s average output size (~787,084 chars) is dramatically higher than the interpreted track’s (~29,239 chars) because `curl` substitution runs deliver complete HTML files while `fetch_webpage` runs return relevance-ranked excerpts. Within the raw track, `fetch_webpage` runs and `curl` runs are themselves not comparable. The `tools_used` field is currently the only mechanism for separating these populations after the fact.
- **Copilot’s truncation self-report is unreliable.** On the interpreted track, the model correctly identifies that content is incomplete in 54 of 55 runs, but misattributes the cause, flagging `fetch_webpage`’s architectural excerpting as truncation rather than a design property of the tool. On the raw track, the model reports no truncation in most `curl` runs, which is accurate for file completeness but misses that the delivered format is unusable. “No truncation reported” appears in all truncation categories documented in the Friction Note.
- **Token estimation accuracy differs by track and model.** The interpreted track doesn’t produce a raw output file, so Copilot derives its counts from whatever `fetch_webpage` returned as a relevance-ranked subset. The raw track saves the raw output file, but token estimation methodology varies: the chars/4 heuristic undercounts HTML-dense content, word-count substitution produces a ~5x undercount on some runs, and Node regex tokenization differs from `cl100k_base` by ~27 tokens on Markdown content. The verification script’s `cl100k_base` count is the only consistently reliable figure across both tracks.
- **Hypotheses largely not testable as designed.** The hypothesis framework assumes a conventional retrieval pipeline: an agent fetches a URL, content passes through a size ceiling, and the model receives a truncated-but-sequential result. Copilot’s behavior doesn’t fit this model. `fetch_webpage` performs relevance-ranked excerpt assembly with no detectable fixed ceiling, which rules out H1 and H2 as the primary mechanism. `curl` substitution delivers complete files with no truncation at all, making H1 vacuously false for those runs. H3 (structure-aware truncation) isn’t testable when the retrieval mechanism is unknown at run time. H4 (MCP servers override native limits) and H5 (agent auto-chunks after truncation) weren’t observable because the substitution behavior itself was the finding. The analyzer script returned H1-yes in 60 of 110 combined runs, but those results reflect three different underlying phenomena: architectural excerpting, complete retrieval, and chat rendering cutoff, which the hypothesis framework wasn’t designed to distinguish. The hypotheses remain in the dataset as logged fields, but the primary finding is that Copilot’s retrieval behavior requires a different analytical frame than the one the hypotheses assumed.
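The cross-track comparison the analyzer performs reduces to a per-run field check. The sketch below illustrates the idea; the field names (`test_id`, `tools_used`, `interpreted_truncated`, `raw_truncated`) are assumptions about the log schema, and the records are invented, not drawn from the dataset.

```python
# Illustrative run records; field names mirror the logs described above
# but are assumptions, and the values here are invented.
runs = [
    {"test_id": "X-1", "tools_used": ["curl"],
     "interpreted_truncated": True, "raw_truncated": False},
    {"test_id": "X-2", "tools_used": ["fetch_webpage"],
     "interpreted_truncated": True, "raw_truncated": True},
]

def tracks_diverge(run: dict) -> bool:
    """A run diverges when the model's self-report disagrees with the
    filesystem measurement; that gap is the cross-track signal."""
    return run["interpreted_truncated"] != run["raw_truncated"]

for run in runs:
    if tracks_diverge(run):
        mech = ", ".join(run["tools_used"])
        print(f"{run['test_id']}: tracks disagree (mechanism: {mech})")
```

Keeping the mechanism in the divergence report matters: as noted above, disagreement on a `curl` run means something different from disagreement on a `fetch_webpage` run.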
Implications for Agent Developers
The tool used matters more than the prompt, model, or URL. fetch_webpage and curl produce outputs
so different in character that runs using different mechanisms aren’t replicates of the same condition -
even when the URL, model, and prompt are identical. The raw track can confirm which mechanism ran by
checking the saved file and tools_used field. The interpreted track can’t; the model’s self-report is
the only signal, and it isn’t reliable. Cross-referencing both tracks is the only way to separate what
Copilot retrieved from what it understood about what it retrieved.
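One practical consequence is that aggregate statistics should be computed per mechanism, never pooled across mechanisms. A minimal sketch, again assuming a `tools_used` field; the character counts are invented to show the partitioning, not to reproduce the dataset:

```python
from collections import defaultdict
from statistics import mean

# Invented per-run character counts; a single pooled average would
# mix two retrieval mechanisms and describe neither.
runs = [
    {"tools_used": "fetch_webpage", "chars": 13_000},
    {"tools_used": "fetch_webpage", "chars": 12_400},
    {"tools_used": "curl", "chars": 17_000_000},
]

# Partition runs by mechanism before computing any statistic.
by_mechanism = defaultdict(list)
for run in runs:
    by_mechanism[run["tools_used"]].append(run["chars"])

averages = {mech: mean(sizes) for mech, sizes in by_mechanism.items()}
```

Reporting `averages` per mechanism keeps excerpt-sized `fetch_webpage` outputs from being swamped by megabyte-scale `curl` files, which is exactly the distortion the SC-2 divergence illustrates.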
| Use Case | Interpreted Track | Raw Track |
|---|---|---|
| Retrieval Mechanism Identification | ✗ Mechanism not reliably surfaced | ✓ `tools_used` field and headers files identify `fetch_webpage` vs `curl` |
| File Integrity Verification | ✗ No saved file; model estimates only | ✓ MD5 checksums, byte counts, hexdump tail analysis |
| Format Classification | Partial: model describes output format in prose | ✓ Verification script detects HTML vs Markdown vs JSON from saved file |
| Ground Truth Baselines | ✗ Self-report only | ✓ What Copilot actually saved vs what the model claims |
| Model Perception Gaps | ✓ Reveals misreporting of completeness and cause | Partial: verifier confirms file integrity but not the model’s interpretation |
| `fetch_webpage` Behavior Characterization | ✓ Relevance-ranking, elision patterns, non-linear reassembly visible in chat | Partial: file reflects tool output but internal query parameters not surfaced |
| Tool Substitution Detection | ✓ Model reasoning sometimes reveals `curl` preference explicitly | ✓ `tools_used` field confirms mechanism; headers files corroborate |
| User-facing Experience | ✓ Reflects what a developer interacting with Copilot actually sees | ✗ Saved file diverges from chat display on over-delivery runs |