<|°_°|> Agent Ecosystem Testing GitHub ↗

Key Findings for Copilot’s Web Fetch Behavior, Raw


Raw Track Test Workflow:

1. Run `python web_content_retrieval_testing_framework.py --test {test ID} --track raw`
2. Review the terminal output
3. Copy the provided prompt asking Copilot to retrieve the URL, save the content
   exactly as received, report file size, MD5 checksum, character/line/word/token
   counts, code blocks, table rows, headers, hexdump of last 256 bytes, and any
   visible tool/server identifiers
4. Open a new Copilot chat session in VSCode and paste the prompt into the chat window
5. Allow terminal tool calls; skip any tool call prompts for Python scripts
6. Run the verifier: `python web_content_retrieval_verify_raw_results.py {test ID}`
7. Log both Copilot-reported and verifier-measured values as separate fields;
   the delta between them is the finding
8. Ensure log results are saved to `/results/raw/results.csv`
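
Steps 6 and 7 can be sketched as follows. This is a minimal illustration, not the project's verifier: the function names `measure_file` and `delta` and the field names are hypothetical, and the real `web_content_retrieval_verify_raw_results.py` may measure things differently.

```python
import hashlib
from pathlib import Path


def measure_file(path):
    """Measure a saved file directly, independent of any self-report."""
    data = Path(path).read_bytes()
    text = data.decode("utf-8", errors="replace")
    return {
        "size_bytes": len(data),              # what `wc -c` would report
        "md5": hashlib.md5(data).hexdigest(),
        "chars": len(text),                   # Unicode code points, not bytes
        "lines": len(text.splitlines()),      # counts a final unterminated line too
        "words": len(text.split()),           # whitespace-delimited words
    }


def delta(reported, measured):
    """Compare self-reported values against measured ones, field by field.

    Numeric fields yield measured minus reported; other fields yield a
    match flag; fields missing from the self-report yield None.
    """
    out = {}
    for key, measured_value in measured.items():
        reported_value = reported.get(key)
        if reported_value is None:
            out[key] = None
        elif isinstance(measured_value, int):
            out[key] = measured_value - reported_value
        else:
            out[key] = measured_value == reported_value
    return out
```

Keeping the reported value, the measured value, and the delta as three separate columns means a later change to the measurement method does not overwrite what Copilot originally claimed.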

*Results logged as “Methods tested: vscode-chat” reflect a manually operated testing process in which prompts are copy-pasted into the Copilot chat window. The raw track captures the actual saved file independently of Copilot’s self-report, enabling direct comparison. Read the Friction Note for methodology complications.


Platform Limit Summary

| Limit | Observed |
|---|---|
| Retrieval Mechanism | Unstable: agent autonomously selects between fetch_webpage and curl with no prompt control; selection determines output format more than any other variable |
| fetch_webpage Output | Relevance-ranked excerpts: HTML-to-Markdown transformation with chunk-based reassembly, non-linear ordering, and `...` elision markers; never returns full sequential page content |
| curl Output | Byte-perfect full retrieval: complete file transfer confirmed by content-length matching saved file size; no transformation layer; delivers raw bytes in server format |
| Retrieval Completeness | Format-dependent: curl runs confirm 100% byte retrieval; fetch_webpage runs return a relevance-ranked subset with no fixed character ceiling observed |
| Truncation Pattern | fetch_webpage: retrieval-layer excerpting; curl: complete retrieval with format-driven unreadability; chat: rendering cutoff |
| Tool Substitution | Autonomous and model-dependent: curl substitution occurs without prompt instruction and without disclosure; GPT-5.4 substituted in all EC-6 curl runs; Claude Sonnet 4.6 substituted after citing fetch_webpage limitations |
| Self-reported Metrics | Mixed reliability: file size and word count typically accurate; token estimates vary by methodology (chars/4 heuristic, word count substitution, cl100k_base); structural counts (code blocks, table rows) are methodology-dependent and under-specified |
| Agentic Over-Delivery | Consistent pattern: agent autonomously produces unrequested artifacts including headers files, hexdump files, analysis reports, and verbatim content in chat; artifact type correlates with retrieval mechanism |
| Model Routing | Unstable: Auto dispatches across model families within a single test series; tool selection behavior, metric accuracy, and output format all vary by model |
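
The token-estimate methodologies named under Self-reported Metrics diverge sharply on markup-heavy content. The sketch below shows the two heuristics side by side; the function names are hypothetical, and the cl100k_base path is left out because it needs a tokenizer library.

```python
def estimate_tokens_chars_div_4(text):
    """chars/4 heuristic: one token per four characters, rounded down."""
    return len(text) // 4


def estimate_tokens_word_count(text):
    """Word count substitution: whitespace-delimited words reported as tokens."""
    return len(text.split())


# HTML with no whitespace is a single "word" but dozens of characters.
html_row = "<tr><td>alpha</td><td>beta</td></tr>"
print(estimate_tokens_chars_div_4(html_row))  # 9
print(estimate_tokens_word_count(html_row))   # 1
```

Because the two heuristics can disagree by large factors on raw HTML, a logged token count is only interpretable alongside a record of which method produced it.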

Results Details

| Field | Value |
|---|---|
| Model Selector | Auto |
| Models Observed | Claude Haiku 4.5, Claude Sonnet 4.6, GPT-4.1-Codex, GPT-4.5-Codex, GPT-5.3-Codex, GPT-5.4 |
| Total Tests | 55 |
| Distinct URLs | 11 |
| Input Size Range | ~2 KB–256 KB |
| Truncation Events | Copilot self-reported 16 / 55 |
| Average Output Size | 787,084 chars |
| Average Token Count | 284,463 tokens |
| Verification Method | Python verification script measuring raw output; delta between Copilot-reported and verified values |

The raw track average output size is dramatically higher than the interpreted track average because curl substitution runs deliver complete files while fetch_webpage runs return relevance-ranked excerpts. Averaging across both mechanisms in the same figure produces a number that doesn’t describe either.
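
A small sketch of the averaging problem, using the SC-4 byte sizes reported on this page. The record layout and the `tools_used` values assigned to each run are assumptions made for illustration; the per-run `tools_used` field in the results log is authoritative.

```python
from collections import defaultdict
from statistics import mean

# Sizes from the SC-4 rows on this page; mechanism labels are assumed.
runs = [
    {"test": "SC-4 run 3", "tools_used": "fetch_webpage", "size_bytes": 29_984},
    {"test": "SC-4 run 2", "tools_used": "curl", "size_bytes": 65_622},
    {"test": "SC-4 run 4", "tools_used": "curl", "size_bytes": 65_622},
]


def mean_size_by_mechanism(records):
    """Group runs by retrieval mechanism and average each group separately."""
    groups = defaultdict(list)
    for record in records:
        groups[record["tools_used"]].append(record["size_bytes"])
    return {tool: mean(sizes) for tool, sizes in groups.items()}


pooled = mean(r["size_bytes"] for r in runs)
per_mechanism = mean_size_by_mechanism(runs)
print(round(pooled))   # 53743, a size no run actually produced
print(per_mechanism)   # fetch_webpage and curl averaged separately
```

Reporting one average per mechanism keeps each figure tied to an outcome that actually occurred.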

Truncation Analysis

| # | Finding | Tests | Observed | Conclusion |
|---|---|---|---|---|
| 1 | Tool selection is the primary variable in output type | All tests | fetch_webpage produces relevance-ranked Markdown excerpts; curl produces byte-perfect raw files in server format; the same URL, model, and prompt produce either outcome non-deterministically | The tools_used field captures which retrieval mechanism was used, which is more predictive of output format than URL, page size, model, or prompt wording |
| 2 | fetch_webpage performs HTML-to-Markdown transformation with non-linear reassembly | BL-3, SC-1, SC-4 | Raw output order doesn’t match page reading order; intro appears near bottom, not top; `...` separators and repeated H1 headers are chunking artifacts, not page content; UI elements stripped, footer preserved verbatim | fetch_webpage is performing structural transformation with chunk-based reassembly, not truncation or semantic filtering; output reflects the tool’s internal chunking format, not page structure |
| 3 | curl substitution delivers complete files but at the cost of readability | SC-3, SC-4, EC-1, EC-3 runs 4–5, EC-6 runs 1, 3–5 | content-length matches saved file size exactly across all curl runs; Wikipedia 793,987 bytes, markdownguide.org 65,622 bytes, SPEC.md 85,325 bytes all transferred completely; output is raw HTML, JSON, or Markdown with no transformation | Complete retrieval and useful output are separable; curl achieves the former and fails the latter by design; the substitution is a presentation failure, not a retrieval failure |
| 4 | fetch_webpage output quality correlates with source HTML structure | SC-4 run 3 | Claude Sonnet 4.6 via fetch_webpage on markdownguide.org produced well-formed processed Markdown; returned 29,984 bytes against a 65,622-byte HTML source | Source HTML convertibility is a necessary but not sufficient condition for high-fidelity fetch_webpage output; tool selection remains outside prompt control, and 3 of 5 SC-4 runs produced raw HTML via curl on the same URL |
| 5 | Tool substitution changes the identity Copilot presents to target servers | EC-3 runs 4–5 | fetch_webpage presents a full browser-style User-Agent identifying VS Code as Code/1.113.0 running on Chrome and Electron; curl presents curl/8.7.1; httpbin.org /get echoes received headers, making the identity difference directly observable in the response payload | Tool substitution has infrastructure implications beyond output format; servers that serve different content by User-Agent would return different responses to fetch_webpage vs curl runs on the same URL |
| 6 | Copilot-reported metric accuracy varies systematically by field type | All raw tests | File size and word count: reliable; character counts: encoding-methodology dependent (wc -c vs Unicode code points); token counts: three distinct failure modes (chars/4 undercount, word count substitution, cl100k_base); structural counts: methodology-dependent on HTML vs Markdown content | Metric fields aren’t uniformly reliable; token count is least reliable and most dependent on which computation path the agent selects; the verification script is the authoritative source for all counts |
| 7 | Structural count discrepancies reflect output format, not counting errors | SC-4 runs 4–5 | Copilot reported 24 code blocks and 35 table rows on a raw HTML file; verifier reported 0 code blocks and 0 table rows; both are correct, as Copilot counted HTML `<pre>` and `<tr>` tags while the verification script counted Markdown fence patterns and pipe rows | Code block and table row counts aren’t comparable across runs that produce different output formats; zeros in verification output mark runs where the expected output format never arrived, not measurement failures |
| 8 | Agentic over-delivery escalates with workspace artifact accumulation | SC-3, SC-4, EC-1, EC-6 | Agent produces unrequested artifacts (headers files, hexdump files, analysis reports, verbatim content in chat) at increasing rates across the run series; later runs reference prior run artifacts in reasoning chains and generate cross-run comparisons unprompted | Workspace artifact volume is an uncontrolled session variable; later runs in a session are behaviorally different from earlier runs due to accumulated context; session ordering is a methodology confounder |
| 9 | Headers files are produced by two distinct mechanisms with different implications | BL-3, SC-3, SC-4, EC-6 | fetch_webpage runs sometimes save HTTP metadata autonomously; curl runs produce headers as structural output of the tool when invoked with capture flags, which is expected behavior; both produce .headers.txt files but represent different phenomena | Headers file presence isn’t a uniform signal; confirm which retrieval mechanism produced it before interpreting it as agentic over-delivery; curl headers expose direct CDN infrastructure details invisible through fetch_webpage’s abstraction layer |
| 10 | Redirect chains followed transparently; JSON payloads subject to intra-value truncation | EC-3 all runs | 5-level redirect chain followed silently to /get; returned JSON structurally complete but User-Agent value internally truncated with `...` in the fetch_webpage tool response; saved file contained the complete value, suggesting silent reconstruction from prior knowledge | fetch_webpage’s `...` elision operates at field-value level, not only at chunk-boundary level; saved file and tool response may differ; reconstruction is undetectable without the tool response log |
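
Finding 7 is easier to see with both counting methods run on the same input. The sketch below is an assumption about how each side might count, not the verifier's actual patterns; the function names are hypothetical.

```python
import re


def count_html_structures(text):
    """Count structures the way an HTML-oriented pass might: by opening tags."""
    return {
        "code_blocks": len(re.findall(r"<pre\b", text, flags=re.IGNORECASE)),
        "table_rows": len(re.findall(r"<tr\b", text, flags=re.IGNORECASE)),
    }


def count_markdown_structures(text):
    """Count structures the way a Markdown-oriented pass might.

    Fences are matched only at column 1; each opening/closing pair is one
    block. Pipe rows include header and separator rows.
    """
    fence_lines = re.findall(r"^`{3}", text, flags=re.MULTILINE)
    pipe_rows = re.findall(r"^\|.*\|[ \t]*$", text, flags=re.MULTILINE)
    return {"code_blocks": len(fence_lines) // 2, "table_rows": len(pipe_rows)}


raw_html = "<table><tr><td>a</td></tr><tr><td>b</td></tr></table><pre>code</pre>"
print(count_html_structures(raw_html))      # {'code_blocks': 1, 'table_rows': 2}
print(count_markdown_structures(raw_html))  # {'code_blocks': 0, 'table_rows': 0}
```

Both results are correct for their own definition, which is why a structural count only means something when the output format of the run is recorded next to it.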

Retrieval Mechanism Distribution

| Mechanism | Runs Observed | Output Format | File Completeness |
|---|---|---|---|
| fetch_webpage | ~20 runs | Relevance-ranked Markdown excerpts with `...` elision | Partial; excerpted subset of page |
| curl | ~30 runs | Raw bytes in server format (HTML, JSON, or Markdown) | Complete; byte-perfect transfer confirmed |
| fetch_webpage + curl (sequential) | ~5 runs | fetch_webpage attempted first, curl used for file save | File complete; chat content may differ |

Mechanism counts are approximate; some runs used hybrid approaches in which fetch_webpage retrieved the content and curl was used separately for headers or verification. The tools_used field is the authoritative source per run.
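
The "byte-perfect transfer confirmed" check for curl runs amounts to comparing the Content-Length header with the saved file size. Below is a minimal sketch, assuming the headers were captured as plain text (for example with curl's `--dump-header` option) and that the body was saved without decompression; the function names are hypothetical.

```python
def content_length_from_headers(headers_text):
    """Return the last Content-Length value in a raw header dump, or None.

    The last value is used because a dump that follows redirects contains
    one header block per hop, and only the final hop describes the saved body.
    """
    value = None
    for line in headers_text.splitlines():
        name, sep, rest = line.partition(":")
        if sep and name.strip().lower() == "content-length":
            value = int(rest.strip())
    return value


def is_complete_transfer(headers_text, saved_bytes):
    """True or False when a Content-Length is present; None when it is absent."""
    expected = content_length_from_headers(headers_text)
    if expected is None:
        return None
    return expected == len(saved_bytes)
```

A `None` result matters: responses sent with chunked transfer encoding carry no Content-Length, so completeness for those runs has to be established another way.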

Verifier Delta Summary

| Test | Copilot-reported Size | Verified Size | MD5 Match? | Token Delta | Structural Count Notes |
|---|---|---|---|---|---|
| EC-3 runs 1–3 | 868–869 bytes | 868–869 bytes | | chars/4 or word-count substitution | 0 code blocks, 0 table rows, 0 headers (JSON) |
| EC-3 runs 4–5 | 254 bytes | 254 bytes | | chars/4 undercount by ~45 | Smaller payload; minimal headers |
| SC-4 run 3 | 29,984 bytes | 29,984 bytes | | +35 chars (multi-byte UTF-8) | Code blocks: 48 vs 25; table rows: omitted |
| SC-4 runs 2, 4 | 65,622 bytes | 65,622 bytes | | HTML-dense; heuristic undercounts | Code blocks: HTML vs Markdown methodology mismatch |
| EC-6 runs 3–5 | 85,325 bytes | 85,325 bytes | | Node regex ~27 tokens under cl100k_base | Code blocks: 1 vs 4 (column-1 pattern only) |
| EC-1 run 5 | 138,715 bytes | 138,715 bytes | | Word count substituted for token estimate (5x undercount) | 0 code blocks, 0 table rows, 0 headers (raw HTML) |
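
The multi-byte UTF-8 character delta in the SC-4 run 3 row comes from counting bytes on one side and Unicode code points on the other. A short illustration, with a hypothetical function name:

```python
def count_bytes_and_code_points(text):
    """Compare a byte count (like `wc -c`) with a code-point count (like `wc -m`)."""
    encoded = text.encode("utf-8")
    return {
        "bytes": len(encoded),
        "code_points": len(text),
        "delta": len(encoded) - len(text),
    }


# A curly apostrophe (U+2019) is one code point but three UTF-8 bytes.
print(count_bytes_and_code_points("Copilot\u2019s"))
# {'bytes': 11, 'code_points': 9, 'delta': 2}
```

Pure ASCII content yields a delta of zero, so a non-zero character delta on an otherwise matching file points to the counting method rather than to missing content.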

Across all raw track runs, Copilot self-reported no truncation regardless of whether the saved file was a relevance-ranked excerpt, a complete raw HTML file, or a byte-perfect Markdown transfer. “No truncation reported” isn’t a reliable signal for any of the three retrieval outcomes. The verification script and the saved file are the only authoritative ground truth.