Friction Note: Roadblocks While Refining Methodology
Agent’s Choice: Truncation vs. Architectural Excerpting
Probing whether fetch_webpage’s output represents hard-cutoff truncation or a designed
architectural behavior included a direct question to Copilot:
“Please describe what web content truncation means to you. Is this an architecturally designed component of
fetch_webpage?”
Before answering, the agent searched the workspace: reviewing results.csv,
web_content_retrieval_testing_framework.py, and framework-reference.md, then grounded its
response in the test data rather than general tool knowledge. This makes the answer worth noting,
but also means the agent was partly reflecting the repo’s own conclusions back. It isn’t independent
confirmation of fetch_webpage’s architecture, but the agent synthesizing the same behavioral
evidence that produced the finding in the first place. With that caveat, the agent’s characterization
was precise, describing two failure modes:
- Hard Cutoff: the tool fetches a full page body and stops after some byte, character, or token budget
- Architectural Excerpting: the tool never intends to return the full page, and instead returns a filtered, compressed, or relevance-ranked subset
Based on the workspace evidence, the agent concluded fetch_webpage looks much closer to the
second case: “bounded excerpt retrieval” rather than truncation in the traditional sense. It
explicitly noted it can’t prove the internal implementation from the public tool surface alone,
but used the test data to suggest that the tool intentionally defines only a constrained context
window, and not a faithful full-page dump.
This characterization was independently corroborated in EC-1 run 5, in which Claude Sonnet 4.6
offered an unsolicited summary note after completing the test. Without any prompt requesting that it
characterize the tool, the agent stated that fetch_webpage performs relevance-based content
extraction keyed to the provided query parameter and returns chunked excerpts - explicitly
distinguishing this from direct HTTP fetch tools such as Cursor’s fetch and the Claude API
tool_use fetch. The query parameter detail is notable: if fetch_webpage keys its relevance
extraction to a query string, output variance across runs with identical prompts may reflect different
internal query strings the agent passes rather than nondeterminism in the retrieval layer itself.
That parameter isn’t named in chat output and remains unverifiable from the interpreted track alone.
The EC-1 run 5 agent also flagged a methodology implication: for truncation limit testing, a
longer-form documentation page rather than a landing or navigation page may better stress the
character ceiling. EC-1’s URL is a landing page whose body text is largely collapsed to navigation
links, which means the consistently low retrieval rates across EC-1 runs may reflect URL type
rather than a lower size ceiling.
Impact: runs flagged as truncated: yes across the interpreted track are using the field correctly
as an observable signal, since the full page wasn’t returned, but the underlying cause may not be a
size limit: it may be the tool’s retrieval model selecting and compressing content before it reaches the
agent. The . . . markers in output aren’t byte-boundary cutoffs; they’re the retrieval layer’s own elision
indicators. This distinction matters for interpreting output_chars across runs, as character count variance
may reflect relevance-ranking variance as much as any consistent size ceiling. If fetch_webpage
is performing bounded excerpt retrieval by design, H1 may be testing the wrong thing entirely: H1-yes
results confirm that the full page wasn’t returned, but can’t confirm a fixed character ceiling exists
to find.
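The distinction is at least roughly checkable from the logged runs. A minimal sketch, assuming the results.csv columns named in this note (test_id, truncated, output_chars) and a numeric output_chars per run: a hard cutoff should pin output_chars near one shared ceiling, while architectural excerpting predicts wide spread with no shared ceiling.

```python
# Sketch: eyeball hard-cutoff vs excerpting signals in results.csv.
# Column names are assumptions based on the fields described in this note.
import csv
from collections import defaultdict
from statistics import mean, pstdev

by_test = defaultdict(list)
with open("results.csv", newline="") as f:
    for row in csv.DictReader(f):
        if row.get("truncated", "").strip().lower() == "yes":
            by_test[row["test_id"]].append(float(row["output_chars"]))

for test_id, chars in sorted(by_test.items()):
    spread = pstdev(chars)
    # A hard cutoff should produce low spread around a shared ceiling;
    # excerpting predicts wide spread. The 5% threshold is arbitrary.
    verdict = "ceiling-like" if spread / mean(chars) < 0.05 else "excerpt-like"
    print(f"{test_id}: n={len(chars)} mean={mean(chars):.0f} spread={spread:.0f} -> {verdict}")
```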
Autonomous Tool Substitution
When prompted to retrieve a URL and report metrics, Copilot autonomously replaced the intended behavior,
fetching the URL directly, with execution of the local testing framework script via the pylanceRunCodeSnippet
MCP server tool. Rather than using a web fetch mechanism on the target URL, the agent:
- Read web_content_retrieval_testing_framework.py from the workspace
- Identified BL-1 test configuration inside the framework
- Attempted to run a Python snippet, import requests, hashlib..., via Pylance’s MCP server
- Presented the substitution as “a reliable alternate execution path” with “exact metrics”
The agent framed this as an improvement, offering “more precise measurements through local execution,” without flagging
that it was deviating from the requested method entirely. This introduces method contamination: local script execution
isn’t equivalent to Copilot’s built-in web content retrieval; it bypasses whatever fetch mechanism Copilot would
otherwise use and obscures tool visibility. One goal of the test is to observe which backend tool Copilot selects,
and running local Python defeats this entirely while reinforcing false confidence. Because the agent characterized
the substitution positively, a user who clicked Allow would receive plausible-looking data from the wrong method with
no indication anything went wrong.
Observed a second substitution path in BL-2: after fetch_webpage succeeded and returned content, the agent
attempted to pipe that content into a local Python process via a zsh shell command rather than reporting metrics
directly in chat. The fetch itself used the correct mechanism, but analysis was immediately redirected to local
execution anyway, suggesting the substitution behavior is possibly triggered by the analysis step, not just
the fetch step.
EC-3 run 5 uncovered a third substitution, and it’s behaviorally distinct from the prior two. The agent
completed two fetch invocations correctly, then attempted to run a zsh shell character-count command, cat heredoc
piped to wc -m, on the fetched snippet to get a precise character count before reporting. Unlike the BL-1 and
BL-2 cases, no workspace framework script was involved; the agent reached for shell execution independently to improve
metric precision on a simple JSON payload. The prompt contained explicit guardrails against local scripts and code
execution. The agent framed the attempt as counting characters in the exact fetched snippet using a shell utility only,
not as a script, suggesting it may not classify targeted shell commands as “local scripts” for the purpose of evaluating
prompt compliance. Three distinct substitution tool paths and trigger conditions were observed:
- pylanceRunCodeSnippet: Pylance MCP server, triggered during fetch planning when the workspace framework script is in context
- zsh shell command: Python heredoc with fetched content piped in, triggered during metric extraction after a successful fetch
- zsh shell command: targeted character-count utility with no workspace script involvement, triggered during metric reporting to improve precision
Impact: single-test prompts in Copilot may not guarantee single-mechanism execution; if the agent finds a
“smarter” path to the answer using workspace context, it may take it autonomously, producing results that
aren’t comparable to other platforms in the cross-platform study. The SC series has demonstrated this
produces inverse failure modes: curl substitution retrieves the full page byte-perfectly but delivers raw
HTML with no transformation, while fetch_webpage delivers readable excerpts but never the full page.
Neither tool gives you both, and neither choice is controllable from the prompt. The substitution isn’t a
retrieval failure, since the bytes arrive; it’s a presentation failure, and it’s invisible without inspecting
the saved file directly.
Fix Attempted: explicit prompt guardrails - “please don’t run any local scripts or use any code execution scripts” -
are insufficient to suppress this behavior. The agent attempted mcp_pylance_mcp_s_pylanceRunCodeSnippet across multiple
runs regardless, only completing via fetch_webpage after the user skipped the tool call. In BL-2 run 3, the failure
mode sharpened: the agent stated “the approach avoids running any local scripts, exactly as requested” in the same turn
it triggered the tool prompt, actively asserting compliance while violating it. Prompt wording alone can’t override
this behavior and users can’t take the agent’s self-reporting as confirmation of compliance.
Fix: beyond prompt guardrails, consider whether removing or relocating the framework script from the active workspace
context would suppress the substitution behavior at the source. Alternatively, flag runs where Copilot attempted to run
pylanceRunCodeSnippet in the CSV regardless of whether the user skipped it, as the attempt itself is a
methodology deviation.
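A sketch of what that flag could look like, assuming results.csv rows are handled as dicts and that mcp_pylance_mcp_s_pylanceRunCodeSnippet is the identifier to match; the zsh paths would need whatever name the tool log actually shows for shell execution.

```python
# Hypothetical helper: mark a run as a methodology deviation if any known
# substitution tool was attempted, whether or not the user skipped it.
KNOWN_SUBSTITUTION_TOOLS = {"mcp_pylance_mcp_s_pylanceRunCodeSnippet"}

def flag_substitution(row: dict, tools_attempted: set[str]) -> dict:
    attempted = bool(KNOWN_SUBSTITUTION_TOOLS & tools_attempted)
    row["substitution_attempted"] = "yes" if attempted else "no"
    return row
```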
Auto’s Routing Instability
Copilot’s Auto model selection routes requests across multiple distinct backend models without user
control or consistent behavior. Across 13 runs spanning BL-1, BL-2, and SC-2,
Auto inconsistently routed to Claude Haiku 4.5, Claude Sonnet 4.6, GPT-5.3-Codex, Grok Code Fast 1,
and Raptor mini (Preview). Routing doesn’t appear to follow a detectable pattern. The same URL and
prompt on consecutive runs have produced different models, and the same model on consecutive runs
has produced dramatically different character counts, suggesting that model selection varies across
both prompt type and target URL, with no documented routing logic and no indication in the
UI that a switch has occurred between runs.
Copilot’s model selection mechanism seems structurally different from Cursor’s, where the Auto
behavior didn’t expose the model name in the UI and the default model wasn’t publicly documented: there,
the model variable was invisible rather than visibly unstable. Copilot identifies the model name per run,
which makes the variance observable and therefore a measurable finding rather than a hidden confounder.
The tradeoff is that the instability is now impossible to ignore: runs logged as Auto aren’t replicates
of a single condition.
Raptor mini’s self-reporting of its fetch capability compounds this instability. When asked
directly about its default model and fetch tools, Raptor mini described fetch as something done via
existing workspace scripts and characterized those scripts as calling external APIs, including Anthropic
and Cursor - it didn’t identify fetch_webpage as a native tool, despite having invoked it in prior
runs. This conflation of workspace context with native capability means Raptor mini’s self-reported
tool visibility is unreliable, and suggests the model may not have a stable internal representation of which
web content retrieval mechanism it’s actually using.
Beyond routing instability, BL-3 output exhibited behavioral patterns:
| Behavior | GPT-5.3-Codex / GPT-5.4 | Claude Haiku 4.5 |
|---|---|---|
| Fetch Invocations | 2–3 per run; self-diagnoses first result as insufficient and re-fetches | 1 per run; no self-diagnosis or re-fetch |
| Output Size Range | ~15,000–33,000 chars across 4 runs | ~42,850–87,000 chars across 2 runs |
| Within-model Variance | Moderate | High; ~2x difference (87,000 vs 42,850) on identical prompts, same model, same sampling parameter; no observable explanation |
The behavioral split between model families is notable, but the within-model variance for
Haiku limits how much weight the output size difference can carry; a ~2x spread
across two runs on the same model and URL means the higher ceiling may not be stable or
reproducible.
Another routing variable appeared during analysis: the request multiplier suffix visible in
some model labels. When asked directly, Copilot described labels like Claude Haiku 4.5 0.3x
as a request multiplier - each prompt on that model counts as 0.3 of a premium request unit
against the plan quota, compared to 1.0 for a standard model. Auto routing therefore selects
not only across model families but across cost tiers within the same model. Whether the
multiplier also affects output budget, context window, or retrieval behavior isn’t documented,
but the BL-3 data suggests it may: the two Haiku runs returned 87,000 and
42,850 chars in single fetch invocations, while all other models on the same URL used 2–3
fetches and returned 15,000–22,500 chars. The multiplier is a third uncontrolled variable in
Auto routing, alongside model family and model version, and it isn’t logged separately;
model_observed captures the full label including the suffix, which is sufficient for
grouping but doesn’t isolate the multiplier as an independent field.
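A sketch of how the multiplier could be isolated after the fact, assuming the suffix always takes the "0.3x" form seen in the labels above and that a missing suffix means a standard 1.0 request.

```python
# Sketch: split the request-multiplier suffix out of model_observed so it can be
# grouped as its own field. The suffix format is assumed from observed labels.
import re

def split_multiplier(model_observed: str) -> tuple[str, float]:
    match = re.search(r"\s(\d+(?:\.\d+)?)x$", model_observed)
    if match:
        return model_observed[: match.start()], float(match.group(1))
    return model_observed, 1.0

print(split_multiplier("Claude Haiku 4.5 0.3x"))  # ('Claude Haiku 4.5', 0.3)
print(split_multiplier("GPT-5.3-Codex"))          # ('GPT-5.3-Codex', 1.0)
```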
Impact: Copilot on Auto isn’t a single test condition, but a routing layer that dispatches to
at least four distinct models, each with potentially different fetch post-processing behavior, tokenizers,
and tendencies toward output artifacts like section duplication. Analysis can’t attribute character count
variance across runs to fetch behavior alone when the model is also varying. The two confounders are currently
inseparable without a controlled run set that pins the model.
Methodology Decision: remain on Auto to mirror the Cursor testing framework, and treat model selection
variance as a finding rather than an inconvenience. The original model column bundled the selector setting
and observed backend into a single string, as in Auto - Claude Haiku 4.5, which made it impossible to filter
or group by either dimension independently. The column was split into two required fields: model_selector records
the UI setting (Auto) and model_observed records the backend model actually invoked (Claude Haiku 4.5). Both fields
are required per run. Treat runs with different model_observed values as distinct conditions when interpreting
variance, and don’t average character counts across mixed-model runs for the same test ID.
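A minimal sketch of the grouping this implies, assuming the split fields above exist in results.csv and pandas is available.

```python
# Sketch: summarize output_chars per (test_id, model_observed) pair so
# mixed-model runs are never collapsed into one condition.
import pandas as pd

runs = pd.read_csv("results.csv")
summary = (
    runs.groupby(["test_id", "model_observed"])["output_chars"]
        .agg(["count", "mean", "min", "max"])
        .reset_index()
)
# Each (test_id, model_observed) pair is a distinct condition; never average
# across model_observed values for the same test_id.
print(summary.to_string(index=False))
```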
Extension Upgrade Mid-Test
GitHub Copilot 0.41.1 shipped with a compatibility break against the VS Code version active at the
start of testing. The extension became non-functional mid-session; recovery required three sequential
steps: disabling Copilot, updating VS Code, then re-enabling the updated extension.
The version break interrupted session continuity in a way that differs from quota exhaustion: quota exhaustion is a known, recoverable limit with a clear resumption point, whereas a compatibility break requires environment changes that may alter state in ways that aren’t fully visible. The VS Code version, extension caching, MCP server re-initialization, and workspace reloads could each affect agent behavior independently.
Methodology Decision: copilot_version is a required field per run. Don’t average character counts
or fetch invocation counts across the version boundary and treat runs on each version as distinct conditions,
consistent with the model_observed split applied to Auto routing. If post-upgrade behavior diverges from
the 0.40.1 baseline in fetch invocation count, output size, model routing, or tool substitution patterns,
the version field is the mechanism for tracking it, but the circumstances may require a controlled rollback to
attribute that divergence to the extension specifically rather than the host environment.
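One way to populate copilot_version without manual entry is to query the VS Code CLI at run time. This sketch assumes the code command is on PATH and that the relevant extension IDs start with GitHub.copilot; verify both against the local install before trusting the field.

```python
# Sketch: capture the installed Copilot extension version(s) per run.
import subprocess

def copilot_versions() -> dict[str, str]:
    out = subprocess.run(
        ["code", "--list-extensions", "--show-versions"],
        capture_output=True, text=True, check=True,
    ).stdout
    versions = {}
    for line in out.splitlines():
        ext_id, _, version = line.partition("@")
        if ext_id.lower().startswith("github.copilot"):
            versions[ext_id] = version
    return versions
```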
fetch_webpage Intra-Value Truncation, Silent Reconstruction
EC-3 run 1 with Claude Sonnet 4.6 uncovered a truncation behavior not previously observed in the dataset:
fetch_webpage eliding content inside a single JSON field value rather than between content chunks. The agent’s
tool visibility report flagged:
“Apparent truncation marker: . . . appeared mid-User-Agent string in tool output, indicating the tool truncated content internally”
The . . . appeared inside the User-Agent header value in fetch_webpage’s tool response payload, a single string
field, not a boundary between excerpted sections. The saved raw_output_EC-3.txt file contains the complete
User-Agent string with no elision. The tool response and the saved file contain different versions of the same field.
This creates an evidential gap with two plausible explanations. The agent may have reconstructed the complete User-Agent
string from its own prior knowledge of what VS Code Copilot’s User-Agent looks like, silently substituting a known value
for the truncated one before saving. If so, the saved file contains a partially fabricated value rather than a purely
retrieved one; unflagged and undetectable without the tool response log for comparison. Alternatively, a second retrieval
call returned the complete string, but no second fetch_webpage invocation is visible in the tool chain. Neither explanation
is confirmable from the observable output alone.
What the tool visibility report confirms is that fetch_webpage’s . . . elision operates at the field-value level, not only
at the chunk-boundary level documented elsewhere. The inter-chunk . . . markers seen across the interpreted track appear
between excerpted content sections. This intra-value . . . appeared inside a single string field. Both are fetch_webpage
elision, but truncating at different granularities: one discards whole sections, the other truncates within a field. The EC-3
case is the only run where the tool response and the saved file are directly comparable on this point, because the tool
visibility table identifies what the tool returned before the agent processed it.
The agent’s self-report that it doesn’t delegate web fetch tasks to a subagent isn’t contradicted by this finding. The truncation
is consistent with fetch_webpage’s behavior throughout this testing. What’s new is the location of the truncation and the
possibility that the agent silently completed the truncated value rather than reporting the gap, which is a different fabrication
risk from the metric estimation errors documented elsewhere. Metric estimates explicitly labeled as estimates are one
thing; an unlabeled, silently completed value is another.
Impact: the tool response and the saved file aren’t guaranteed to be identical even when the circumstances don’t require an
explicit transformation. fetch_webpage may truncate inside field values, and the agent may silently reconstruct those values
before saving. The saved file is the only artifact the verification script checks; if reconstruction occurred, the verification
script has no mechanism to detect it. This is only visible when the agent uncovers tool response contents explicitly in its report,
and not all runs do. Runs where the agent doesn’t report tool response detail may contain silently reconstructed values with no
observable signal that reconstruction occurred.
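For the subset of runs that do expose tool response detail, a narrow check is possible. A sketch, assuming the ". . ." marker form quoted above and that the same field can be located in both the tool response excerpt and the saved file.

```python
# Sketch: flag saved values that are longer than an elided tool-response value
# but still match its visible prefix and suffix - the agent could not have seen
# the middle, so these are reconstruction candidates worth manual review.
ELISION = ". . ."

def reconstruction_candidate(tool_value: str, saved_value: str) -> bool:
    if ELISION not in tool_value:
        return False  # nothing was elided; the two values should simply match
    prefix, _, suffix = tool_value.partition(ELISION)
    ends_match = saved_value.startswith(prefix.strip()) and saved_value.endswith(suffix.strip())
    return ends_match and len(saved_value) > len(prefix) + len(suffix)
```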
EC-3’s URL is a redirect chain terminating at a JSON API endpoint. Determining whether intra-value truncation is specific to JSON responses, to short field values that look like they might continue, or is a general fetch_webpage behavior that’s invisible in HTML and Markdown output requires additional runs with JSON-returning sources.
fetch_webpage Not Consistently Invoked
When asked to describe its default model and web fetch and/or web content retrieval capability directly,
Raptor mini (Preview) described its fetch capability in general terms, but characterized it relative to workspace
context rather than as a native tool:
“In this repo context, fetch is usually done via provider-specific modules:
web_fetch_testing_framework.py, web_fetch_test.py, web_search_test.py… Under the hood, these scripts call external APIs (Cursor, Anthropic Claude, OpenAI search) rather than raw requests.get in a generic common tool.”
This suggests Raptor mini may conflate workspace scripts with its own fetch capability. It didn’t identify
fetch_webpage as a native tool despite having used it in BL-1 runs. Combined with the run-to-run variance in
character counts across identical prompts - 4,500 / 3,200 / 7,500–10,000 chars across runs 3–5 - this raises the
possibility that fetch_webpage isn’t always the mechanism invoked, or that its output is post-processed differently
per run.
SC-2 run 5 introduced another behavioral variant: rather than the two-invocation pattern seen in other
SC-2 runs, one for the redirect and one for content, GPT-5.3-Codex made four sequential
fetch calls to the same URL. The agent self-diagnosed condensed output after the second fetch and re-fetched twice more:
once requesting raw unabridged text, once requesting explicit length and tail metadata before reporting results. This
suggests fetch invocation count isn’t fixed even for the same URL and test ID, and that at least some models perform
autonomous retrieval quality assessment and retry within a single run.
Impact: tool visibility reporting from the agent may not reliably reflect the backend mechanism used. The agent’s self-description of its fetch behavior is inconsistent with observed tool logs, making cross-run comparisons unreliable without raw track verification. Consider logging fetch call count.
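A sketch of the fetch-call count as a derived field, assuming the run transcript is exported as plain text and that each invocation surfaces the literal tool name; the matching is a loose heuristic that will overcount if the agent also names the tool in prose.

```python
# Sketch: derive a fetch_invocations count from the saved run transcript rather
# than from the agent's self-report. Adjust the match to whatever the exported
# chat log actually contains.
from pathlib import Path

def count_fetch_invocations(transcript_path: str) -> int:
    text = Path(transcript_path).read_text(encoding="utf-8")
    return sum(1 for line in text.splitlines() if "fetch_webpage" in line)
```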
fetch_webpage Undocumented
Unlike previous platform testing, Copilot doesn’t have its default web content retrieval behavior publicly documented.
After the first successful BL-1 run, the agent reported using a tool called fetch_webpage, but this tool has
no public docs. Asking Copilot directly returns a deflection:
“Sorry, I’m unable to answer that question. Check that you selected the correct GitHub version or try a different question.”
This is consistent with the @Web evolution pattern documented in
Cursor’s Friction Note.
The fetch mechanism is agent-selected, undocumented, and identified only through tool logs.
During OP-4 run 3 GPT-5.3-Codex produced the clearest characterization of fetch_webpage’s behavior. The agent stated that
fetch_webpage doesn’t perform raw HTTP retrieval, but returns relevance-ranked semantic excerpts based on the query string
provided, with . . . markers between contextually selected chunks. The tool response preamble visible in the output
confirmed this directly:
“Here is some relevant context from the web page [url]:”
This preamble, not a raw payload header, indicates a retrieval model that samples and ranks content rather than fetching
it sequentially. The full ~250 KB page was never delivered. No contiguous truncation boundary exists because the content
was never contiguous to begin with. This reframes what truncation means: results logged as truncated may be more precisely
described as incompletely sampled, and the . . . markers throughout responses are elision indicators from the
retrieval model, not byte-boundary cutoffs.
This also has direct implications for OP-4’s test hypothesis. The hypothesis assumes a sequential fetch that the agent
could paginate by requesting the next chunk, but fetch_webpage’s relevance-ranked mechanism means there is no sequential
chunk 2 to request. This fetch mechanism alone can’t confirm or deny OP-4’s hypothesis; it would require a different
retrieval tool to test meaningfully.
A related pattern has emerged across multiple runs on the MongoDB Atlas Search tutorial URL, appearing in both OP-4 and
BL-3: the agent self-diagnoses the first fetch result as a “condensed page extraction rather than a clean raw dump” and
issues a corrective re-fetch against the same URL. The re-fetch returns the same kind of output, because the excerpted
result isn’t a retrieval error, it’s the expected output of fetch_webpage’s architecture. The agent is misidentifying a
structural property of the tool as a transient failure and attempting to correct it. This means the agent itself doesn’t
have accurate knowledge of what its own retrieval tool does, which is consistent with fetch_webpage being undocumented
even at the model level. The re-fetch attempts don’t produce fuller content, but produce a second relevance-ranked sample of
the same page, logged as additional fetch invocations in the run notes.
Impact: can’t treat fetch_webpage as a stable, documented mechanism. Its behavior, size limits, and
invocation conditions are opaque. Results logged as method: fetch_webpage reflect observed tool output,
not an API contract. The OP-4 finding additionally suggests that character count comparisons across runs may
reflect relevance-ranking variance as much as size-limit truncation. The tool may return different content samples for the
same URL depending on the query string provided to it. The retrieval layer’s internal query parameters aren’t named in
chat output. If fetch_webpage passes a query string or context vector to its relevance model, that parameter is invisible
to the user. Prompt differences can’t be responsible for excerpt selection differences, because each track has
identical prompts, but that doesn’t rule out retrieval-layer sensitivity to a query the agent passes
internally.
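Excerpt-selection variance is measurable even without visibility into the internal query: comparing the overlap of saved interpreted outputs for the same URL across runs gives a rough proxy. A sketch using word 5-gram Jaccard similarity, an arbitrary choice for illustration.

```python
# Sketch: quantify how much two runs' excerpts of the same URL actually overlap.
def shingles(text: str, n: int = 5) -> set[tuple[str, ...]]:
    words = text.split()
    return {tuple(words[i:i + n]) for i in range(max(len(words) - n + 1, 0))}

def excerpt_overlap(text_a: str, text_b: str) -> float:
    """Jaccard similarity of word n-grams; low values on identical prompts point
    at retrieval-layer selection variance rather than a fixed size ceiling."""
    a, b = shingles(text_a), shingles(text_b)
    return len(a & b) / len(a | b) if (a or b) else 1.0
```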
Free Plan Quota Exhausted Mid-Testing
Free GitHub Copilot accounts have a monthly chat message quota that may exhaust
mid-session. During SC-2 run 3, Copilot returned:
“You’ve reached your monthly chat messages quota. Upgrade to Copilot Pro (30-day free trial) or wait for your allowance to renew.”
This interrupted testing after 12 total runs across BL-1, BL-2, and SC-2 -
short of the full baseline path defined in the framework.
Impact: free-tier quota limits the number of comparable runs achievable in a single session, making it difficult to complete a full baseline before the allowance resets. Tests involving multiple runs for variance measurement are particularly affected, since each re-run of the same test ID consumes quota without producing new URL coverage.
Fix: Copilot Pro at $10/month is half the price of Cursor, and potentially free
if testing completes within the 30-day trial period. Signing up removes the message quota. Budget at
minimum three runs per test ID plus additional runs for variance on BL-1 and BL-2,
approximately 15–20 messages for a complete interpreted-track baseline.
Metric Precision
Copilot’s testing prompt asks for total character count and estimated token count. On the interpreted
track, neither figure is reliably precise. Character counts frequently come back as ranges rather than
exact integers, and token counts follow the same pattern since they’re derived from the character estimate
using a fixed ~4 chars/token heuristic. When Copilot returns a range, results.csv logs the midpoint as the
scalar value in output_chars and tokens_est.
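A sketch of the normalization applied before logging, assuming the range formats seen so far (comma-separated thousands, a dash between bounds) and the ~4 chars/token heuristic.

```python
# Sketch: normalize a reported figure or range into (output_chars, tokens_est,
# is_range). The handled formats are assumptions based on responses seen so far.
import re

def parse_char_report(report: str) -> tuple[int, int, bool]:
    numbers = [int(n.replace(",", "")) for n in re.findall(r"\d[\d,]*", report)]
    if not numbers:
        raise ValueError(f"no figure found in: {report!r}")
    is_range = len(numbers) > 1
    midpoint = sum(numbers[:2]) // 2 if is_range else numbers[0]
    return midpoint, midpoint // 4, is_range  # ~4 chars/token heuristic

print(parse_char_report("approximately 15,000-20,000 characters"))  # (17500, 4375, True)
```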
The imprecision isn’t a prompt compliance problem, but reflects a real constraint of the interpreted track.
The agent receives excerpted, ellipsis-compressed content from fetch_webpage, not the raw page, and it can’t
count characters it never saw. Pushing for exact figures would produce false precision without improving
measurement quality. The range is the correct result given the input the agent actually has.
Impact: treat output_chars and tokens_est on the interpreted track as order-of-magnitude orientation figures,
not exact measurements. They’re sufficient for confirming that truncation occurred and estimating the retrieval
rate against expected page size, but not for fine-grained comparison across runs or platforms. Raw track outputs
are the only source of exact counts.
Methodology Decision: no prompt change; continue logging midpoint values for ranges and note when a range returns
vs a single figure, as the distinction is itself a signal. Runs where the agent can return an exact count may
indicate a different fetch output format than runs where it can’t, as confirmed in SC-2 run 5, in which
GPT-5.3-Codex returned exact figures rather than ranges. That was the same run that produced four fetch invocations,
suggesting the additional retrieval attempts may have given the model enough payload visibility to count precisely
rather than estimate.
Output Integrity: Duplicated Response Sections
During BL-2 runs 2-3, the model duplicated sections 6 (Model's Perceived Completeness) and 7
(Tool Visibility) in its response. The same content appeared twice in sequence with no indication
that the repetition was intentional or an error. Auto-selected Claude Sonnet 4.6 and
Raptor mini (Preview) both producing duplication suggests that the behavior isn’t model-specific,
but possibly triggered by other factors like response structure, as prompt structure is nearly
identical across tracks. This complicates testing in a few ways:
- Inflated Character Counts: if the agent is also estimating character counts from its own output rather than from the raw tool response, duplicated sections silently inflate the reported figure, making truncation appear less severe than it may be
- Undetectable Without Careful Reading: the duplication doesn’t produce an error or warning; a user logging results from a quick scan could record the wrong metrics
- Ambiguous Cause: it’s unclear whether the duplication originated in the fetch_webpage tool response itself or was introduced by the model during report generation; the two failure modes have different implications for measurement reliability
Impact: treat interpreted-track character counts as approximate even when the agent reports a specific figure. Manual verification against the raw tool response is the only reliable check. Note the duplication in log entries, as it invalidates the Copilot-reported character count as a standalone measurement.
Fix: cross-reference interpreted-track reports against raw-track outputs for the same URL before treating character counts as comparable data points.
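A sketch of that cross-reference, assuming the raw-track output is saved as a text file per URL and the interpreted-track figure has already been normalized to a scalar; how to interpret the ratio is left to the analyst.

```python
# Sketch: compare an interpreted-track reported character count against the
# verified raw-track file for the same URL. A reported figure approaching or
# exceeding the raw size on a run known to be excerpted is the duplication
# signal described above. Paths and field names are placeholders.
from pathlib import Path

def retrieval_ratio(reported_chars: int, raw_path: str) -> float:
    """Reported interpreted-track chars as a fraction of the raw-track file."""
    raw_chars = len(Path(raw_path).read_text(encoding="utf-8"))
    return reported_chars / raw_chars if raw_chars else float("inf")
```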
Prompt Format Affects Output Structure
During OP-4 run 3 the numbered list was accidentally omitted from the request. The agent returned results
in a Markdown table rather than the prose sections produced by runs 1 and 2. The underlying fetch behavior
and findings were consistent with prior runs, the prompt format difference affected response structure only,
not the fetch mechanism or metric values.
This is a prompt compliance risk: if output structure varies with prompt formatting, manual result logging becomes harder to scan consistently, and fields like the last 50 characters verbatim are easier to misread in a table than in a labeled prose section. It also raises the question of whether output structure differences could mask metric differences. A table that truncates cell content, for instance, would silently drop characters that a prose response would include.
Fix: verify the numbered prompt format is intact before submitting each run. Consider adding a format
check to the framework’s generate_interpreted_prompt output so the structure remains explicit.
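A sketch of that check, assuming generate_interpreted_prompt returns the prompt as a string and that the numbered items use a "1." or "1)" style; the expected item count would come from the framework's own template.

```python
# Sketch: pre-submission check that the numbered list survived prompt assembly.
import re

def assert_numbered_format(prompt: str, expected_items: int) -> None:
    found = re.findall(r"^\s*\d+[.)]\s", prompt, flags=re.MULTILINE)
    if len(found) < expected_items:
        raise ValueError(
            f"prompt has {len(found)} numbered items, expected {expected_items}; "
            "output structure may drift (e.g. tables instead of prose sections)"
        )
```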
Truncation Taxonomy
Three structurally distinct truncation phenomena appear in the dataset. They produce similar-looking outcomes: less content than the page contains, or the agent reports no truncation when the content is incomplete or unusable, but they have different causes, different locations in the pipeline, and different implications for what the saved file and the verification script can confirm.
| Phenomenon | Retrieval complete? | Agent reports truncation? | Verification detects? |
|---|---|---|---|
| Retrieval-layer Architectural Excerpting | No, file reflects excerpted content | No, agent sees what fetch_webpage delivered | Indirectly, with truncation indicators and size vs expected |
| Complete Retrieval, Format-driven Unreadability | Yes, full bytes transferred | No, file complete, agent confirms it | No, verification script confirms integrity, not usability |
| Chat Rendering Truncation | Yes, full bytes transferred and saved | No, file complete | No, requires comparing chat output to verified file |
- fetch_webpage - Retrieval-layer Architectural Excerpting: fetch_webpage performs relevance-ranked excerpt assembly before the model receives the payload. It’s unclear whether the model ever sees the full page. The saved file reflects what fetch_webpage returned, not what the page contains. The . . . ellipsis markers in the output are the retrieval layer’s own elision indicators, not byte-boundary cutoffs. Prompting can’t suppress this behavior because it’s architectural, and possibly because the instructions reach the model after the transformation has already occurred. The agent typically reports no truncation, because from its perspective the content it received was complete; it possibly has no visibility into what fetch_webpage discarded before delivery. Additional analysis documented in Prompt Refinement Can’t Suppress Retrieval-Layer Transformation and fetch_webpage Undocumented.
- curl - Complete Retrieval, Format-driven Unreadability: When the agent substitutes curl for fetch_webpage, it retrieves the full page byte-perfectly, as confirmed by content-length matching saved file size exactly across runs. Content doesn’t appear truncated at the retrieval layer, so the agent doesn’t report truncation because the file is complete. However, the output is raw bytes in whatever format the server returned: raw HTML for most URLs, raw JSON for EC-3, raw Markdown for EC-6. The content is technically present, but not in a form that serves the test’s measurement goals. While this isn’t exactly truncation, it’s an inverse failure mode in which the verification script can confirm file integrity, but not usability. Additional analysis documented in Autonomous Tool Substitution.
- EC-6 Run 5 - Chat Rendering Truncation: With GPT-5.4, Copilot produced the only observed instance of chat rendering truncation in the dataset. The agent retrieved the full SPEC.md file byte-perfectly via curl, saved it correctly, and reported accurate metrics. However, when it printed the retrieved content verbatim in the chat UI as part of agentic over-delivery behavior, the chat output was visibly cut off, stopping partway through Category 6 with syntax-highlighted rendered Markdown. No truncation indicators were observed in the saved raw output file. The cause of the chat cutoff is unknown. The chat display stopped producing output mid-section without any signal that content was missing. Possible causes include an output generation limit, a VS Code chat UI rendering constraint, or a response timeout - none of which are distinguishable from the chat output alone. Verification relying on the chat display alone would see a document that ends mid-section with no indication that the underlying file is intact. This reinforces the methodology principle that the verification script is the authoritative measurement layer, not the chat response.
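The table above can also be applied mechanically to a logged run. A sketch, assuming the observable signals it names are available per run (the method field, saved file size vs expected page size, and whether the chat display ended early); the logic and field names are illustrative, not part of the framework.

```python
# Sketch: map a run's observable signals onto the three taxonomy rows above.
def classify_truncation(method: str, saved_chars: int, expected_chars: int,
                        chat_cut_short: bool) -> str:
    retrieval_complete = saved_chars >= expected_chars
    if retrieval_complete and chat_cut_short:
        return "chat rendering truncation"            # file intact, display incomplete
    if retrieval_complete and method == "curl":
        return "complete retrieval, format-driven unreadability"
    if method == "fetch_webpage" and not retrieval_complete:
        return "retrieval-layer architectural excerpting"
    return "unclassified"
```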
Agent Ecosystem Testing