Friction Note: Roadblocks While Refining Methodology
Agentic Task Drift, Token Overflow
Gemini 3.1’s BL-1 run began on track. It analyzed all chunks from the URL, but once it recognized that
the pipeline returns a processed response and not a raw one, it exceeded the session token limit trying to
correct that, and ultimately failed to generate an output file for verification. Gemini used curl to fetch the raw
source, received 508 KB, significantly larger than the prompt’s ~85 KB estimate, then kept exploring alternative methods
to reconcile the size discrepancy. Alongside the size mismatch, the thought panel displayed content resembling a chunk
index with empty summaries, suggesting that the absence of navigational signal also contributed to this overcorrection*.
Having exhausted tool-based solutions, the agent treated adjacent codebase artifacts as methodology documentation, a
reasonable inference in a research project, but also, incorrect here. The Copilot framework’s raw output files
are just that: outputs, not specifications. The sequence below reconstructs thought panel snapshots, but the
loop count is unknown, and steps likely repeated:
| Step | Behavior | Detail |
|---|---|---|
| 1 | Successful Retrieval | Reads all 54 chunks via read_url_content, view_content_chunk; summaries empty |
| 2 | Diagnoses Correctly | Recognizes pipeline returns processed Markdown, not raw HTML; switches to curl |
| 3 | Acknowledges Size Mismatch | curl returns 508 KB; prompt ~85KB; no available toolproduces expected size |
| 4 | Manual Intervention | User cancels stuck terminal commands; agent ruminates on canceled commands, chunk index |
| 5 | False Block | Claims output file already exists, but incorrect |
| 6 | Attempts Re-retrieval |
“Searched web” without claiming search_web; re-analyzes chunks |
| 7 | Probescurl |
Tries curl with varied headers, flags includingAccept: application/json |
| 8 | Searches MCP Cache |
Investigates whether read_url_content writes a local cache; searches ~/.windsurf for stored page content |
| 9 | Codebase Drift |
Locates different testing framework artifacts; copilot-web-content-retrieval/results/raw/raw_output_EC-6_run_3.txt |
| 10 | Misreads Artifact | Reads EC-6 output as methodology guidance;attempts npx afdocs; command canceled |
| 11 | Prohibited Tool Use | Examines web_search_verify_raw_results.py despiteinstructions restricting use |
| 12 | Pivots to write_to_file |
Considers assembling chunks via write_to_file; considers if ~21,000 tokens exceeds Cascade’s limit |
| 13 | Searches System | Inspects /User/History, state.vscdb, /tmp, Windsurf.logfor cached raw content |
| 14 | Mines Log |
Finds previous response in Windsurf.log; attempts toextract leafygreen-ui segment |
| 15 | Loses Context |
Can no longer locate original user prompt; speculates instructions truncated |
| 16 | Exceeds Token Limit |
Aborts output generation mid-run |
| 17 | Generates Report | Apologizes for CSS bloat; asks how to proceed |
Methodology Implication
The prompt’s size estimation may act as a confound in this track. If no available tool produces that size, agents with output-fidelity monitoring may spiral rather than approximate. Consider whether the size expectation belongs in the prompt at all, or only in post-hoc analysis.
*Empty summaries’ impact on pagination explored in Friction: Interpreted
Context Window Reporting, Compaction Artifacts
Context window percentages are logged for every raw track run, but at least one run, SC-3 using SWE-1.6, shows the
counter appearing to reset or compress mid-session. The notes read: “context window metrics change/compress/restart.”
If the counter resets after a tool call batch or at some internal threshold, a 13% reading and a 98% reading may not be
measuring the same thing across runs.
This matters most for any effort-to-outcome analysis. EC-6’s Gemini 3.1 run at 3% context with a confirmed reused file
and SC-3’s Claude Opus 4.7 run at 98% context with a 1.05 KB stub would be striking side by side, but only if both
percentages reflect the same denominator. The compaction behavior means they may not.
results.csv retains the data and the pattern is visible in the pagination and write outcome maps without requiring
the percentages directly. A standalone effort-to-outcome visualization would require either a Windsurf update that stabilizes
context window tracking, or instrumentation that captures token spend independently of what the agent reports. Until then,
treat context window percentage as directionally suggestive rather than analytically reliable, and noted per
run rather than aggregated.
Methodology Implication
This is primarily a data visualization problem rather than a core research finding. The behavioral stories including retrieval theater, false completion claims, and write ceiling failures, are legible without it. Context window percentage would add resolution to those stories, not change them.
Cross-Agent File Reuse, Verification Limits
The verification script defines the raw track. If an agent claims to have retrieved and analyzed content, this script intends to check path compliance, file size, checksum, and truncation indicators against what’s actually on disk, but this only works if agents write files.
While agents never directly admit it, three of five BL-3 runs reference an existing file rather than writing a new one.
Once a somewhat-plausible file exists at a similar path, if it’s in the prompt-specified directory with the
prompt-specified name doesn’t seem to matter, subsequent agents satisfy the persistence requirement with chat paths described
as newly generated files, but point to artifacts of earlier runs. The script then verifies an earlier agent’s file, not the
current agent’s retrieval. The agent can then claim another agent’s calculations as their own, draining their own analysis of
meaning. But when agents do write raw output files, they tend to produce content that passes path and size verification while
containing no semantically valuable text. While the script can confirm a file exists and is structurally intact, it can’t confirm
that the file accurately represents the agent’s retrieval behavior in that run.
Is there any value in agent metrics or self-reported methodology if it’s not based on genuine calculations and analysis?
This consistent failure to persist raw output files is unique to Cascade, possibly due to the Hybrid Arena setting,
which allows for five agents to run sequentially and/or simultaneously. While Cascade claims session isolation, it’s less
plausible with each test run. The lack of output files reframes what this track is testing. Cascade’s chunking pipeline
processes the response before the agents sees it without a direct path to raw HTML. Agents often recognize this and use over-half
of their context window exploring alternatives, use curl, which then only returns a Gatsby and/or React skeleton rather than
any tutorial text. BL-3 functions less as a retrieval benchmark and more as negative testing: presenting a tool with mismatched
inputs and observing what agents do when success is structurally unavailable. This behavioral data in which agents disclose
limitations, possibly fabricate completion, and silently reuse existing files is the finding, not the raw output files or metrics.
EC-6 provides the sharpest confirmation of cross-agent file reuse in the dataset. Gemini 3.1 and GLM-5.1 produced output
files with an identical MD5 checksum and a spotless content diff, not similar assembly, but the same file. Gemini used only 3%
of its context window, invoked approximately 12 terminal commands, and had a thought panel that narrated chunk-by-chunk
retrieval while showing no corresponding tool calls. GLM ran earlier in the same arena session and wrote the file first via
curl bypass. Gemini likely located the existing file in the workspace, referenced it as its own output, and performed retrieval
theater rather than disclosing what it had found.
Methodology Implication
The verification script checks path compliance, file size, checksum, and truncation indicators, but it runs after the
arena completes and compares against a single expected file. It can’t distinguish a file an agent wrote from a file an
agent found. Per-agent checksums are already logged to results.csv; cross-agent comparison within the same arena run
is the missing step. If two agents produce identical checksums on the same test, at least one didn’t perform
independent retrieval; a check that currently requires manual post-hoc diffing rather than automated flagging.
This closes the lazy reuse case. An agent pointing to or copying an existing file without modification, but not the fabrication case, where an agent copies a file, computes its hash, and reports the result as its own. That pattern produces a different checksum from the source file and is indistinguishable from genuine retrieval through script-based verification alone. Detecting it may require observer-side tooling the agent can’t reach: filesystem timestamps recorded between arena slots, or version-controlled workspace state that captures file creation order independently of agent self-report.
File Persistence Failures
Agents struggled to create files and save them during BL-2 runs. The prompt explicitly required saving output to
results/raw/raw_output_BL-2.txt. Only GLM-5.1 and xAI Grok-3 wrote standalone project files to the correct path.
Gemini 3.1, SWE-1.6, and Kimi K2.6 each produced output that appeared in the chat window with a file reference,
but it wasn’t persisted as a discrete project artifact. Most runs required manual intervention to product a verifiable
file in the face of chat-window artifact substituion, cross-agent file reuse, and silent content truncation.
SC-2 runs displayed a shift in this pattern from directory ambiguity to scale-driven abandonment: four of six agents
bypassed the Cascade pipeline entirely via curl, producing files that were either not persisted as project artifacts
or grew to sizes that degraded the development environment itself. Kimi’s output with the full llms-full.txt corpus
at 53.65 MB caused VS Code to disable tokenization, syntax highlighting, and scroll features for the file. The file existed,
but was effectively unworkable as a project artifact.
A file being present at the correct path isn’t sufficient evidence of a successful retrieval. GLM’s SC-2’s output included
structured agent analysis rather than raw content; Claude Sonnet 4.6’s was a chunk index with a single header. Both passed
path verification while containing no target page content.
| Agent | BL-2 |
SC-2 |
Results |
|---|---|---|---|
Gemini |
Chat only | Chat only | curl output; manual copy required both runs |
GLM |
Yes | Yes | Saved; content: agent analysis, chunk index, not entirely raw retrieval |
Grok |
Yes | N/A | BL-2 only; wrote file, only captured2 of 3 chunks |
Kimi |
Chat only | Chat only | 53.65 MB; VS Code feature degradation on open |
Sonnet |
N/A | Yes | Correct path; content: agent analysis, chunk index, chunk position 0 summary |
SWE |
No | Yes | First run failed entirely; retry used curl;saved stipped HTML only |
Agents analyzed most of OP-4’s chunks but seemed unable to concatenate and save them. Kimi constantly hit context
deadline errors which ended with model provider unreachable and no report at all. Grok explicitly cited tool
“response format limitations,” produced a one-line placeholder text file, and asked for an alternative method to handle large
text data. GPT-5.4, Minimax M2.5 and Sonnet created raw output files, but were mostly CSS and navigation boilerplate, not
tutorial text; passing path verification while missing the target content.
Methodology Implication
The prompt directs agents to save output to raw/, which does’t exist; cascade-raw/ does. This ambiguity is intentional:
it tests whether agents reason about directory structure or resolve path instructions literally. GLM responded correctly
by creating raw/ as a new directory. Later agents diverged; some wrote into cascade-raw/ treating it as equivalent, others
failed to persist a file at all. Cross-agent file reuse, SWE pointing to Gemini’s output, suggests that once a plausible
file exists in the workspace, some agents will satisfy the persistence requirement by reference rather than by writing.
Retain the prompt ambiguity as a test variable for subsequent runs for observing path compliance and content fidelity.
SC-3file persistence failures discussed in Write Ceiling, Output Fidelity
read_url_content Redirect Halt Behavior
The interpreted track and
explicit track both documented that no agent received the target content
from https://docs.anthropic.com/en/api/messages, and left open whether the cause was tool-layer URL rewriting or a server-side redirect.
SC-2 runs on the raw track provide additional perspective.
Across six raw track agents, the redirect destination https://platform.claude.com/docs/llms-full.txt appeared consistently in the
error payload with enough fidelity that three agents, GLM-5.1, Kimi K2.6, and Claude Sonnet 4.6 successfully called
read_url_content a second time against the redirect target and received valid chunked responses. This pattern is inconsistent
with silent pre-network URL substitution: if the tool were rewriting before the request formed, the redirect destination wouldn’t
be actionable through a follow-up call. The more consistent explanation is that read_url_content makes the network call, receives
a server-side redirect, identifies the destination in the error response, and halts rather than following automatically. Agent
interpretation of this information diverged:
| Agent | Response |
|---|---|
Gemini |
Bypassed pipeline entirely via curl |
GLM |
Followed redirect via second read_url_content call;spent most time trying to find original target |
Kimi |
Followed redirect, then bypassed via curl for full corpus |
Sonnet |
Followed redirect, read chunk index, first position only |
SWE |
Hybrid Arena attempt; treated as terminal; attempted search_web fallback |
SWE |
Single retry; Bypassed via curl, but called it tool malfunction |
SWE’s first run remains notable for its fallback strategy and root cause diagnosis. It wasn’t the only agent
to use search_web. GLM called it on the explicit track as a verification attempt, but it also didn’t return any
usable results. SWE is the only agent to explicitly characterize the behavior as a tool-level bug on the raw track.
That diagnosis was reasonable given the absence of HTTP status codes in the agent’s visible context, but the raw track’s
successful follow-up calls suggest the mechanism is a redirect halt, not URL rewriting. Whether the redirect halt
originates from Cascade or Anthropic’s server remains unconfirmed without HTTP-level instrumentation.
URL Fragment Targeting
OP-1 tests how agents handle URL fragments and whether they navigate to a target, the #History section of a Wikipedia page.
SWE-1.5 successfully isolated the target section on the interpreted track,
suggesting fragment-targeting is behavioral rather than architectural. In the raw track’s first arena OP-1 run, all agents defaulted
to full-document retrieval and didn’t acknowledge the target section:
| Agent | Chunks Analyzed |
Context Window |
Fragment Targeted? |
File Created? |
|---|---|---|---|---|
Gemini |
54 | ~1% 15K/1M |
No | No |
GLM |
92 | ~61% 123K/200K |
No | Yes, chunk index, HTML shell only |
GPT |
~10 | ~9% 38K/400K |
No | No, terminal error |
Opus |
92 | ~17% 173K/1M |
No | No, terminal error |
SWE |
92 | ~28% 57K/200K |
No | No, chat headings only |
In the second arena run, two agents produces output files with the targeted section:
| Agent | Chunks Analyzed |
Context Window |
Fragment Targeted? |
File Created? |
|---|---|---|---|---|
GPT |
77 | ~78% 213K/272K |
No | Yes, chunk index, metadata only |
Grok |
2 | ~17% 22K/131K |
Yes | Yes, extracted#History content |
Kimi |
curlbypass |
~9% 23K/262K |
No | No - claimed, not persisted |
Minimax |
~5 | ~13% 27K/205K |
Incidentally | Yes, includes#History content |
Sonnet |
92 | ~56% auto-opted out |
No | No, handed off mid-run |
Across both runs, secondary failures diverged significantly by agent. GLM-5.1 spent over an hour in a batch-append loop, attempting
to write chunk content to the output file in segments; none of those six rewrites persisted, but remained as separate, scattered metadata.
Gemini 3.1 completed without user permission, and when facing uncertainty about whether “exactly as received” referred to chunked
pipeline output or raw HTML, began exploring curl as an alternative, the same spiral documented in
Agentic Task Drift, Token Overflow. SWE-1.6 read all of the chunks, but closed the run in chat
with a list of headings, ignoring most of the prompt. GPT-5.3-Codex read approximately ten chunks before a terminal error halted execution.
Claude Opus 4.7 retrieved all of the chunks and attempted to write concatenated output in multiple steps, but the terminal command errored before
any file persisted. Claude Sonnet 4.6 analyzed and re-analyzed all 92 chunks in parallel batches and remained in a refetching loop to edit
their output file, consuming most of its context window before auto-opting out. GPT-5.4 paused mid-run to ask whether the user wanted Cascade
pipeline output or a direct raw fetch, then successfully created a file after spending most of its context window on append errors, but the file
contained chunk metadata rather than meaningful prose. Kimi K2.6 started without user permission, bypassed the Cascade pipeline with curl,
claimed to have created a raw output file but didn’t, and constructed retroactive consent from tool output rather than pausing for explicit
permission:
Actually, I should note that the read_url_content tool description says "The actual fetch will NOT
execute until the user approves it." But it seems to have already executed and returned chunk
metadata. So maybe the user already approved it?
Minimax M2.5 and xAI Grok-3 also started without user permission, but Grok was the only agent across both rounds to have produced an
intentionally targeted output. Minimax sampled chunks that just happened to include the target section.
Methodology Implication
While the chunk index offers navigational structure, agents don’t consult it for fragment resolution by default. The prompt’s request to
return content exactly as received may work against fragment-targeting, influencing agents to priortize output fidelity over a smaller
retrieval scope. The 8-of-10 miss rate suggests the behavior is uncommon, but not rare enough to treat as a fluke. Minimax’s incidental
hit is a separate finding: small-chunk sampling can accidentally recover the target section, which may inflate success metrics if output
content passes other verification without examining the agent’s navigational reasoning.
Write Ceiling, Output Fidelity
Across SC-3 runs retrieval behavior was uniform. Agents analyzed all 60 chunks but struggled to report what they read
with a raw output file. The prompt instructs “retrieve the content from this URL and return it EXACTLY as you received it” -
seemingly creating a failure mode that no agent named or resolved cleanly. Each agent retrieved chunk content into context,
then encountered on the write side: shell heredocs with special characters, \n escape sequences, and chunk metadata at a
volume that caused terminal commands to hang, Python scripts to loop, and file writes to produce partial or empty output:
| Agent | Strategy | Outcome |
|---|---|---|
Gemini |
Wandered through project files, tried npx, curl, Python |
3 different artifacts, 276 KB partial HTML/JSON |
GLM |
Heredoc failure, switched to curl |
774 KB raw HTML, not Cascade output |
GPT |
Claimed curl bypass;file never saved |
Metrics reported, referenced without verification path |
Opus |
Sequential heredoc appends in 6 batches |
1.05 KB stub; wrong directory + filename, user canceled x3 |
SWE |
Sequential heredoc appends, Python batching |
Partial file; user canceled x3 |
Agents acknowledged how tedious writing raw output would be, but not one claimed a write ceiling. Instead, each entered a
loop of strategy-switching, treating the failure as a solvable engineering problem rather than a constraint of the environment.
SWE-1.6 and Claude Opus 4.7 both identified token-cost awareness mid-task,
“this verbatim approach requires piping ~200–400 KB of raw text through shell commands, which is very token-expensive”,
but neither made an early exit. They identified symptoms without diagnosing the condition. Agents that produced files did so
by abandoning the Cascade pipeline. GLM-5.1 and Gemini 3.1 saved raw HTML via curl, essentially writing content the prompt
didn’t request; the verification script can’t meaningfully evaluate non-Cascade-specific behavior. GPT-5.3-Codex claimed to save
a file and didn’t. SWE and Opus produced stubs too small to verify.
Agents ruminated on the prompt language “EXACTLY as you received it” - does that mean Cascade’s chunk index with metadata
wrappers, already processed, or something pre-processed? Opus attempted to clarify mid-task and asked questions while others choose
interpretation and process-spin, similar to the silent-resolution pattern described in
Agent as Unreliable Methodology Validator.
As agents seemingly hit a write ceiling, they didn’t reason a way out, but drifted. Gemini read the README.md, the verification
script multiple times, and re-read the prompt, apparently trying to re-derive the task from project context. SWE reasoned through
its own chunk history to reconstruct a write strategy. Opus asked a clarifying question, received an answer, then got stuck again
on the same heredoc problem. Gemini tried npx. None of these strategies addressed the actual constraint. Like that described in
Agentic Inaction, Agentic Task Drift, Token Overflow,
and File Persistence Failures: agents that recognize an obstacle express uncertainty, attempt adjacent
actions, and produced confident-looking output without naming the obstacle as a blocker or asking whether the task is achievable as stated.
Methodology Implication
The read-write asymmetry is a restriction only the raw track could uncover. It reflects a structural mismatch between what Cascade’s
view_content_chunk produces at scale and what shell tooling can write back out. A writing ceiling introduces new layers to question
path compliance and self-reporting fidelity. While agents claim to have received all 60 chunks, chat output alone can’t complete
verification. Prompts specifying a target format may produce different results, but this testing framework is about capturing default
web fetch behavior. Eventhough the output variety make for difficult hypotheses assessment and pokes holes in the verification script
metrics, the variety is the finding, speaking to the challenges of combining qualitative and quantitative testing approaches.
Agent Ecosystem Testing