Friction: this note describes roadblocks while refining testing methodology
Topic Guide - Raw Track
- Agentic Task Drift, Token Overflow
- File Persistence Failures
- read_url_content Redirect Halt Behavior
Agentic Task Drift, Token Overflow
Gemini 3.1’s BL-1 run began on track. It analyzed all chunks from the URL, but once it recognized that
the pipeline returns a processed response and not a raw one, it exceeded the session token limit trying to
correct that, and ultimately failed to generate an output file for verification. Gemini used curl to fetch the raw
source, received 508 KB, significantly larger than the prompt’s ~85 KB estimate, then kept exploring alternative methods
to reconcile the size discrepancy. Alongside the size mismatch, the thought panel displayed content resembling a chunk
index with empty summaries, suggesting that the absence of navigational signal also contributed to this overcorrection*.
Having exhausted tool-based solutions, the agent treated adjacent codebase artifacts as methodology documentation: a
reasonable inference in a research project, but incorrect here. The Copilot framework’s raw output files
are just that: outputs, not specifications. The sequence below is reconstructed from thought panel snapshots, but the
loop count is unknown; steps likely repeated:
| Step | Behavior | Detail |
|---|---|---|
| 1 | Successful Retrieval | Reads all 54 chunks via read_url_content, view_content_chunk; summaries empty |
| 2 | Diagnoses Correctly | Recognizes pipeline returns processed Markdown, not raw HTML; switches to curl |
| 3 | Acknowledges Size Mismatch | curl returns 508 KB; prompt estimates ~85 KB; no available tool produces expected size |
| 4 | Manual Intervention | User cancels stuck terminal commands; agent ruminates on canceled commands, chunk index |
| 5 | False Block | Claims output file already exists; the claim is incorrect |
| 6 | Attempts Re-retrieval | Displays “Searched web” without an actual search_web call; re-analyzes chunks |
| 7 | Probes curl | Tries curl with varied headers and flags, including Accept: application/json |
| 8 | Searches MCP Cache | Investigates whether read_url_content writes a local cache; searches ~/.windsurf for stored page content |
| 9 | Codebase Drift | Locates artifacts from a different testing framework: copilot-web-content-retrieval/results/raw/raw_output_EC-6_run_3.txt |
| 10 | Misreads Artifact | Reads EC-6 output as methodology guidance; attempts npx afdocs; command canceled |
| 11 | Prohibited Tool Use | Examines web_search_verify_raw_results.py despite instructions restricting its use |
| 12 | Pivots to write_to_file | Considers assembling chunks via write_to_file; weighs whether ~21,000 tokens exceeds Cascade’s limit |
| 13 | Searches System | Inspects /User/History, state.vscdb, /tmp, and Windsurf.log for cached raw content |
| 14 | Mines Log | Finds a previous response in Windsurf.log; attempts to extract the leafygreen-ui segment |
| 15 | Loses Context | Can no longer locate the original user prompt; speculates instructions were truncated |
| 16 | Exceeds Token Limit | Aborts output generation mid-run |
| 17 | Generates Report | Apologizes for CSS bloat; asks how to proceed |
Methodology Implication
The prompt’s size estimation may act as a confound in this track. If no available tool produces that size, agents with output-fidelity monitoring may spiral rather than approximate. Consider whether the size expectation belongs in the prompt at all, or only in post-hoc analysis.
*Empty summaries’ impact on pagination explored in Friction: Interpreted
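One mitigation the spiral suggests is treating the prompt’s size figure as advisory rather than as a target to hit. A minimal sketch of a non-blocking check; the function name and tolerance are illustrative assumptions, not part of the framework:

```python
def check_size(actual_kb: float, expected_kb: float, rel_tol: float = 0.5) -> str:
    """Log a size discrepancy for post-hoc analysis instead of treating it
    as a retrieval failure the agent must correct before producing output."""
    if abs(actual_kb - expected_kb) <= rel_tol * expected_kb:
        return "ok"
    return (f"size mismatch: got {actual_kb:.0f} KB, "
            f"expected ~{expected_kb:.0f} KB (proceeding)")
```

Under this policy, a 508 KB curl result against an ~85 KB estimate is flagged for the analyst but does not block output generation.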
File Persistence Failures
Agents struggled to create files and save them during BL-2 runs. The prompt explicitly required saving output to
results/raw/raw_output_BL-2.txt. Only GLM-5.1 and xAI Grok-3 wrote standalone project files to the correct path.
Gemini 3.1, SWE-1.6, and Kimi K2.6 each produced output that appeared in the chat window with a file reference,
but it wasn’t persisted as a discrete project artifact. Most runs required manual intervention to produce a verifiable
file in the face of chat-window artifact substitution, cross-agent file reuse, and silent content truncation.
SC-2 runs displayed a shift in this pattern from directory ambiguity to scale-driven abandonment: four of six agents
bypassed the Cascade pipeline entirely via curl, producing files that were either not persisted as project artifacts
or grew to sizes that degraded the development environment itself. Kimi’s output with the full llms-full.txt corpus
at 53.65 MB caused VS Code to disable tokenization, syntax highlighting, and scroll features for the file. The file existed,
but was effectively unworkable as a project artifact.
A file being present at the correct path isn’t sufficient evidence of a successful retrieval. GLM’s SC-2 output was
structured agent analysis rather than raw content; Claude Sonnet 4.6’s was a chunk index with a single header. Both passed
path verification while containing no target page content.
| Agent | BL-2 | SC-2 | Results |
|---|---|---|---|
| Gemini | Chat only | Chat only | curl output; manual copy required both runs |
| GLM | Yes | Yes | Saved; content: agent analysis, chunk index; not entirely raw retrieval |
| Grok | Yes | N/A | BL-2 only; wrote file but captured only 2 of 3 chunks |
| Kimi | Chat only | Chat only | 53.65 MB; VS Code feature degradation on open |
| Sonnet | N/A | Yes | Correct path; content: agent analysis, chunk index, chunk position 0 summary |
| SWE | No | Yes | First run failed entirely; retry used curl; saved stripped HTML only |
Methodology Implication
The prompt directs agents to save output to raw/, which doesn’t exist; cascade-raw/ does. This ambiguity is intentional:
it tests whether agents reason about directory structure or resolve path instructions literally. GLM responded correctly
by creating raw/ as a new directory. Later agents diverged; some wrote into cascade-raw/ treating it as equivalent, others
failed to persist a file at all. Cross-agent file reuse (SWE pointing to Gemini’s output) suggests that once a plausible
file exists in the workspace, some agents will satisfy the persistence requirement by reference rather than by writing. The
prompt ambiguity is retained as a test variable in subsequent runs to observe path compliance and content fidelity.
read_url_content Redirect Halt Behavior
The interpreted track and
explicit track both documented that no agent received the target content
from https://docs.anthropic.com/en/api/messages, and left open whether the cause was tool-layer URL rewriting or a server-side redirect. SC-2 runs on the raw track provide additional perspective.
Across six raw track agents, the redirect destination https://platform.claude.com/docs/llms-full.txt appeared consistently in the
error payload with enough fidelity that three agents, GLM-5.1, Kimi K2.6, and Claude Sonnet 4.6 successfully called
read_url_content a second time against the redirect target and received valid chunked responses. This pattern is inconsistent
with silent pre-network URL substitution: if the tool were rewriting before the request was made, the redirect destination wouldn’t
be actionable through a follow-up call. The more consistent explanation is that read_url_content makes the network call, receives
a server-side redirect, identifies the destination in the error response, and halts rather than following automatically. Agent
interpretation of this information diverged:
| Agent | Response |
|---|---|
| Gemini | Bypassed pipeline entirely via curl |
| GLM | Followed redirect via second read_url_content call; spent most time trying to find original target |
| Kimi | Followed redirect, then bypassed via curl for full corpus |
| Sonnet | Followed redirect, read chunk index, first position only |
| SWE (run 1) | Hybrid Arena attempt; treated redirect as terminal; attempted search_web fallback |
| SWE (run 2) | Single retry; bypassed via curl, but called it a tool malfunction |
SWE’s first run remains notable for its fallback strategy and root-cause diagnosis. It wasn’t the only agent
to use search_web: GLM called it on the explicit track as a verification attempt, also without returning any
usable results. SWE is, however, the only agent to explicitly characterize the behavior as a tool-level bug on the raw track.
That diagnosis was reasonable given the absence of HTTP status codes in the agent’s visible context, but the raw track’s
successful follow-up calls suggest the mechanism is a redirect halt, not URL rewriting. Whether the redirect halt
originates from Cascade or Anthropic’s server remains unconfirmed without HTTP-level instrumentation.