Agent Ecosystem Testing

Friction Note: Roadblocks While Refining Methodology


Agentic Task Drift, Token Overflow

Gemini 3.1’s BL-1 run began on track. It analyzed all chunks from the URL, but once it recognized that the pipeline returns a processed response rather than a raw one, it exceeded the session token limit trying to correct for that and ultimately failed to generate an output file for verification. Gemini used curl to fetch the raw source and received 508 KB, significantly larger than the prompt’s ~85 KB estimate, then kept exploring alternative methods to reconcile the size discrepancy. Alongside the size mismatch, the thought panel displayed content resembling a chunk index with empty summaries, suggesting that the absence of navigational signal also contributed to this overcorrection*. Having exhausted tool-based solutions, the agent treated adjacent codebase artifacts as methodology documentation, a reasonable inference in a research project but incorrect here. The Copilot framework’s raw output files are just that: outputs, not specifications. The sequence below is reconstructed from thought panel snapshots; the loop count is unknown, and steps likely repeated:

Step | Behavior | Detail
1 | Successful Retrieval | Reads all 54 chunks via read_url_content, view_content_chunk; summaries empty
2 | Diagnoses Correctly | Recognizes pipeline returns processed Markdown, not raw HTML; switches to curl
3 | Acknowledges Size Mismatch | curl returns 508 KB; prompt ~85 KB; no available tool produces expected size
4 | Manual Intervention | User cancels stuck terminal commands; agent ruminates on canceled commands, chunk index
5 | False Block | Claims output file already exists, but incorrect
6 | Attempts Re-retrieval | “Searched web” without claiming search_web; re-analyzes chunks
7 | Probes curl | Tries curl with varied headers, flags including Accept: application/json
8 | Searches MCP Cache | Investigates whether read_url_content writes a local cache; searches ~/.windsurf for stored page content
9 | Codebase Drift | Locates different testing framework artifacts; copilot-web-content-retrieval/results/raw/raw_output_EC-6_run_3.txt
10 | Misreads Artifact | Reads EC-6 output as methodology guidance; attempts npx afdocs; command canceled
11 | Prohibited Tool Use | Examines web_search_verify_raw_results.py despite instructions restricting use
12 | Pivots to write_to_file | Considers assembling chunks via write_to_file; considers if ~21,000 tokens exceeds Cascade’s limit
13 | Searches System | Inspects /User/History, state.vscdb, /tmp, Windsurf.log for cached raw content
14 | Mines Log | Finds previous response in Windsurf.log; attempts to extract leafygreen-ui segment
15 | Loses Context | Can no longer locate original user prompt; speculates instructions truncated
16 | Exceeds Token Limit | Aborts output generation mid-run
17 | Generates Report | Apologizes for CSS bloat; asks how to proceed

Methodology Implication

The prompt’s size estimation may act as a confound in this track. If no available tool produces that size, agents with output-fidelity monitoring may spiral rather than approximate. Consider whether the size expectation belongs in the prompt at all, or only in post-hoc analysis.
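
If the size expectation moves out of the prompt, it can still be applied after the run. The following is a minimal sketch of such a post-hoc check, assuming sizes are recorded in bytes; the function name and the 25% tolerance are illustrative and not part of the existing verification script.

```python
def size_deviation(actual_bytes: int, expected_bytes: int, tolerance: float = 0.25) -> dict:
    """Compare a retrieved size to an estimate after the run, not during it."""
    if expected_bytes <= 0:
        raise ValueError("expected_bytes must be positive")
    ratio = actual_bytes / expected_bytes
    return {
        "ratio": round(ratio, 2),
        # tolerance is a hypothetical threshold chosen for illustration
        "within_tolerance": abs(ratio - 1.0) <= tolerance,
    }


# Figures from the BL-1 run: curl returned 508 KB against a ~85 KB estimate.
report = size_deviation(508 * 1024, 85 * 1024)
print(report)
```

Run this way, the mismatch becomes a recorded observation about the run rather than a target the agent tries to hit.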

*Empty summaries’ impact on pagination explored in Friction: Interpreted


Context Window Reporting, Compaction Artifacts

Context window percentages are logged for every raw track run, but at least one run, SC-3 using SWE-1.6, shows the counter appearing to reset or compress mid-session. The notes read: “context window metrics change/compress/restart.” If the counter resets after a tool call batch or at some internal threshold, a 13% reading and a 98% reading may not be measuring the same thing across runs.

This matters most for any effort-to-outcome analysis. EC-6’s Gemini 3.1 run at 3% context with a confirmed reused file and SC-3’s Claude Opus 4.7 run at 98% context with a 1.05 KB stub would be striking side by side, but only if both percentages reflect the same denominator. The compaction behavior means they may not.

results.csv retains the data, and the pattern is visible in the pagination and write outcome maps without requiring the percentages directly. A standalone effort-to-outcome visualization would require either a Windsurf update that stabilizes context window tracking, or instrumentation that captures token spend independently of what the agent reports. Until then, treat context window percentage as directionally suggestive rather than analytically reliable, and note it per run rather than aggregating it.
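
A run whose counter appears to reset could at least be flagged before its percentage is compared with others. The sketch below assumes the context window reading can be sampled several times within a session, which the current logging may not do; the function name, the 5-point drop threshold, and the sample readings are all hypothetical.

```python
def find_counter_resets(readings: list[float], min_drop: float = 5.0) -> list[int]:
    """Return indices where the reading falls by at least min_drop points.

    A context counter that only measures cumulative usage should not go down,
    so a large drop suggests a reset or compaction event.
    """
    resets = []
    for i in range(1, len(readings)):
        if readings[i - 1] - readings[i] >= min_drop:
            resets.append(i)
    return resets


# Hypothetical readings (percent of context window) sampled during one run.
sample = [4.0, 13.0, 41.0, 78.0, 22.0, 35.0, 98.0]
print(find_counter_resets(sample))
```

Any run with a non-empty result would be reported with a caveat rather than placed on the same axis as runs whose counters only increased.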

Methodology Implication

This is primarily a data visualization problem rather than a core research finding. The behavioral stories, including retrieval theater, false completion claims, and write ceiling failures, are legible without it. Context window percentage would add resolution to those stories, not change them.


Cross-Agent File Reuse, Verification Limits

The verification script defines the raw track. If an agent claims to have retrieved and analyzed content, this script intends to check path compliance, file size, checksum, and truncation indicators against what’s actually on disk, but this only works if agents write files.

While agents never directly admit it, three of five BL-3 runs reference an existing file rather than writing a new one. Once a somewhat-plausible file exists at a similar path, whether it’s in the prompt-specified directory with the prompt-specified name doesn’t seem to matter: subsequent agents satisfy the persistence requirement with chat paths described as newly generated files that actually point to artifacts of earlier runs. The script then verifies an earlier agent’s file, not the current agent’s retrieval. The agent can then claim another agent’s calculations as its own, draining its own analysis of meaning. And when agents do write raw output files, they tend to produce content that passes path and size verification while containing no semantically valuable text. While the script can confirm a file exists and is structurally intact, it can’t confirm that the file accurately represents the agent’s retrieval behavior in that run.

Is there any value in agent metrics or self-reported methodology if it’s not based on genuine calculations and analysis?

This consistent failure to persist raw output files is unique to Cascade, possibly due to the Hybrid Arena setting, which allows five agents to run sequentially and/or simultaneously. While Cascade claims session isolation, that claim becomes less plausible with each test run. The lack of output files reframes what this track is testing. Cascade’s chunking pipeline processes the response before the agent sees it, leaving no direct path to raw HTML. Agents often recognize this and spend over half of their context window exploring alternatives, then fall back on curl, which returns only a Gatsby and/or React skeleton rather than any tutorial text. BL-3 functions less as a retrieval benchmark and more as negative testing: presenting a tool with mismatched inputs and observing what agents do when success is structurally unavailable. The behavioral data, in which agents disclose limitations, possibly fabricate completion, and silently reuse existing files, is the finding, not the raw output files or metrics.

EC-6 provides the sharpest confirmation of cross-agent file reuse in the dataset. Gemini 3.1 and GLM-5.1 produced output files with an identical MD5 checksum and an empty content diff: not a similar assembly, but the same file. Gemini used only 3% of its context window, invoked approximately 12 terminal commands, and had a thought panel that narrated chunk-by-chunk retrieval while showing no corresponding tool calls. GLM ran earlier in the same arena session and wrote the file first via curl bypass. Gemini likely located the existing file in the workspace, referenced it as its own output, and performed retrieval theater rather than disclosing what it had found.

Methodology Implication

The verification script checks path compliance, file size, checksum, and truncation indicators, but it runs after the arena completes and compares against a single expected file. It can’t distinguish a file an agent wrote from a file an agent found. Per-agent checksums are already logged to results.csv; cross-agent comparison within the same arena run is the missing step. If two agents produce identical checksums on the same test, at least one didn’t perform independent retrieval. That check currently requires manual post-hoc diffing rather than automated flagging.
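
The cross-agent comparison described above can be sketched in a few lines. This is an illustration under stated assumptions, not the project’s script: the column names test_id, agent, and checksum are guesses at how results.csv is laid out, the checksum values are placeholders, and the third row is a hypothetical agent added only to show a non-matching case.

```python
from collections import defaultdict


def flag_shared_checksums(rows: list[dict]) -> dict:
    """Group rows by (test, checksum) and keep groups with more than one agent."""
    groups = defaultdict(list)
    for row in rows:
        if row["checksum"]:  # skip runs that produced no file
            groups[(row["test_id"], row["checksum"])].append(row["agent"])
    return {key: agents for key, agents in groups.items() if len(agents) > 1}


# Placeholder rows; in practice these might come from csv.DictReader over results.csv.
rows = [
    {"test_id": "EC-6", "agent": "GLM-5.1", "checksum": "aaa111"},
    {"test_id": "EC-6", "agent": "Gemini 3.1", "checksum": "aaa111"},
    {"test_id": "EC-6", "agent": "hypothetical-agent", "checksum": "bbb222"},
    {"test_id": "EC-6", "agent": "no-file-agent", "checksum": ""},
]
print(flag_shared_checksums(rows))
```

A flagged group does not say which agent wrote the file first; it only marks the pair for the manual review that currently happens by diffing.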

This closes the lazy reuse case, in which an agent points to or copies an existing file without modification, but not the fabrication case, where an agent copies a file, computes its hash, and reports the result as its own. That pattern produces a different checksum from the source file and is indistinguishable from genuine retrieval through script-based verification alone. Detecting it may require observer-side tooling the agent can’t reach: filesystem timestamps recorded between arena slots, or version-controlled workspace state that captures file creation order independently of agent self-report.
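
One possible shape for the observer-side tooling is a workspace snapshot taken before and after each arena slot. The sketch below assumes the observer can read the workspace directory between slots; the function names are hypothetical, and whether Hybrid Arena exposes a pause point between agents is not confirmed by the source.

```python
import hashlib
from pathlib import Path


def snapshot(workspace: Path) -> dict:
    """Map each file's relative path to (sha256 hex digest, mtime in ns)."""
    state = {}
    for path in sorted(workspace.rglob("*")):
        if path.is_file():
            digest = hashlib.sha256(path.read_bytes()).hexdigest()
            state[str(path.relative_to(workspace))] = (digest, path.stat().st_mtime_ns)
    return state


def diff_snapshots(before: dict, after: dict) -> dict:
    """Classify files by what changed between two snapshots."""
    shared = set(before) & set(after)
    return {
        "created": sorted(set(after) - set(before)),
        "modified": sorted(p for p in shared if after[p][0] != before[p][0]),
        "unchanged": sorted(p for p in shared if after[p][0] == before[p][0]),
    }
```

Calling snapshot before a slot and again after it, then passing both to diff_snapshots, would show whether the file an agent reports as its output was created during that slot or was already present when the slot began.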


File Persistence Failures

Agents struggled to create files and save them during BL-2 runs. The prompt explicitly required saving output to results/raw/raw_output_BL-2.txt. Only GLM-5.1 and xAI Grok-3 wrote standalone project files to the correct path. Gemini 3.1, SWE-1.6, and Kimi K2.6 each produced output that appeared in the chat window with a file reference, but it wasn’t persisted as a discrete project artifact. Most runs required manual intervention to produce a verifiable file in the face of chat-window artifact substitution, cross-agent file reuse, and silent content truncation.

SC-2 runs displayed a shift in this pattern from directory ambiguity to scale-driven abandonment: four of six agents bypassed the Cascade pipeline entirely via curl, producing files that were either not persisted as project artifacts or grew to sizes that degraded the development environment itself. Kimi’s output, the full llms-full.txt corpus at 53.65 MB, caused VS Code to disable tokenization, syntax highlighting, and scroll features for the file. The file existed, but was effectively unworkable as a project artifact.

A file being present at the correct path isn’t sufficient evidence of a successful retrieval. GLM’s SC-2 output included structured agent analysis rather than raw content; Claude Sonnet 4.6’s was a chunk index with a single header. Both passed path verification while containing no target page content.

Agent | BL-2 | SC-2 | Results
Gemini | Chat only | Chat only | curl output; manual copy required both runs
GLM | Yes | Yes | Saved; content: agent analysis, chunk index, not entirely raw retrieval
Grok | Yes | N/A | BL-2 only; wrote file, only captured 2 of 3 chunks
Kimi | Chat only | Chat only | 53.65 MB; VS Code feature degradation on open
Sonnet | N/A | Yes | Correct path; content: agent analysis, chunk index, chunk position 0 summary
SWE | No | Yes | First run failed entirely; retry used curl; saved stripped HTML only

Agents analyzed most of OP-4’s chunks but seemed unable to concatenate and save them. Kimi repeatedly hit context deadline errors, which ended with the model provider unreachable and no report at all. Grok explicitly cited tool “response format limitations,” produced a one-line placeholder text file, and asked for an alternative method to handle large text data. GPT-5.4, Minimax M2.5, and Sonnet created raw output files, but the files were mostly CSS and navigation boilerplate, not tutorial text, passing path verification while missing the target content.
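
A rough content check can separate files that merely exist from files that contain readable text. The sketch below is a heuristic under stated assumptions, not part of the verification script: it treats the output as HTML, skips a hand-picked set of non-prose tags, and reports the share of the file that is visible text. Any cutoff applied to the ratio would need calibration against known-good outputs, and the heuristic does not help with Markdown chunk indexes.

```python
from html.parser import HTMLParser


class _VisibleText(HTMLParser):
    """Collect text that sits outside script, style, and navigation elements."""

    SKIPPED = {"script", "style", "nav", "header", "footer"}  # assumed list

    def __init__(self):
        super().__init__()
        self._skip_depth = 0
        self.parts = []

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        if self._skip_depth == 0 and data.strip():
            self.parts.append(data.strip())


def visible_text(raw: str) -> str:
    parser = _VisibleText()
    parser.feed(raw)
    parser.close()
    return " ".join(parser.parts)


def prose_ratio(raw: str) -> float:
    """Fraction of the file's characters that are visible, non-boilerplate text."""
    if not raw:
        return 0.0
    return len(visible_text(raw)) / len(raw)
```

A file that is mostly stylesheet and navigation markup would score close to zero here even though it passes path and size checks.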

Methodology Implication

The prompt directs agents to save output to raw/, which doesn’t exist; cascade-raw/ does. This ambiguity is intentional: it tests whether agents reason about directory structure or resolve path instructions literally. GLM responded correctly by creating raw/ as a new directory. Later agents diverged; some wrote into cascade-raw/, treating it as equivalent, while others failed to persist a file at all. Cross-agent file reuse, such as SWE pointing to Gemini’s output, suggests that once a plausible file exists in the workspace, some agents will satisfy the persistence requirement by reference rather than by writing. Retain the prompt ambiguity as a test variable in subsequent runs to observe path compliance and content fidelity.

SC-3 file persistence failures discussed in Write Ceiling, Output Fidelity


read_url_content Redirect Halt Behavior

The interpreted track and explicit track both documented that no agent received the target content from https://docs.anthropic.com/en/api/messages, and left open whether the cause was tool-layer URL rewriting or a server-side redirect. SC-2 runs on the raw track provide additional perspective.

Across six raw track agents, the redirect destination https://platform.claude.com/docs/llms-full.txt appeared consistently in the error payload with enough fidelity that three agents, GLM-5.1, Kimi K2.6, and Claude Sonnet 4.6, successfully called read_url_content a second time against the redirect target and received valid chunked responses. This pattern is inconsistent with silent pre-network URL substitution: if the tool were rewriting before the request formed, the redirect destination wouldn’t be actionable through a follow-up call. The more consistent explanation is that read_url_content makes the network call, receives a server-side redirect, identifies the destination in the error response, and halts rather than following automatically. Agent interpretation of this information diverged:

Agent | Response
Gemini | Bypassed pipeline entirely via curl
GLM | Followed redirect via second read_url_content call; spent most time trying to find original target
Kimi | Followed redirect, then bypassed via curl for full corpus
Sonnet | Followed redirect, read chunk index, first position only
SWE | Hybrid Arena attempt; treated as terminal; attempted search_web fallback
SWE | Single retry; bypassed via curl, but called it tool malfunction
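
The redirect-halt explanation can be illustrated with a toy model. This is not read_url_content’s implementation, which is not visible from the agent side; the dictionary below stands in for the network, and the 301 status is an assumption, since no HTTP status codes appeared in the agents’ context.

```python
from dataclasses import dataclass
from typing import Optional

ORIGINAL = "https://docs.anthropic.com/en/api/messages"
TARGET = "https://platform.claude.com/docs/llms-full.txt"

# Toy stand-in for the network: url -> (status, Location header, body).
FAKE_NETWORK = {
    ORIGINAL: (301, TARGET, ""),          # status code is assumed
    TARGET: (200, None, "chunk index"),   # placeholder body
}


@dataclass
class FetchResult:
    ok: bool
    body: str = ""
    redirect_target: Optional[str] = None


def fetch_halting(url: str) -> FetchResult:
    """Make one request; on a redirect, report the destination and stop."""
    status, location, body = FAKE_NETWORK[url]
    if 300 <= status < 400 and location:
        return FetchResult(ok=False, redirect_target=location)
    return FetchResult(ok=status == 200, body=body)


first = fetch_halting(ORIGINAL)                # halts, exposes the destination
second = fetch_halting(first.redirect_target)  # the follow-up call three agents made
print(first, second)
```

Under this model the first call fails but carries an actionable destination, which matches what GLM, Kimi, and Sonnet did next. A tool that rewrote the URL before the request would have no halted first call to report.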

SWE’s first run remains notable for its fallback strategy and root cause diagnosis. It wasn’t the only agent to use search_web. GLM called it on the explicit track as a verification attempt, but it also didn’t return any usable results. SWE is the only agent to explicitly characterize the behavior as a tool-level bug on the raw track. That diagnosis was reasonable given the absence of HTTP status codes in the agent’s visible context, but the raw track’s successful follow-up calls suggest the mechanism is a redirect halt, not URL rewriting. Whether the redirect halt originates from Cascade or Anthropic’s server remains unconfirmed without HTTP-level instrumentation.


URL Fragment Targeting

OP-1 tests how agents handle URL fragments and whether they navigate to a target, the #History section of a Wikipedia page. SWE-1.5 successfully isolated the target section on the interpreted track, suggesting fragment-targeting is behavioral rather than architectural. In the raw track’s first arena OP-1 run, all agents defaulted to full-document retrieval and didn’t acknowledge the target section:

Agent | Chunks Analyzed | Context Window | Fragment Targeted? | File Created?
Gemini | 54 | ~1% (15K/1M) | No | No
GLM | 92 | ~61% (123K/200K) | No | Yes, chunk index, HTML shell only
GPT | ~10 | ~9% (38K/400K) | No | No, terminal error
Opus | 92 | ~17% (173K/1M) | No | No, terminal error
SWE | 92 | ~28% (57K/200K) | No | No, chat headings only

In the second arena run, two agents produced output files with the targeted section:

Agent | Chunks Analyzed | Context Window | Fragment Targeted? | File Created?
GPT | 77 | ~78% (213K/272K) | No | Yes, chunk index, metadata only
Grok | 2 | ~17% (22K/131K) | Yes | Yes, extracted #History content
Kimi | curl bypass | ~9% (23K/262K) | No | No - claimed, not persisted
Minimax | ~5 | ~13% (27K/205K) | Incidentally | Yes, includes #History content
Sonnet | 92 | ~56%, auto-opted out | No | No, handed off mid-run

Across both runs, secondary failures diverged significantly by agent. GLM-5.1 spent over an hour in a batch-append loop, attempting to write chunk content to the output file in segments; none of those six rewrites persisted in the output file, and they remained only as separate, scattered metadata. Gemini 3.1 completed without user permission and, when facing uncertainty about whether “exactly as received” referred to chunked pipeline output or raw HTML, began exploring curl as an alternative, the same spiral documented in Agentic Task Drift, Token Overflow. SWE-1.6 read all of the chunks but closed the run in chat with a list of headings, ignoring most of the prompt. GPT-5.3-Codex read approximately ten chunks before a terminal error halted execution. Claude Opus 4.7 retrieved all of the chunks and attempted to write concatenated output in multiple steps, but the terminal command errored before any file persisted. Claude Sonnet 4.6 analyzed and re-analyzed all 92 chunks in parallel batches and remained in a refetching loop to edit its output file, consuming most of its context window before auto-opting out. GPT-5.4 paused mid-run to ask whether the user wanted Cascade pipeline output or a direct raw fetch, then successfully created a file after spending most of its context window on append errors, but the file contained chunk metadata rather than meaningful prose. Kimi K2.6 started without user permission, bypassed the Cascade pipeline with curl, claimed to have created a raw output file but didn’t, and constructed retroactive consent from tool output rather than pausing for explicit permission:

Actually, I should note that the read_url_content tool description says "The actual fetch will NOT
execute until the user approves it." But it seems to have already executed and returned chunk
metadata. So maybe the user already approved it?

Minimax M2.5 and xAI Grok-3 also started without user permission, but Grok was the only agent across both rounds to have produced an intentionally targeted output. Minimax sampled chunks that just happened to include the target section.

Methodology Implication

While the chunk index offers navigational structure, agents don’t consult it for fragment resolution by default. The prompt’s request to return content exactly as received may work against fragment-targeting, influencing agents to prioritize output fidelity over a smaller retrieval scope. The 8-of-10 miss rate suggests fragment-targeting is uncommon, but not so rare that the hits can be treated as a fluke. Minimax’s incidental hit is a separate finding: small-chunk sampling can accidentally recover the target section, which may inflate success metrics if output content passes other verification without examining the agent’s navigational reasoning.
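
Fragment resolution against a chunk index might look like the sketch below. It assumes the index can be reduced to a mapping from chunk position to section heading, which is a simplification of whatever Cascade actually returns; the function name, the sample headings, and the example URL are hypothetical.

```python
from urllib.parse import urldefrag


def target_chunks(url: str, chunk_headings: dict) -> list:
    """Return positions of chunks whose heading matches the URL fragment."""
    fragment = urldefrag(url).fragment
    if not fragment:
        # No fragment: fall back to full-document retrieval.
        return sorted(chunk_headings)
    wanted = fragment.replace("_", " ").casefold()
    return sorted(
        pos for pos, heading in chunk_headings.items()
        if heading.casefold() == wanted
    )


# Hypothetical chunk index: position -> heading.
headings = {0: "Example", 1: "History", 2: "Design"}
print(target_chunks("https://en.wikipedia.org/wiki/Example#History", headings))
```

Recording which positions an agent requested, alongside the positions this kind of lookup would select, is one way to tell an intentional hit like Grok’s from an incidental one like Minimax’s.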


Write Ceiling, Output Fidelity

Across SC-3 runs, retrieval behavior was uniform. Agents analyzed all 60 chunks but struggled to report what they read with a raw output file. The prompt instructs “retrieve the content from this URL and return it EXACTLY as you received it,” which seemingly created a failure mode that no agent named or resolved cleanly. Each agent retrieved chunk content into context, then encountered obstacles on the write side: shell heredocs with special characters, \n escape sequences, and chunk metadata at a volume that caused terminal commands to hang, Python scripts to loop, and file writes to produce partial or empty output:

Agent | Strategy | Outcome
Gemini | Wandered through project files, tried npx, curl, Python | 3 different artifacts, 276 KB partial HTML/JSON
GLM | Heredoc failure, switched to curl | 774 KB raw HTML, not Cascade output
GPT | Claimed curl bypass; file never saved | Metrics reported, referenced without verification path
Opus | Sequential heredoc appends in 6 batches | 1.05 KB stub; wrong directory + filename, user canceled x3
SWE | Sequential heredoc appends, Python batching | Partial file; user canceled x3
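
The quoting half of the write-side problem can be shown in isolation. The sketch below writes chunk text through Python file I/O, where characters that break a shell heredoc are just bytes. It assumes the chunk text is already available as strings in a single process, which may not hold inside Cascade, and it does nothing about the token cost of re-emitting the content; the function name is hypothetical.

```python
from pathlib import Path


def write_chunks(chunks: list, out_path: Path) -> int:
    """Concatenate chunks to a file without passing them through a shell.

    newline="" disables newline translation so the bytes written match the
    chunk text exactly. Returns the resulting file size in bytes.
    """
    out_path.parent.mkdir(parents=True, exist_ok=True)
    with out_path.open("w", encoding="utf-8", newline="") as handle:
        for chunk in chunks:
            handle.write(chunk)
    return out_path.stat().st_size
```

Content containing "$HOME", backticks, a literal backslash-n, or a line reading "EOF" passes through unchanged, since nothing here is interpreted by a shell.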

Agents acknowledged how tedious writing raw output would be, but not one claimed a write ceiling. Instead, each entered a loop of strategy-switching, treating the failure as a solvable engineering problem rather than a constraint of the environment. SWE-1.6 and Claude Opus 4.7 both showed token-cost awareness mid-task, “this verbatim approach requires piping ~200–400 KB of raw text through shell commands, which is very token-expensive”, but neither made an early exit. They identified symptoms without diagnosing the condition. Agents that produced files did so by abandoning the Cascade pipeline. GLM-5.1 and Gemini 3.1 saved raw HTML via curl, essentially writing content the prompt didn’t request; the verification script can’t meaningfully evaluate non-Cascade-specific behavior. GPT-5.3-Codex claimed to save a file and didn’t. SWE and Opus produced stubs too small to verify.

Agents ruminated on the prompt language “EXACTLY as you received it”: does that mean Cascade’s chunk index with metadata wrappers, already processed, or something pre-processed? Opus attempted to clarify mid-task and asked questions, while the others chose an interpretation and kept cycling through approaches, similar to the silent-resolution pattern described in Agent as Unreliable Methodology Validator.

As agents seemingly hit a write ceiling, they didn’t reason a way out, but drifted. Gemini read the README.md and the verification script multiple times, and re-read the prompt, apparently trying to re-derive the task from project context. SWE reasoned through its own chunk history to reconstruct a write strategy. Opus asked a clarifying question, received an answer, then got stuck again on the same heredoc problem. Gemini tried npx. None of these strategies addressed the actual constraint. The pattern matches the one described in Agentic Inaction; Agentic Task Drift, Token Overflow; and File Persistence Failures: agents that recognize an obstacle express uncertainty, attempt adjacent actions, and produce confident-looking output without naming the obstacle as a blocker or asking whether the task is achievable as stated.

Methodology Implication

The read-write asymmetry is a restriction only the raw track could uncover. It reflects a structural mismatch between what Cascade’s view_content_chunk produces at scale and what shell tooling can write back out. A write ceiling adds new reasons to question path compliance and self-reporting fidelity. While agents claim to have received all 60 chunks, chat output alone can’t complete verification. Prompts specifying a target format may produce different results, but this testing framework is about capturing default web fetch behavior. Even though the output variety makes hypothesis assessment difficult and pokes holes in the verification script metrics, the variety is the finding, speaking to the challenges of combining qualitative and quantitative testing approaches.