Agent Ecosystem Testing

Key Findings for Cascade’s Web Search Behavior, Raw


Test Workflow

  1. Run python web_search_testing_framework.py --test {test ID} --track raw
  2. Review terminal output
  3. Copy the provided prompt, which instructs the agent to retrieve the URL, return content exactly as received, and save output to results/raw/raw_output_{test_ID}.txt
  4. Open a new Cascade session in Windsurf, paste the prompt into the chat window
  5. Approve web fetch calls and terminal commands; cancel if any run loops hang
  6. Run the verification script against the saved file; capture path compliance, file size,
    checksum, and truncation indicators
  7. Log structured metadata as described in framework-reference.md
  8. Ensure log results are saved to /results/cascade-raw/results.csv
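
Step 6 can be sketched as below. This is a minimal illustration, not the framework's actual verification script: the function name `verify_raw_output`, the marker strings, and the returned field names are all assumptions chosen for the example.

```python
import hashlib
from pathlib import Path

# Hypothetical truncation markers; the real script's list is not shown in this document.
TRUNCATION_MARKERS = ("[truncated]", "... (truncated)", "<!-- elided -->")


def verify_raw_output(path, expected_dir):
    """Collect path compliance, size, checksum, and truncation hints for one file."""
    file_path = Path(path)
    if not file_path.is_file():
        # A claimed save with no file on disk is itself a finding worth logging.
        return {"exists": False, "path_compliant": False}

    data = file_path.read_bytes()
    text = data.decode("utf-8", errors="replace")
    return {
        "exists": True,
        # Compliant only if the file sits directly in the expected directory.
        "path_compliant": file_path.resolve().parent == Path(expected_dir).resolve(),
        "size_bytes": len(data),
        "size_chars": len(text),
        "md5": hashlib.md5(data).hexdigest(),
        "truncation_markers": [m for m in TRUNCATION_MARKERS if m in text],
    }
```

Returning a result for a missing file, rather than raising, keeps false completions in the same log as successful writes.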

Raw output file presence, path compliance, and content fidelity are tracked. Claiming a save without writing a file, referencing another agent’s file, and generating structurally accurate but semantically unmeaningful content are distinct failure modes; see the analysis in Friction: Raw.


Platform Limit Summary

| Limit | Observed |
|---|---|
| Hard Character Limit | None detected: output sizes ranged from 275 to 56,256,891 chars; ceilings were agent-imposed and/or write-stage failures, not explicit platform-imposed byte limits |
| Hard Token Limit | None detected: token counts ranged from 52 to 12,782,469; BL-3’s SWE error message, “model's generation exceeded the maximum output token limit”, was the first to suggest a write ceiling |
| Write Strategy | Capability doesn’t predict output quality, but agent reasoning does: pipeline acceptance (runs cluster within a narrow size band per URL); deliberate elision (Opus was the only agent to ask questions mid-session); curl bypass (files pass verification without prose); silent failure (false completions, reuse, environment-degrading output) |
| Content Selection Behavior | Two-stage chunked retrieval, mirroring the interpreted and explicit tracks: most agents used read_url_content followed by view_content_chunk; SC-2’s SWE called search_web once as a fallback in response to the redirect, which found the URL but didn’t return any content |
| Truncation Pattern | Write-stage asymmetry: view_content_chunk retrieval was reliable across agents and chunk counts; most failure modes were write-related: Python heredoc errors, token ceiling and/or under-reporting, false completions, file reuse |
| Redirect Chains | Size-influenced, behavior-dependent: all agents followed EC-3’s 5-hop redirect chain; SC-2’s single cross-domain redirect caused read_url_content to halt, with an error message that referenced the destination |
| Auto-pagination | Spotty, agent-dependent: full retrieval was common but not guaranteed, even at low chunk counts; wide variance by model family; no agent paginated meaningfully at SC-2’s 1,026-chunk corpus; see Agentic Pagination Depth |
| curl Bypass | Consistent failure: agents that correctly diagnose that the Cascade pipeline returns Markdown-ish output, not raw HTML, often switch to curl; output files are architecturally correct but contain shells without prose |
| False Completion Claims | Distinct failure mode: SWE runs of BL-1, OP-1; GPT-5.3-Codex runs of BL-3, SC-3; Gemini run of EC-6; agents reported metrics of saved files that were never written |
| Cross-Agent File Reuse | Confirmed via MD5 checksum: BL-2, BL-3, OP-1, EC-6; once a plausible file exists in the workspace, agents may satisfy the persistence requirement by reference rather than by writing |
| Path Compliance | Agent-dependent: the prompt instructs saving to raw/, which doesn’t exist; BL-2’s GLM created it, later agents referenced cascade-raw/ and/or failed to save; cross-agent file visibility suggests worktree state is shared across Hybrid Arena slots, not isolated |
| URL Fragment Targeting | Behavioral, not architectural: the chunk index exposes headers and fragment targets; OP-1’s Grok-3 was the only agent to use it for navigation; 8 of 10 defaulted to full-doc retrieval; the “EXACTLY as received” prompt may suppress targeting, making full retrieval seem like the safer interpretation |

Results Details

   
| Field | Value |
|---|---|
| Agent Selector | Hybrid Arena: 5 slots per run; OP-1 includes two arena rounds |
| Agents Observed | Claude Opus 4.7, Claude Sonnet 4.6, Gemini 3.1, GLM-5.1, GPT-5.3-Codex, GPT-5.4, GPT-5.5, Kimi K2.6, Minimax M2.5, SWE-1.6, xAI Grok-3 |
| Total Runs | 66 |
| Distinct URLs | 11 |
| Input Size Range | Rendered estimate: ~2 KB - 256 KB; pipeline output, depending on retrieval method: 275 B - 56 MB |
| Truncation Events | Explicitly reported: 5 / 66; chunked architecture often acknowledged as lossy by design |
| Average Output Size | 1,129,230 chars |
| Output Size Range | 275 - 56,256,891 chars |
| Average Token Count | 266,105 tokens |
| Token Count Range | 52 - 12,782,469 tokens |
| Approval-gated Fetch | 56 / 66 runs prompted for approval |
| Auto-pagination | 48 runs |
| Failures | BL-1: Gemini task drift, token overflow; EC-6: Gemini retrieval theater; OP-1: most agents don’t isolate target section; OP-4: retrieval success, but no clean output; SC-2: redirect halt |
| URL Fragment Handling | OP-1’s Grok was the only agent to intentionally target #History; Minimax analyzed #History incidentally via sampling; 8 defaulted to full-doc retrieval |

Agentic Pagination Depth

As observed in the interpreted and explicit tracks, agents consistently use read_url_content to fetch URLs, but whether they proceed to exhaust view_content_chunk varies substantially by agent, chunk count, and, exclusively on the raw track, how they strategize the write task. Chunks-analyzed remains a primary behavioral variable in this dataset.

Full pagination appears more consistently throughout the raw track, suggesting the write task gives agents a reason to retrieve each chunk. Document size and structure still have an impact: OP-1, OP-4, and SC-2 produce the widest variance. SC-2 confirms abandonment is universal at 1,026 chunks regardless of agent family.
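
Chunks-analyzed can be expressed as a coverage ratio. The sketch below is illustrative only: it assumes each run logs the chunk positions passed to view_content_chunk and the total chunk count, and that positions are zero-indexed. The function name and the sample run are hypothetical.

```python
def pagination_coverage(chunks_viewed, total_chunks):
    """Return the fraction of distinct, in-range chunk positions an agent viewed."""
    if total_chunks <= 0:
        # Inline payloads (no chunking) have nothing to paginate.
        return None
    # Deduplicate and drop out-of-range positions so repeats don't inflate coverage.
    valid = {i for i in chunks_viewed if 0 <= i < total_chunks}
    return len(valid) / total_chunks


# Hypothetical run: three distinct chunks of a 1,026-chunk document, one repeated.
coverage = pagination_coverage([0, 1, 1, 2], 1026)
print(f"{coverage:.2%}")
```

Counting distinct positions rather than calls means an agent that re-reads the same chunk is not credited with deeper pagination.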

Agentic Write Performance

While the pagination depth map shows claimed retrieval, what agents reported reading, the write outcome map shows verified output: what ended up on disk, and in what form. The two maps together reveal the gap. EC-6’s Gemini run reads as 29% pagination coverage, but doesn’t map to a file; a content diff checker and MD5 checksum match confirmed it was all retrieval theater.
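
The checksum comparison can be sketched as a grouping step over the saved output files. This is an assumed approach for illustration, not the checker used in these tests; `group_by_md5` and `find_reuse_candidates` are hypothetical names.

```python
import hashlib
from collections import defaultdict
from pathlib import Path


def group_by_md5(paths):
    """Map each MD5 digest to the files that share it."""
    groups = defaultdict(list)
    for path in paths:
        digest = hashlib.md5(Path(path).read_bytes()).hexdigest()
        groups[digest].append(str(path))
    return dict(groups)


def find_reuse_candidates(paths):
    """Return only the digests shared by more than one output file."""
    return {d: files for d, files in group_by_md5(paths).items() if len(files) > 1}
```

A shared digest only flags a candidate. Confirming reuse still requires checking tool calls or the thought panel, since two agents could in principle write identical bytes independently.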

Tests where pagination depth is high but write outcomes are spotty (BL-3, OP-1, OP-4, SC-3) are where the read-write asymmetry is most visible. EC-3 is the only test with a clean success sweep, likely because the URL content didn’t require chunking at all. While EC-6 and SC-4 appeared to produce accurate output, many runs included false completions and file reuse.

Truncation Analysis

| # | Finding | Tests | Observed | Conclusion |
|---|---|---|---|---|
| 1 | No fixed character or token ceiling detected at retrieval stage | All tests | Output sizes ranged from 275 B to 56 MB; no run hit a tool-imposed retrieval byte ceiling | Ceilings self-imposed and/or write-stage failures: deliberate elision or environment degradation; retrieval pipeline has no confirmed upper bound |
| 2 | Output token ceiling as a write-stage failure mechanism | BL-3 | SWE exceeded the agent’s output token limit explicitly mid-write, visible in the thought panel in real time; first direct observation of this ceiling across any Cascade track | Ceiling real but write-related, not retrieval-related. Prior tracks inferred it; BL-3 observed it directly |
| 3 | Read-write asymmetry as dominant structural finding | SC-3, SC-4, BL-3, OP-4 | Most agents successfully retrieved all chunks in every test; write success was substantially lower across the same tests | Retrieval via view_content_chunk reliable; the obstacle is reassembling and persisting at scale |
| 4 | Auto-pagination confirmed, but doesn’t predict output success | All tests | 48 of 66 runs auto-paginated; 3 of 4 BL-1 auto-paginating runs still failed to produce a valid output file | H5-yes across the dataset; behavior robust; doesn’t guarantee file persistence or content fidelity |
| 5 | Auto-pagination threshold ~1K chunks | SC-2, BL-2, EC-3 | Most agents fully paginated at ≤ 50 chunks; no agent auto-paginated SC-2’s 1,026-chunk corpus | Threshold exists; exact boundary is unconfirmed, but likely in the 100 - 1K chunks range |
| 6 | curl bypass produces semantically empty output | BL-1, BL-3, SC-2, SC-3, EC-1, EC-6 | Agents that correctly diagnose the pipeline as returning processed Markdown switch to curl; resulting files contain raw HTML or JS skeletons, architecturally correct, textually less meaningful | Pipeline abandonment is the dominant response to the fidelity instinct, producing files that pass verification while missing target content |
| 7 | Cross-agent file reuse confirmed at checksum level | BL-2, BL-3, EC-6, OP-1 | Gemini, GLM produced output files with identical MD5 checksum; GLM ran earlier, wrote first; Gemini’s thought panel narrated retrieval while making no corresponding tool calls | Path compliance is an independent variable; file presence at the correct path doesn’t confirm independent retrieval |
| 8 | False completion claims as a distinct failure mode | BL-1, BL-3, EC-6, SC-3, OP-1 | Gemini, GPT-5.3-Codex, SWE reported metrics, file paths for content that was never written | Confident assertions without an uncertainty signal are structurally different from spirals and early stops, but all three failure modes produce the same outcome: no valid output file |
| 9 | Redirect halt behavior is confirmed as server-side, not tool-layer rewriting | SC-2 | Three agents successfully called read_url_content a second time against the redirect destination surfaced in the error payload and received valid chunked responses, not silent pre-network URL substitution | read_url_content makes the network call, receives the redirect, and halts rather than automatically following; destination is actionable via follow-up call |
| 10 | Chunking pipeline size threshold | EC-3 | 5-hop redirect chain returned ~366 B JSON inline via read_url_content alone; view_content_chunk not called in any run | Small payloads return inline without triggering the two-fetch pipeline |
| 11 | URL fragment targeting is behavioral, not architectural | OP-1 | 8 of 10 agents retrieved all 92 chunks rather than targeting #History at chunk position 17; chunk index exposes the section header; Grok was the only agent to have used it for navigation | Fragment-targeting achievable, but absent by default; agents attending to output completeness may prioritize full-doc collection |
| 12 | Prompt size estimates act as a confound for fidelity-sensitive agents | BL-1, OP-4 | BL-1’s ~85 KB prompt estimation is architecturally unreachable; Cascade returns ~8–32 KB of filtered Markdown; curl returns ~508 KB of raw HTML; some agents spiraled or truncated trying to reach the target | Prompt estimation became an irresolvable constraint rather than a verification guide |
| 13 | “EXACTLY as received” underspecified, resolved silently | BL-1, OP-4, SC-3, SC-4 | Most agents interpreted output format as chunk index, metadata wrappers, raw HTTP response via curl, or semantic Markdown without flagging ambiguity or asking for clarification | Instruction underspecification is reasoned around across model families; only Opus identified the tradeoff in chat while strategizing a write plan |
| 14 | search_web not invoked as a retrieval mechanism | Most tests | Across 66 runs, only SC-2’s SWE called search_web after retrieval failure, and it returned only a URL, not content | H4 untested across raw track; URL provision alone doesn’t trigger search_web |
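
Finding 6 describes files that pass verification while carrying little readable text. One way to flag them is a prose-density heuristic, sketched below with Python's standard `html.parser`. This is an assumption for illustration, not part of the test framework; `prose_density` is a hypothetical name, and any cutoff for "skeleton" would need calibrating against real outputs.

```python
from html.parser import HTMLParser


class _VisibleText(HTMLParser):
    """Collect text that sits outside <script> and <style> elements."""

    def __init__(self):
        super().__init__()
        self._skip_depth = 0
        self.parts = []

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if not self._skip_depth:
            self.parts.append(data)


def prose_density(html):
    """Return visible-text characters divided by total characters."""
    if not html:
        return 0.0
    parser = _VisibleText()
    parser.feed(html)
    parser.close()
    # Collapse whitespace so indentation between tags doesn't count as prose.
    visible = " ".join(" ".join(parser.parts).split())
    return len(visible) / len(html)
```

A JS-rendered shell with an empty root element scores near zero, while a page whose body holds the article text scores much higher, which separates the two cases that file size alone cannot.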

Perception Gap

The write outcome map is the only verified signal in this dataset. Agent self-report, output size, and path compliance are not sufficient on their own to distinguish genuine retrieval from curl bypass, deliberate elision, or retrieval theater; that requires cross-agent checksum comparison and thought panel inspection.

| Test | Expected | Received | Delivery Ratio | Agent Characterization |
|---|---|---|---|---|
| EC-6 Raw Markdown | ~60 KB | ~96 KB; 3 agents, independent writes | ~100% | “Complete, chunk assembly variation within ±858 chars; elision markers are source false positives” |
| SC-4 Markdown Guide | ~30 KB | Sonnet 30.44 KB; Minimax 32.33 KB | ~100% | “Complete, breadcrumb heading injection at chunk boundaries inflates Minimax output; 6 elision markers present, but may be tool-layer assembly artifacts” |
| EC-3 Redirect JSON | ~2 KB | 366 B; identical output | ~100% | “Complete, 5-hop redirect chain followed cleanly; unique X-Amzn-Trace-Id per run confirms independent live requests” |
| SC-1 Gemini API Docs | ~40 KB | 38–44 KB chunk cluster; 10.25 KB via direct fetch | chunk ~97%; direct ~60% | “Chunk cluster structurally identical across agents; direct fetch cleaner, but loses code blocks, navigation structure” |
| SC-3 Wikipedia | ~100 KB | SWE 69.5 KB pipeline; GLM/Gemini 275–774 KB via curl | pipeline ~68%; curl ~270–760% | “Pipeline converts HTML tables to plain text lists, stripping column-row structure entirely; 255 table rows confirmed in raw HTML, 0 preserved in any Cascade-native output” |
| EC-1 Gemini API SPA | ~100 KB | SWE, Opus ~33–35 KB pipeline; GPT/Gemini ~118 KB via curl | pipeline ~32–34%; curl ~115% | “JavaScript SPA handled by Cascade pre-processing layer; SWE, Opus extracted semantic content, code blocks, agent descriptions; curl returned raw HTML skeleton regardless of agent” |
| BL-1 MongoDB Docs | ~85 KB | Opus ~8 KB; GLM ~32 KB | ~9–38% | “Pipeline output is 8–32 KB of filtered Markdown; raw HTML is 508 KB; no tool produces estimated size” |
| BL-3 Tutorial | ~250 KB | Opus ~7.4 KB; GLM ~468 KB | ~3% pipeline; ~180% via curl | “Pipeline abandoned for curl; curl output Gatsby/React skeleton, no tutorial body content” |
| SC-2 Anthropic Docs | ~80 KB | Kimi 53.65 MB | Full docs corpus | “Scale outlier; VS Code tokenization, highlighting, scroll disabled on open; file exists, environment degraded” |
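
For rows such as BL-1, the Delivery Ratio column appears to be the received size as a percentage of the prompt's expected size. The sketch below assumes that definition; it is inferred from the figures rather than stated in the source, and `delivery_ratio` is a hypothetical helper name.

```python
def delivery_ratio(received_kb, expected_kb):
    """Return received size as a percentage of the prompt's size estimate."""
    if expected_kb <= 0:
        raise ValueError("expected_kb must be positive")
    return 100.0 * received_kb / expected_kb


# BL-1 figures from the table: ~85 KB expected, ~8 KB (Opus) and ~32 KB (GLM) received.
for agent, received in (("Opus", 8), ("GLM", 32)):
    print(f"BL-1 {agent}: {delivery_ratio(received, 85):.0f}%")
```

With the BL-1 inputs this prints 9% and 38%, matching the ~9–38% range reported for that test. Ratios above 100%, as in the curl rows, indicate output larger than the estimate, not more complete retrieval.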