Key Findings for Cascade’s Web Search Behavior, Raw
Test Workflow
- Run python web_search_testing_framework.py --test {test ID} --track raw
- Review terminal output
- Copy the provided prompt instructing agent to retrieve the URL and return content exactly as received, saving output to results/raw/raw_output_{test_ID}.txt
- Open a new Cascade session in Windsurf, paste the prompt into the chat window
- Approve web fetch calls and terminal commands; cancel if any run loops hang
- Run the verification script against the saved file; capture path compliance, file size, checksum, and truncation indicators
- Log structured metadata as described in framework-reference.md
- Ensure log results saved to /results/cascade-raw/results.csv
Raw output file presence, path compliance, and content fidelity tracked. Claiming a save without writing a file, referencing another agent’s file, or generating structurally accurate but semantically unmeaningful content all describe distinct failure modes; analysis in Friction: Raw.
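The checks named in the workflow (presence, path compliance, file size, checksum, truncation indicators) can be sketched as follows. This is a minimal illustration, not the framework's actual verification script: `verify_raw_output`, `DEFAULT_EXPECTED_DIR`, and `TRUNCATION_MARKERS` are hypothetical names, and the marker strings are guesses at what an elision might look like.

```python
import hashlib
from pathlib import Path

# Hypothetical values: the real script's expected directory and markers may differ.
DEFAULT_EXPECTED_DIR = Path("results/raw")
TRUNCATION_MARKERS = ("[truncated]", "... (content omitted)", "<!-- elided -->")


def verify_raw_output(path: Path, expected_dir: Path = DEFAULT_EXPECTED_DIR) -> dict:
    """Collect presence, path compliance, size, checksum, and truncation hints."""
    report = {
        "path": str(path),
        "exists": path.is_file(),
        "path_compliant": path.parent == expected_dir,
    }
    if not report["exists"]:
        # A claimed save with no file on disk is a false completion candidate.
        return report

    data = path.read_bytes()
    text = data.decode("utf-8", errors="replace")
    report["size_bytes"] = len(data)
    report["size_chars"] = len(text)
    report["md5"] = hashlib.md5(data).hexdigest()
    # Record which (assumed) elision markers appear anywhere in the file.
    report["truncation_markers"] = [m for m in TRUNCATION_MARKERS if m in text]
    return report
```

A report with `exists` set to false after an agent claimed a save corresponds to the first failure mode above; the `md5` field is what allows later cross-agent comparison.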
Platform Limit Summary
| Limit | Observed |
|---|---|
| Hard Character Limit | None detected: output sizes ranged from 275 to 56,256,891 chars; ceilings agent-imposed and/or write-stage failures, not explicitly platform-imposed byte limits |
| Hard Token Limit | None detected: token counts ranged from 52 to 12,782,469; BL-3’s SWE error message, "model's generation exceeded the maximum output token limit", was the first to suggest a write ceiling |
| Write Strategy | Capability doesn’t predict output quality, but agent reasoning does: pipeline acceptance - runs cluster within narrow size band per URL; deliberate elision - Opus only agent to ask questions mid-session; curl bypass - files pass verification without prose; silent failure - false completions, reuse, environment-degrading output |
| Content Selection Behavior | Two-stage chunked retrieval: mirrors interpreted, explicit tracks - most agents used read_url_content → view_content_chunk; SC-2’s SWE called search_web once as a fallback in response to redirect, found URL but didn’t return any content |
| Truncation Pattern | Write-stage asymmetry: view_content_chunk retrieval reliable across agents, chunk counts; most failure modes write-related: Python heredoc errors, token ceiling and/or under-reporting, false completions, file reuse |
| Redirect Chains | Size-influenced, behavior-dependent: all agents follow EC-3 5-hop redirect chain; SC-2 single cross-domain redirect caused read_url_content halt, error message referenced destination |
| Auto-pagination | Spotty, agent-dependent: full retrieval common, but not guaranteed, even at low chunk counts; wide variance by model family; no agent paginated meaningfully at SC-2’s 1,026-chunk corpus; see Agentic Pagination Depth |
| curl Bypass | Consistent failure: agents that correctly diagnose that the Cascade pipeline returns Markdown-ish output, not raw HTML, often switch to curl; output files architecturally correct, but contain shells without prose |
| False Completion Claims | Distinct failure mode: SWE runs of BL-1, OP-1; GPT-5.3-Codex runs of BL-3, SC-3; Gemini run of EC-6; agents reported metrics of saved files that were never written |
| Cross-Agent File Reuse | Confirmed via MD5 checksum: BL-2, BL-3, OP-1, EC-6 - once a plausible file exists in the workspace, agents may satisfy persistence requirement by reference rather than by writing |
| Path Compliance | Agent-dependent: prompt instructs saving to raw/, which doesn’t exist; BL-2’s GLM created it, later agents referenced cascade-raw/ and/or failed to save; cross-agent file visibility suggests worktree state is shared across Hybrid Arena slots, not isolated |
| URL Fragment Targeting | Behavioral, not architectural: chunk index exposes headers, fragment targets; OP-1’s Grok-3 only agent to use it for navigation; 8 of 10 defaulted to full-doc retrieval; “EXACTLY as received” prompt may suppress targeting, making full retrieval seem like the safer interpretation |
Results Details
| Metric | Value |
|---|---|
| Agent Selector | Hybrid Arena - 5 slots per run; OP-1 includes two arena rounds |
| Agents Observed | Claude Opus 4.7, Claude Sonnet 4.6, Gemini 3.1, GLM-5.1, GPT-5.3-Codex, GPT-5.4, GPT-5.5, Kimi K2.6, Minimax M2.5, SWE-1.6, xAI Grok-3 |
| Total Runs | 66 |
| Distinct URLs | 11 |
| Input Size Range | estimation, rendered: ~2 KB - 256 KB; pipeline output, depending on retrieval method: 275 B - 56 MB |
| Truncation Events | 5 / 66 explicitly reported; chunked architecture often acknowledged as lossy by design |
| Average Output Size | 1,129,230 chars |
| Output Size Range | 275 - 56,256,891 chars |
| Average Token Count | 266,105 tokens |
| Token Count Range | 52 - 12,782,469 tokens |
| Approval-gated Fetch | 56 / 66 runs prompted for approval |
| Auto-pagination | 48 runs |
| Failures | BL-1 Gemini task drift, token overflow; EC-6 Gemini retrieval theater; OP-1 most agents don’t isolate target section; OP-4 retrieval success, but no clean output; SC-2 redirect halt |
| URL Fragment Handling | OP-1’s Grok only agent to intentionally target #History; Minimax analyzed #History incidentally via sampling; 8 defaulted to full-doc retrieval |
Agentic Pagination Depth
As observed in the interpreted and explicit tracks, agents consistently use read_url_content to fetch URLs, but whether
they proceed to exhaust view_content_chunk varies substantially by agent, by chunk count, and, on the raw track only, by
how they strategize the write task. Chunks-analyzed remains a primary behavioral variable in this dataset.
Full pagination appears more consistently throughout the raw track, suggesting the write task gives agents a reason to retrieve
each chunk. Document size and structure still have an impact: OP-1, OP-4, and SC-2 produce the widest variance.
SC-2 confirms abandonment is universal at 1,026 chunks regardless of agent family.
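As a rough illustration of the chunks-analyzed variable, pagination coverage can be expressed as the share of the chunk index an agent requested. The function names and the three depth labels below are hypothetical and chosen only for this sketch; the framework may bucket runs differently.

```python
def pagination_coverage(chunks_viewed: int, total_chunks: int) -> float:
    """Fraction of the chunk index that an agent actually requested."""
    if total_chunks <= 0:
        # Inline responses with no chunk index (as in EC-3) count as fully retrieved.
        return 1.0
    return min(chunks_viewed, total_chunks) / total_chunks


def classify_depth(coverage: float) -> str:
    """Bucket a coverage ratio; labels are illustrative, not framework terms."""
    if coverage >= 1.0:
        return "full"
    if coverage > 0.0:
        return "partial"
    return "abandoned"
```

For example, an agent that viewed all 92 chunks of OP-1 would score 1.0 ("full"), while a hypothetical run that stopped after 17 of 92 would be classified "partial".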
Agentic Write Performance
The pagination depth map shows claimed retrieval, i.e. what agents reported reading, while the
write outcome map shows verified output: what ended up on disk, and in what form. The two maps together reveal the gap.
EC-6’s Gemini run reads as 29% pagination coverage, but doesn’t map to a file; a content diff checker and MD5 checksum
match confirmed it was all retrieval theater.
Tests where pagination depth is high but write outcomes are spotty (BL-3, OP-1, OP-4, SC-3) are where the read-write
asymmetry is most visible. EC-3 is the only test with a clean success sweep, likely because the URL content didn’t require
chunking at all. While EC-6 and SC-4 appeared to produce accurate output, many runs included false completions and file reuse.
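One way to make the gap between claimed and verified output concrete is to classify each run from what is on disk rather than from the agent's report. The sketch below is an assumption-laden illustration: `classify_write_outcome` and its four labels are hypothetical, and a checksum match is only a signal of reuse that still needs thought panel inspection to confirm.

```python
from typing import Optional, Set


def classify_write_outcome(
    claimed_saved: bool,
    file_exists: bool,
    md5: Optional[str],
    earlier_md5s: Set[str],
) -> str:
    """Label a run by verified disk state instead of agent self-report."""
    if not file_exists:
        # The agent reported a save, but nothing was written.
        return "false completion" if claimed_saved else "no output"
    if md5 is not None and md5 in earlier_md5s:
        # Identical bytes to a file an earlier agent wrote in the same workspace.
        return "file reuse"
    return "independent write"
```

Feeding it the checksums of files written by earlier arena slots keeps the classification order-dependent, which matches how reuse can only occur once a plausible file already exists.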
Truncation Analysis
| # | Finding | Tests | Observed | Conclusion |
|---|---|---|---|---|
| 1 | No fixed character or token ceiling detected at retrieval stage | All tests | Output sizes ranged from 275 B to 56 MB; no run hit a tool-imposed retrieval byte ceiling | Ceilings self-imposed and/or write-stage failures: deliberate elision or environment degradation; retrieval pipeline has no confirmed upper bound |
| 2 | Output token ceiling as a write-stage failure mechanism | BL-3 | SWE exceeded the agent’s output token limit explicitly mid-write, visible in the thought panel in real time; first direct observation of this ceiling across any Cascade track | Ceiling real but write-related, not retrieval-related. Prior tracks inferred it; BL-3 observed it directly |
| 3 | Read-write asymmetry as dominant structural finding | SC-3, SC-4, BL-3, OP-4 | Most agents successfully retrieved all chunks in every test; write success was substantially lower across the same tests | Retrieval via view_content_chunk reliable; obstacle is reassembling, persisting at scale |
| 4 | Auto-pagination confirmed, but doesn’t predict output success | All tests | 48 of 66 runs auto-paginated; 3 of 4 BL-1 auto-paginating runs still failed to produce a valid output file | H5-yes across the dataset; behavior robust; doesn’t guarantee file persistence or content fidelity |
| 5 | Auto-pagination threshold ~1K chunks | SC-2, BL-2, EC-3 | Most agents fully paginated at ≤ 50 chunks; no agent auto-paginated SC-2’s 1,026-chunk corpus | Threshold exists; exact boundary is unconfirmed, but likely in 100 - 1K chunks range |
| 6 | curl bypass produces semantically empty output | BL-1, BL-3, SC-2, SC-3, EC-1, EC-6 | Agents that correctly diagnose the pipeline as returning processed Markdown switch to curl; resulting files contain raw HTML or JS skeletons, architecturally correct, textually less meaningful | Pipeline abandonment is the dominant response to the fidelity instinct, producing files that pass verification while missing target content |
| 7 | Cross-agent file reuse confirmed at checksum level | BL-2, BL-3, EC-6, OP-1 | Gemini, GLM produced output files with identical MD5 checksum; GLM ran earlier, wrote first; Gemini’s thought panel narrated retrieval while making no corresponding tool calls | Path compliance independent variable; file presence at correct path doesn’t confirm independent retrieval |
| 8 | False completion claims as a distinct failure mode | BL-1, BL-3, EC-6, SC-3, OP-1 | Gemini, GPT-5.3-Codex, SWE reported metrics, file paths for content that was never written | Confident assertions without uncertainty signal structurally different from spirals, early stops, but all three failure modes produce same outcome: no valid output file |
| 9 | Redirect halt behavior is confirmed as server-side, not tool-layer rewriting | SC-2 | Three agents successfully called read_url_content a second time against the redirect destination surfaced in the error payload, received valid chunked responses, not silent pre-network URL substitution | read_url_content makes network call, receives redirect, halts rather than automatically follows; destination is actionable via follow-up call |
| 10 | Chunking pipeline size threshold | EC-3 | 5-hop redirect chain returned ~366 B JSON inline via read_url_content alone; view_content_chunk not called in any run | Small payloads return inline without triggering the two-fetch pipeline |
| 11 | URL fragment targeting is behavioral, not architectural | OP-1 | 8 of 10 agents retrieved all 92 chunks rather than targeting #History at chunk position 17; chunk index exposes the section header; Grok only agent to have used it for navigation | Fragment-targeting achievable, but absent by default; agents attending to output completeness may prioritize full-doc collection |
| 12 | Prompt size estimates act as a confound for fidelity-sensitive agents | BL-1, OP-4 | BL-1 ~85 KB prompt estimation architecturally unreachable; Cascade returns ~8–32 KB of filtered Markdown; curl returns ~508 KB of raw HTML; some agents spiraled or truncated trying to reach the target | Prompt estimation became irresolvable constraint rather than a verification guide |
| 13 | “EXACTLY as received” underspecified, resolved silently | BL-1, OP-4, SC-3, SC-4 | Most agents interpreted output format as chunk index, metadata wrappers, raw HTTP response via curl, or semantic Markdown without flagging ambiguity or asking for clarification | Instruction underspecification is reasoned-around across model families; only Opus identified the tradeoff in chat while strategizing a write plan |
| 14 | search_web not invoked as a retrieval mechanism | Most tests | Across 66 runs only SC-2’s SWE called search_web after retrieval failure, which only returned a URL, not content | H4 untested across raw track; URL provision alone doesn’t trigger search_web |
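The checksum-level comparison behind finding 7 can be sketched as a grouping of output files by MD5 digest. This is an illustrative helper under stated assumptions, not the framework's own tooling: `find_shared_checksums` is a hypothetical name, and a shared digest only shows that two files hold identical bytes, not which agent wrote first.

```python
import hashlib
from collections import defaultdict
from pathlib import Path
from typing import Dict, Iterable, List


def find_shared_checksums(paths: Iterable[Path]) -> Dict[str, List[Path]]:
    """Group existing files by MD5 digest and keep only digests seen more than once."""
    groups: Dict[str, List[Path]] = defaultdict(list)
    for raw in paths:
        path = Path(raw)
        if path.is_file():
            digest = hashlib.md5(path.read_bytes()).hexdigest()
            groups[digest].append(path)
    # Digests shared by two or more files are candidates for cross-agent reuse.
    return {d: files for d, files in groups.items() if len(files) > 1}
```

Pairing each flagged group with run order and thought panel tool calls is still needed to separate reuse from two agents independently producing identical output.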
Perception Gap
The write outcome map is the only verified signal in this dataset. Agent self-report, output size, and path compliance are insufficient to distinguish genuine retrieval from
curl bypass, deliberate elision, or retrieval theater without cross-agent checksum comparison and thought panel inspection.
| Test | Expected | Received | Delivery Ratio | Agent Characterization |
|---|---|---|---|---|
| EC-6 Raw Markdown | ~60 KB | ~96 KB; 3 agents, independent writes | ~100% | “Complete, chunk assembly variation within ±858 chars; elision markers are source false positives” |
| SC-4 Markdown Guide | ~30 KB | Sonnet 30.44 KB; Minimax 32.33 KB | ~100% | “Complete, breadcrumb heading injection at chunk boundaries inflates Minimax output; 6 elision markers present, but may be tool-layer assembly artifacts” |
| EC-3 Redirect JSON | ~2 KB | 366 B, identical output | ~100% | “Complete, 5-hop redirect chain followed cleanly; unique X-Amzn-Trace-Id per run confirms independent live requests” |
| SC-1 Gemini API Docs | ~40 KB | 38–44 KB chunk cluster; 10.25 KB via direct fetch | chunk ~97%; direct ~60% | “Chunk cluster structurally identical across agents; direct fetch cleaner, but loses code blocks, navigation structure” |
| SC-3 Wikipedia | ~100 KB | SWE 69.5 KB pipeline; GLM/Gemini 275–774 KB via curl | pipeline ~68%; curl ~270–760% | “Pipeline converts HTML tables to plain text lists, stripping column-row structure entirely; 255 table rows confirmed in raw HTML, 0 preserved in any Cascade-native output” |
| EC-1 Gemini API SPA | ~100 KB | SWE, Opus ~33–35 KB pipeline; GPT/Gemini ~118 KB via curl | pipeline ~32–34%; curl ~115% | “JavaScript SPA handled by Cascade pre-processing layer; SWE, Opus extracted semantic content, code blocks, agent descriptions; curl returned raw HTML skeleton regardless of agent” |
| BL-1 MongoDB Docs | ~85 KB | Opus ~8 KB; GLM ~32 KB | ~9–38% | “Pipeline output is 8–32 KB of filtered Markdown; raw HTML is 508 KB; no tool produces estimated size” |
| BL-3 Tutorial | ~250 KB | Opus ~7.4 KB; GLM ~468 KB | ~3% pipeline; ~180% via curl | “Pipeline abandoned for curl; curl output Gatsby/React skeleton, no tutorial body content” |
| SC-2 Anthropic Docs | ~80 KB | Kimi 53.65 MB | Full docs corpus | “Scale outlier; VS Code tokenization, highlighting, scroll disabled on open; file exists, environment degraded” |
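The delivery ratio column appears to be received size over the prompt's expected size, expressed as a percentage. A minimal sketch under that assumption follows; `delivery_ratio` is a hypothetical helper, and because the expected size is itself an estimate, the result measures distance from the estimate rather than content fidelity.

```python
def delivery_ratio(received_kb: float, expected_kb: float) -> float:
    """Received size as a percentage of the prompt's estimated size."""
    if expected_kb <= 0:
        raise ValueError("expected size must be positive")
    return 100.0 * received_kb / expected_kb


# BL-1 figures from the table above: ~85 KB expected, ~8 KB and ~32 KB received.
bl1_opus = round(delivery_ratio(8, 85), 1)
bl1_glm = round(delivery_ratio(32, 85), 1)
```

With the BL-1 figures this gives roughly 9.4% and 37.6%, in line with the ~9–38% range reported for that test. Ratios above 100%, as in the curl rows, indicate extra markup rather than extra content.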