Key Findings for Cascade’s Web Search Behavior - Raw

Test Workflow

Run python web_search_testing_framework.py --test {test ID} --track raw
Review terminal output
Copy the provided prompt instructing agent to retrieve the URL and return content exactly as received, saving output to results/raw/raw_output_{test_ID}.txt
Open a new Cascade session in Windsurf, paste the prompt into the chat window
Approve web fetch calls and terminal commands; cancel any run that loops or hangs
Run the verification script against the saved file; capture path compliance, file size, checksum, and truncation indicators
Log structured metadata as described in framework-reference.md
Ensure log results are saved to /results/cascade-raw/results.csv

Raw output file presence, path compliance, and content fidelity are tracked. Claiming a save without writing a file, referencing another agent’s file, and generating structurally accurate but semantically meaningless content are distinct failure modes; see the analysis in Friction: Raw.
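
The checks named in the workflow (path compliance, file size, checksum, truncation indicators) can be sketched as follows. This is a minimal illustration, not the framework's actual verification script: the function name `verify_raw_output`, the returned field names, and the marker strings in `TRUNCATION_MARKERS` are all assumptions made for the example.

```python
import hashlib
from pathlib import Path

# Hypothetical marker strings; the real verification script may look for others.
TRUNCATION_MARKERS = ("[truncated]", "[...]", "<!-- omitted -->")


def verify_raw_output(path, expected_dir):
    """Collect path compliance, size, MD5 checksum, and truncation hints for one file."""
    p = Path(path)
    if not p.is_file():
        # A claimed save with no file on disk is its own failure mode.
        return {"exists": False, "path_compliant": False}

    data = p.read_bytes()
    text = data.decode("utf-8", errors="replace")
    return {
        "exists": True,
        # Compliant only if the file sits directly in the expected directory.
        "path_compliant": p.resolve().parent == Path(expected_dir).resolve(),
        "size_bytes": len(data),
        "char_count": len(text),
        "md5": hashlib.md5(data).hexdigest(),
        "truncation_markers": [m for m in TRUNCATION_MARKERS if m in text],
    }
```

A missing file is reported separately from a non-compliant path, so a false completion is not mistaken for a save to the wrong directory.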

Platform Limit Summary

Limit	Observed
Hard Character Limit	None detected: output sizes ranged from 275 to 56,256,891 chars; ceilings agent-imposed and/or write-stage failures, not explicitly platform-imposed byte limits
Hard Token Limit	None detected: token counts ranged from 52 to 12,782,469; `BL-3`’s `SWE` error message `model's generation exceeded the maximum output token limit` first to suggest a write ceiling
Write Strategy	Agent reasoning, not capability, predicts output quality: - pipeline acceptance runs cluster within narrow size band per URL - deliberate elision - `Opus` only agent to ask questions mid-session - `curl` bypass - files pass verification without prose - silent failure - false completions, reuse, environment-degrading output
Content Selection Behavior	Two-stage chunked retrieval: mirrors interpreted, explicit tracks - most agents used `read_url_content` → `view_content_chunk`; `SC-2`’s `SWE` called `search_web` once as a fallback in response to redirect, found URL but didn’t return any content
Truncation Pattern	Write-stage asymmetry: `view_content_chunk` retrieval reliable across agents, chunk counts; most failure modes write-related: Python heredoc errors, token ceiling and/or under-reporting, false completions, file reuse
Redirect Chains	Size-influenced, behavior-dependent: all agents follow `EC-3` 5-hop redirect chain; `SC-2` single cross-domain redirect caused `read_url_content` halt, error message referenced destination
Auto-pagination	Spotty, agent-dependent: full retrieval common, but not guaranteed, even at low chunk counts; wide variance by model family; no agent paginated meaningfully at `SC-2`’s 1,026-chunk corpus; see Agentic Pagination Depth
`curl` Bypass	Consistent failure: agents that correctly diagnose that the Cascade pipeline returns Markdown-ish output, not raw HTML, often switch to `curl`; output files architecturally correct, but contain shells without prose
False Completion Claims	Distinct failure mode: `SWE` runs of `BL-1`, `OP-1`; `GPT-5.3-Codex` runs of `BL-3`, `SC-3`; `Gemini` run of `EC-6`; agents reported metrics of saved files that were never written
Cross-Agent File Reuse	Confirmed via MD5 checksum: `BL-2`, `BL-3`, `OP-1`, `EC-6` - once a plausible file exists in the workspace, agents may satisfy persistence requirement by reference rather than by writing
Path Compliance	Agent-dependent: prompt instructs saving to `raw/` which doesn’t exist; `BL-2`’s `GLM` created it, later agents referenced `cascade-raw/` and/or failed to save; cross-agent file visibility suggests worktree state is shared across Hybrid Arena slots, not isolated
URL Fragment Targeting	Behavioral, not architectural: chunk index exposes headers, fragment targets; `OP-1`’s `Grok-3` only agent to use it for navigation; 8 of 10 defaulted to full-doc retrieval; “EXACTLY as received” prompt may suppress targeting, making full retrieval seem like the safer interpretation
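
Cross-agent file reuse was confirmed by MD5 checksum. One way to surface candidates is to group output files by digest, as in the sketch below. The function name `find_reused_outputs` is hypothetical and the file layout is left to the caller. A shared digest is only a prompt to inspect each agent's tool calls, not proof of reuse on its own.

```python
import hashlib
from collections import defaultdict
from pathlib import Path


def find_reused_outputs(paths):
    """Group files by MD5 digest; return only digests shared by more than one file."""
    by_digest = defaultdict(list)
    for path in paths:
        digest = hashlib.md5(Path(path).read_bytes()).hexdigest()
        by_digest[digest].append(str(path))
    # Keep groups with two or more files: these are the reuse candidates.
    return {d: sorted(files) for d, files in by_digest.items() if len(files) > 1}
```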

Results Snapshot


Agent Selector	Hybrid Arena - 5 slots per run; `OP-1` includes two arena rounds
Agents Observed	`Claude Opus 4.7`, `Claude Sonnet 4.6`, `Gemini 3.1`, `GLM-5.1`, `GPT-5.3-Codex`, `GPT-5.4`, `GPT-5.5`, `Kimi K2.6`, `Minimax M2.5`, `SWE-1.6`, `xAI Grok-3`
Total Runs	66
Distinct URLs	11
Input Size Range	Estimated, rendered: ~2 KB - 256 KB; pipeline output, depending on retrieval method: 275 B - 56 MB
Truncation Events	Explicitly reported: 5 / 66; chunked architecture often acknowledged as lossy by design
Average Output Size	1,129,230 chars
Output Size Range	275 - 56,256,891 chars
Average Token Count	266,105 tokens
Token Count Range	52 - 12,782,469 tokens
Approval-gated Fetch	56 / 66 runs prompted for approval
Auto-pagination	48 runs
Failures	- `BL-1` `Gemini` task drift, token overflow - `EC-6` `Gemini` retrieval theater - `OP-1` most agents don’t isolate target section - `OP-4` retrieval success, but no clean output - `SC-2` redirect halt
URL Fragment Handling	- `OP-1`’s `Grok` only agent to intentionally target `#History` - `Minimax` analyzed `#History` incidentally via sampling - 8 defaulted to full-doc retrieval
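
For comparison across tracks, the run counts in the snapshot can be expressed as rates over the 66 total runs. The snippet below is plain arithmetic on the figures reported above; the dictionary keys are illustrative names, not columns from results.csv.

```python
TOTAL_RUNS = 66

# Counts taken from the snapshot table; key names are illustrative only.
COUNTS = {
    "truncation_reported": 5,
    "approval_gated_fetch": 56,
    "auto_paginated": 48,
}


def rate(count, total=TOTAL_RUNS):
    """Return count as a percentage of total, rounded to one decimal place."""
    return round(100 * count / total, 1)


RATES = {name: rate(count) for name, count in COUNTS.items()}
# RATES == {"truncation_reported": 7.6, "approval_gated_fetch": 84.8, "auto_paginated": 72.7}
```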

Agentic Pagination Depth

As observed in the interpreted and explicit tracks, agents consistently use read_url_content to fetch URLs, but whether they proceed to exhaust view_content_chunk varies substantially by agent, by chunk count, and, on the raw track only, by how they strategize the write task. Chunks-analyzed remains a primary behavioral variable in this dataset.

Full pagination appears more consistently throughout the raw track, suggesting the write task gives agents a reason to retrieve each chunk. Document size and structure still have an impact: OP-1, OP-4, and SC-2 produce the widest variance. SC-2 confirms abandonment is universal at 1,026 chunks regardless of agent family.
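
Chunks-analyzed can be reduced to a coverage fraction and a coarse label, as in the sketch below. The function name `pagination_coverage` and the labels are assumptions for illustration. A payload returned inline with no chunk index is treated here as its own category rather than as zero coverage.

```python
def pagination_coverage(chunks_viewed, total_chunks):
    """Return (coverage fraction, label) for one run.

    total_chunks == 0 is treated as an inline response that never needed
    view_content_chunk; this labeling is an assumption of the sketch.
    """
    if total_chunks == 0:
        return 1.0, "inline"
    # Cap at the total so repeated views of the same chunk don't exceed 100%.
    coverage = min(chunks_viewed, total_chunks) / total_chunks
    if coverage == 1.0:
        label = "full"
    elif coverage == 0:
        label = "none"
    else:
        label = "partial"
    return coverage, label
```

For example, `pagination_coverage(92, 92)` returns `(1.0, "full")` and `pagination_coverage(0, 1026)` returns `(0.0, "none")`.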

Agentic Write Performance

While the pagination depth map shows claimed retrieval, what agents reported reading, the write outcome map shows verified output: what ended up on disk, and in what form. The two maps together reveal the gap. EC-6’s Gemini run reads as 29% pagination coverage, but doesn’t map to a file; a content diff checker and MD5 checksum match confirmed it was all retrieval theater.

Tests where pagination depth is high but write outcomes are spotty (BL-3, OP-1, OP-4, SC-3) are where the read-write asymmetry is most visible. EC-3 is the only test with a clean success sweep, likely because the URL content didn’t require chunking at all. While EC-6 and SC-4 appeared to produce accurate output, many runs included false completions and file reuse.
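
The gap between claimed retrieval and verified output can be made explicit by classifying each run from what the agent claimed and what is on disk. The following is a rough sketch; the function name `classify_write_outcome`, its parameters, and the outcome labels are hypothetical, and the framework may record these differently.

```python
def classify_write_outcome(claimed_saved, file_exists, md5=None, earlier_md5s=()):
    """Label one run by comparing the agent's claim with the verified file state.

    earlier_md5s holds digests of files written by agents that ran before this one.
    """
    if claimed_saved and not file_exists:
        # Agent reported a saved file that was never written.
        return "false_completion"
    if file_exists and md5 is not None and md5 in earlier_md5s:
        # Identical bytes to an earlier agent's file: inspect tool calls before trusting it.
        return "possible_reuse"
    if file_exists:
        return "written"
    return "no_output"
```

The order of checks matters: a missing file is flagged first, so a run is never labeled as reuse on the strength of a digest the agent only reported.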

Truncation Analysis

#	Finding	Tests	Observed	Conclusion
1	No fixed character or token ceiling detected at retrieval stage	All tests	Output sizes ranged from 275 B to 56 MB; no run hit a tool-imposed retrieval byte ceiling	Ceilings self-imposed and/or write-stage failures: deliberate elision or environment degradation; retrieval pipeline has no confirmed upper bound
2	Output token ceiling as a write-stage failure mechanism	`BL-3`	`SWE` exceeded the agent’s output token limit explicitly mid-write, visible in the thought panel in real time; first direct observation of this ceiling across any Cascade track	Ceiling real but write-related, not retrieval-related. Prior tracks inferred it; `BL-3` observed it directly
3	Read-write asymmetry as dominant structural finding	`SC-3` `SC-4` `BL-3` `OP-4`	Most agents successfully retrieved all chunks in every test; write success was substantially lower across the same tests	Retrieval via `view_content_chunk` reliable; obstacle is reassembling, persisting at scale
4	Auto-pagination confirmed, but doesn’t predict output success	All tests	48 of 66 runs auto-paginated; 3 of 4 `BL-1` auto-paginating runs still failed to produce a valid output file	`H5`-yes across the dataset; behavior robust; doesn’t guarantee file persistence or content fidelity
5	Auto-pagination threshold ~1K chunks	`SC-2` `BL-2` `EC-3`	Most agents fully paginated at ≤ 50 chunks; no agent auto-paginated `SC-2`’s 1,026-chunk corpus	Threshold exists, exact boundary is unconfirmed, but likely in 100 - 1K chunks range
6	`curl` bypass produces semantically empty output	`BL-1` `BL-3` `SC-2` `SC-3` `EC-1` `EC-6`	Agents that correctly diagnose the pipeline as returning processed Markdown switch to `curl`; resulting files contain raw HTML or JS skeletons, architecturally correct, textually less meaningful	Pipeline abandonment is the dominant response to the fidelity instinct, producing files that pass verification while missing target content
7	Cross-agent file reuse confirmed at checksum level	`BL-2` `BL-3` `EC-6` `OP-1`	`Gemini`, `GLM` produced output files with identical MD5 checksum; `GLM` ran earlier, wrote first; `Gemini`’s thought panel narrated retrieval while making no corresponding tool calls	Path compliance independent variable; file presence at correct path doesn’t confirm independent retrieval
8	False completion claims as a distinct failure mode	`BL-1` `BL-3` `EC-6` `SC-3` `OP-1`	`Gemini`, `GPT-5.3-Codex`, `SWE` reported metrics, file paths for content that was never written	Confident assertions without uncertainty signal structurally different from spirals, early stops, but all three failure modes produce same outcome: no valid output file
9	Redirect halt behavior is confirmed as server-side, not tool-layer rewriting	`SC-2`	Three agents successfully called `read_url_content` a second time against the redirect destination surfaced in the error payload, received valid chunked responses, not silent pre-network URL substitution	`read_url_content` makes network call, receives redirect, halts rather than automatically follows; destination is actionable via follow-up call
10	Chunking pipeline size threshold	`EC-3`	5-hop redirect chain returned ~366 B JSON inline via `read_url_content` alone; `view_content_chunk` not called in any run	Small payloads return inline without triggering the two-fetch pipeline
11	URL fragment targeting is behavioral, not architectural	`OP-1`	8 of 10 agents retrieved all 92 chunks rather than targeting `#History` at chunk position 17; chunk index exposes the section header; `Grok` only agent to have used it for navigation	Fragment-targeting achievable, but absent by default; agents attending to output completeness may prioritize full-doc collection
12	Prompt size estimates act as a confound for fidelity-sensitive agents	`BL-1`, `OP-4`	`BL-1` ~85 KB prompt estimation architecturally unreachable; Cascade returns ~8–32 KB of filtered Markdown; `curl` returns ~508 KB of raw HTML; some agents spiraled or truncated trying to reach the target	Prompt estimation became irresolvable constraint rather than a verification guide
13	“EXACTLY as received” underspecified, resolved silently	`BL-1` `OP-4` `SC-3` `SC-4`	Most agents interpreted output format as chunk index, metadata wrappers, raw HTTP response via `curl`, or semantic Markdown without flagging ambiguity or asking for clarification	Instruction underspecification is reasoned-around across model families; only `Opus` identified the tradeoff in chat while strategizing a write plan
14	`search_web` not invoked as a retrieval mechanism	Most tests	Across 66 runs only `SC-2`’s `SWE` called `search_web` after retrieval failure, which only returned a URL, not content	`H4` untested across raw track; URL provision alone doesn’t trigger `search_web`
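
Because `curl` output can pass size and path checks while containing little prose, a rough prose-density check can help separate HTML or JS skeletons from content-bearing files. The sketch below uses Python's standard `html.parser`; the name `prose_ratio` and the set of skipped tags are assumptions, and any cutoff applied to the ratio would need calibration against known-good files.

```python
from html.parser import HTMLParser


class _VisibleText(HTMLParser):
    """Collect text outside script/style-like elements."""

    SKIP = {"script", "style", "noscript", "template"}

    def __init__(self):
        super().__init__()
        self._skip_depth = 0
        self.parts = []

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if not self._skip_depth:
            self.parts.append(data)


def prose_ratio(html):
    """Return visible-text characters divided by total characters (0.0 for empty input)."""
    if not html:
        return 0.0
    parser = _VisibleText()
    parser.feed(html)
    parser.close()
    # Collapse whitespace so indentation doesn't count as prose.
    visible = " ".join(" ".join(parser.parts).split())
    return len(visible) / len(html)
```

A single-page-app shell with only a script tag and an empty root element scores 0.0, while pipeline Markdown, which has no tags to strip, scores close to 1.0.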

Perception Gap

The write outcome map is the only verified signal in this dataset. Agent self-report, output size, and path compliance are insufficient on their own to distinguish genuine retrieval from curl bypass, deliberate elision, or retrieval theater; that requires cross-agent checksum comparison and thought panel inspection.

Test	Expected	Received	Delivery Ratio	Agent Characterization
`EC-6` Raw Markdown	~60 KB	~96 KB 3 agents, independent writes	~100%	“Complete, chunk assembly variation within ±858 chars; elision markers are source false positives”
`SC-4` Markdown Guide	~30 KB	`Sonnet` 30.44 KB `Minimax` 32.33 KB	~100%	“Complete, breadcrumb heading injection at chunk boundaries inflates Minimax output; 6 elision markers present, but may be tool-layer assembly artifacts”
`EC-3` Redirect JSON	~2 KB	366 B identical output	~100%	“Complete, 5-hop redirect chain followed cleanly; unique `X-Amzn-Trace-Id` per run confirms independent live requests”
`SC-1` Gemini API Docs	~40 KB	38–44 KB chunk cluster 10.25 KB via direct fetch	chunk ~97% direct ~60%	“Chunk cluster structurally identical across agents; direct fetch cleaner, but loses code blocks, navigation structure”
`SC-3` Wikipedia	~100 KB	`SWE` 69.5 KB pipeline `GLM`/`Gemini` 275–774 KB via `curl`	pipeline ~68%; curl ~270–760%	“Pipeline converts HTML tables to plain text lists, stripping column-row structure entirely; 255 table rows confirmed in raw HTML, 0 preserved in any Cascade-native output”
`EC-1` Gemini API SPA	~100 KB	`SWE`, `Opus` ~33–35 KB pipeline; `GPT`/`Gemini` ~118 KB via `curl`	pipeline ~32–34%; curl ~115%	“JavaScript SPA handled by Cascade pre-processing layer; `SWE`, `Opus` extracted semantic content, code blocks, agent descriptions; `curl` returned raw HTML skeleton regardless of agent”
`BL-1` MongoDB Docs	~85 KB	`Opus` ~8 KB `GLM` ~32 KB	~9–38%	“Pipeline output is 8–32 KB of filtered Markdown; raw HTML is 508 KB; no tool produces estimated size”
`BL-3` Tutorial	~250 KB	`Opus` ~7.4 KB `GLM` ~468 KB	~3% pipeline ~180% via `curl`	“Pipeline abandoned for `curl`; `curl` output Gatsby/React skeleton, no tutorial body content”
`SC-2` Anthropic Docs	~80 KB	`Kimi` 53.65 MB	Full docs corpus	“Scale outlier; VS Code tokenization, highlighting, scroll disabled on open; file exists, environment degraded”
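
The delivery ratio column compares received size against the prompt's size estimate. A minimal sketch of that comparison follows. The function names and the 0.9 and 1.1 band edges are hypothetical choices for illustration, both sizes are assumed to be in the same unit, and the table's own percentages may reflect a different unit convention or rounding.

```python
def delivery_ratio(received_bytes, expected_bytes):
    """Return received size as a fraction of the expected size."""
    if expected_bytes <= 0:
        raise ValueError("expected_bytes must be positive")
    return received_bytes / expected_bytes


def ratio_band(ratio, low=0.9, high=1.1):
    """Bucket a ratio as under-, on-, or over-delivery; band edges are illustrative."""
    if ratio < low:
        return "under"
    if ratio > high:
        return "over"
    return "on_target"
```

For example, roughly 7.4 KB received against a 250 KB estimate gives a ratio near 0.03 and lands in the "under" band. A ratio well above 1.0 is also worth flagging, since the `curl` rows show that oversized output can still lack the target content.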