Agent Ecosystem Testing

Key Findings for Cascade’s Web Search Behavior, Cascade-interpreted


Cascade-interpreted Test Workflow

  1. Run `python web_search_testing_framework.py --test {test ID} --track interpreted`
  2. Review the terminal output
  3. Copy the provided prompt asking the agent to report on fetch results: character count, token estimate, truncation status, content completeness, Markdown formatting integrity, and tool visibility
  4. Open a new Cascade session in Windsurf and paste the prompt into the chat window
  5. Approve web fetch calls, but decline any requests to run local scripts
  6. Capture the agent’s full response and observations as the interpreted finding; the gap between the agent’s self-report and actual fetch behavior is itself a finding
  7. Log structured metadata as described in framework-reference.md
  8. Ensure log results are saved to /results/cascade-interpreted/results.csv
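Steps 7–8 can be sketched as a small append-only logging helper. The column names below are assumptions for illustration; the authoritative schema lives in framework-reference.md.

```python
import csv
from pathlib import Path

# Hypothetical column schema -- the real field list is defined in framework-reference.md.
FIELDS = ["test_id", "run", "agent", "url", "output_chars",
          "token_estimate", "truncation_reported", "chunks_fetched"]

def log_result(row: dict, path: str = "results/cascade-interpreted/results.csv") -> None:
    """Append one interpreted-track observation, writing a header row on first use."""
    out = Path(path)
    out.parent.mkdir(parents=True, exist_ok=True)
    new_file = not out.exists()
    with out.open("a", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=FIELDS)
        if new_file:
            writer.writeheader()
        writer.writerow(row)
```

One row per run keeps the CSV directly comparable across tracks without any post-processing.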

The cascade-interpreted results document web fetch requests, but no explicit calls to Cascade’s tools: the pipeline is a two-stage chunked architecture without a single-call full-page retrieval path, restricted to read_url_content and view_content_chunk. See Friction: Interpreted for analysis.


Platform Limit Summary

| Limit | Observed |
| --- | --- |
| Hard Character Limit | None detected: read_url_content returns a chunked index, not raw content with a byte ceiling; output chars reflect agent chunk selection depth in a pipeline with no full-page retrieval path |
| Hard Token Limit | None detected: estimates ranged from ~82 to ~65,000 tokens; no run hit a fixed ceiling |
| Output Consistency | Agent-dependent: the same URL and prompt produces 0–106,000 chars depending on agent and chunk selection |
| Content Selection Behavior | Two-stage chunked retrieval: read_url_content returns a positional index with summaries; content requires sequential view_content_chunk calls per position |
| Truncation Pattern | Two independent truncation layers: agent chunk selection (most large-page content never fetched) and a per-chunk display ceiling (~2K chars visible per chunk, remainder hidden with a byte-count notice) |
| Redirect Chains | Consistent: a tested 5-level redirect chain returned inline without triggering the chunked pipeline |
| Self-reported Completeness | Inconsistent: agents with identical content report contradictory truncation assessments; disagreement tracks chunk selection depth, not actual content loss |
| Chunk Summary Population | URL-dependent: well-structured pages return populated summaries providing navigational signal; CSS-heavy pages or SPAs may return empty summaries, collapsing skimming into blind sampling |
| SPA Extraction | Lossy by design: the Go Colly static scraper delivers ~25–30% of raw HTML as extracted text; scripts, styles, and metadata are discarded before delivery |
| Prompt Injection Sensitivity | Agent-dependent: Claude Sonnet 4.6 triggered safety heuristics twice, refusing tool visibility reporting in one run and full prompt execution in another |
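The two truncation layers in the table can be made concrete with a small accounting sketch. The 2,048-char visible cap is an observed approximation, not a documented constant, and per-chunk sizes here are illustrative inputs.

```python
def hidden_bytes(chunk_sizes: list[int], fetched: set[int],
                 visible_cap: int = 2048) -> dict:
    """Split content loss into the two observed layers: chunks never fetched
    (agent selection) and bytes hidden inside fetched chunks (display cap)."""
    layer1 = sum(s for i, s in enumerate(chunk_sizes) if i not in fetched)
    layer2 = sum(max(0, s - visible_cap)
                 for i, s in enumerate(chunk_sizes) if i in fetched)
    total = sum(chunk_sizes)
    return {"unfetched": layer1, "display_hidden": layer2,
            "visible": total - layer1 - layer2}
```

The point of separating the layers is that only the first is visible to the agent; the second surfaces only as a byte-count notice.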

Results Details

| Metric | Value |
| --- | --- |
| Agent Selector | Hybrid Arena, 5 slots per run; one single-agent retry (EC-6 run 6) |
| Agents Observed | Claude Sonnet 4.6, Claude Opus 4.6, GPT-5.3-Codex, GPT-5.4, Kimi K2.5, SWE-1.5, SWE-1.6 |
| Total Runs | 61 |
| Distinct URLs | 11 |
| Input Size Range | ~2 KB – 256 KB |
| Truncation Events | 27 / 61 |
| Average Output Size | 37,600 chars |
| Average Token Count | 13,745 tokens |
| Approval-gated Fetch | 49 / 61 runs prompted for approval |
| Auto-pagination | 33 runs auto-paginated; 2 runs paginated when prompted |
| Complete Retrieval Failure | SC-2 (URL rewriting bug) |

Agentic Pagination Depth

Agents consistently use read_url_content to fetch URLs, but then reason about whether individual view_content_chunk calls are worth making, depending on the state of the chunk index. The number of chunks fetched is the primary behavioral variable in this dataset: it determines both output size and the truncation self-report.

The tractability threshold is visible across chunk counts: agents tend toward full retrieval at ≤14 chunks and toward sparse sampling at ≥50, with 33–38 chunks as the transition zone where model families diverge. Opus 4.6 and SWE-1.6 show the most consistent full-retrieval behavior, while GPT-5.3-Codex and Kimi K2.5 default to sparse sampling regardless of chunk count.
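These cutoffs can be captured in a small classifier. The thresholds are empirical observations from this dataset, not tool guarantees, and counts between the cutoffs are treated as one transition band.

```python
def predict_strategy(chunk_count: int) -> str:
    """Rough predictor of agent retrieval behavior by chunk-index size,
    using the empirical cutoffs observed in this dataset."""
    if chunk_count <= 14:
        return "full retrieval"   # tractable: agents tend to fetch every chunk
    if chunk_count >= 50:
        return "sparse sampling"  # intractable: most chunks never fetched
    return "transition zone"      # 33-38 chunks is where model families diverge
```

A predictor like this is per-family at best: Opus 4.6 and SWE-1.6 beat it on large pages, while GPT-5.3-Codex and Kimi K2.5 undershoot it on small ones.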


Truncation Analysis

| # | Finding | Tests | Observed | Conclusion |
| --- | --- | --- | --- | --- |
| 1 | read_url_content returns a chunk index | All tests | Requires view_content_chunk × N; no single-call full-page retrieval path | Output chars reflect chunks fetched, not a retrieval ceiling; variance is behavioral, not architectural |
| 2 | No fixed character or token ceiling detected | BL-1, BL-3, EC-6 | BL-1 Opus full retrieval estimated ~55,000–65,000 tokens; EC-6 SWE measured 58,947 chars with no cutoff; BL-3 Opus retrieved ~106,000 visible chars from 53 chunks | If a ceiling exists, no test hit it; the constraint is chunks fetched, not a tool-imposed byte limit |
| 3 | Per-chunk display truncation is a second independent layer | BL-1, SC-4, EC-6 | view_content_chunk hides the middle portion of large chunks with an explicit byte-count notice; SC-4 showed 3,736 bytes hidden across 4 positions; BL-1 Opus found 132 KB hidden across 51 of 54 chunks | Full chunk retrieval doesn’t guarantee full content delivery; internal truncation is invisible |
| 4 | Truncation self-report tracks chunks fetched, not content loss | SC-4, SC-3, BL-3 | SC-4: agents sampling 3 chunks reported no truncation; agents retrieving all 33 reported byte-level notices at 4 positions; same pattern repeated across tests | Self-reported truncation is accurate for chunks seen, not for the document; agents conflate retrieval completeness with content fidelity |
| 5 | Chunk summary population determines retrieval strategy quality | SC-1, SC-3, BL-3, OP-4 | SC-1: populated summaries enabled selective exclusion; BL-3 and OP-4: empty summaries collapsed skimming into blind sampling; SC-3: populated summaries present but unused above ~50 chunks | “Human skim” behavior requires populated summaries to function; empty summaries are URL-dependent, not a universal failure; populated summaries provide signal but don’t guarantee targeted retrieval |
| 6 | SPA sources produce an extraction ratio gap, not a truncation event | EC-1 | Go Colly static scraper delivers ~25–30% of raw HTML; the gap is architectural and consistent, not stochastic; H1 untestable on SPAs | 3 untestable conditions confirmed: BL-1 chunks fetched, EC-1 extraction ratio, EC-3 payload size |
| 7 | Routing bypasses the chunked pipeline for small payloads | EC-3 | read_url_content returned the 5-redirect-chain terminal response inline (~306–424 chars); view_content_chunk not called in any run | Chunked architecture has at least two modes; small payloads return inline without triggering the two-fetch process |
| 8 | search_web called once across 61 runs | SC-2, all others | SWE-1.6 invoked search_web once in SC-2 after two failures; zero calls elsewhere, including runs where agents expressed explicit uncertainty | URL provision alone doesn’t activate search_web; the single call was a fallback, not retrieval enrichment; H4 mostly untested |
| 9 | Prompt injection sensitivity produced two refusals | OP-4, EC-6 | OP-4: Sonnet declined tool reporting; EC-6: Sonnet refused full prompt execution, flagging tool names, URL, and framing as injection signals; a single retry succeeded | Safety heuristic sensitivity is prompt-dependent and not consistently reproducible |
| 10 | URL rewriting is a tool-layer bug, not agent behavior | SC-2 | read_url_content silently rewrites docs.anthropic.com/en/api/messages to llms-full.txt, redirecting to a dead endpoint; all runs failed | SC-2 hypotheses untestable until rewriting is resolved |
| 11 | URL fragment targeting is behavioral, not architectural | OP-1 | 4 of 5 runs treated the fragment URL as a generic retrieval target; SWE-1.5 was the only agent to confirm the #History chunk position; the chunk index supported targeting, but agents largely didn’t use it | Fragment targeting is achievable via the chunk index but absent by default; the miss rate is behavioral, not a tool limitation |
| 12 | Selective semantic processing applies to content, not shell | EC-1, BL-3, OP-4 | Tool strips HTML, converts prose to Markdown, and summarizes chunk index entries, but passes nav chrome, responsive-breakpoint duplicates, and pre/post-render DOM states through verbatim without de-duplication | Selective content transformation: page structure is extracted raw, with CSS noise, nav duplication, and SPA extraction artifacts across multiple tests |

Perception Gap

| Test | Expected | Received | Retrieval rate | Agent characterization |
| --- | --- | --- | --- | --- |
| EC-6 Raw Markdown | ~61 KB | 58,947 chars (SWE full retrieval) | ~97% | “No truncation, structurally complete — tool transforms content before delivery, exact char count unverifiable” |
| SC-4 Markdown Guide | ~31 KB | ~24,100–29,000 chars (full-retrieval runs) | ~78–94% | “Substantially complete but not byte-for-byte faithful — code examples flattened, tables stripped” |
| EC-1 SPA | ~100 KB | ~22,500–53,000 chars extracted | ~22–53% | “Extraction ratio, not truncation — tool delivers ~25–30% of raw HTML by design” |
| SC-3 Wikipedia | ~102 KB | Kimi ~6,777 chars received to Sonnet ~150,000 chars extrapolated | Varies by method | “No truncation, index complete” vs. “yes truncation, content withheld” |
| BL-3 Tutorial | ~256 KB | Opus ~106,000 chars visible (53 chunks) | ~41% visible; ~56% layer-2 loss | “Double-truncated — chunked then per-chunk display-capped; tutorial content inaccessible” |
| EC-3 Redirect JSON | ~2 KB | 306–424 chars | ~15–21% of expected | “Complete — JSON payload is the full response; size gap reflects per-request header variance” |
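The retrieval-rate column is simply received chars over expected source size. A sketch, treating the KB figures as approximate char counts (an assumption; the tool-side transformation makes exact source sizes unverifiable):

```python
def retrieval_rate(received_chars: int, expected_chars: int) -> float:
    """Share of the expected source that reached the agent, as a percentage.
    Note this conflates chunk selection with tool-side transformation loss,
    so it is not a truncation-ceiling measurement."""
    return 100.0 * received_chars / expected_chars
```

With the EC-6 figures (58,947 of ~61,000) this gives ~97%, and with the EC-3 figures (306–424 of ~2,000) roughly 15–21%, matching the table.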

Implication: output chars aren’t an appropriate truncation-ceiling metric for Cascade. They reflect chunk count and content transformation, i.e. how much the tools discard before delivery, and neither is observable from the interpreted track alone.