Key Findings for Cascade’s Web Search Behavior: Cascade-interpreted
Cascade-interpreted Test Workflow
- Run python web_search_testing_framework.py --test {test ID} --track interpreted
- Review the terminal output
- Copy the provided prompt asking the agent to report on fetch results: character count, token estimate, truncation status, content completeness, Markdown formatting integrity, and tool visibility
- Open a new Cascade session in Windsurf and paste the prompt into the chat window
- Approve web fetch calls, but skip requests for runs of local scripts
- Capture the agent’s full response and observations as the interpreted finding; the gap between the agent’s self-report and actual fetch behavior is a finding
- Log structured metadata as described in framework-reference.md
- Ensure log results are saved to /results/cascade-interpreted/results.csv
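The logging step above can be sketched as a CSV append. This is a minimal sketch only: the column names here are illustrative placeholders, not the actual schema from framework-reference.md.

```python
import csv
import os

RESULTS = "results/cascade-interpreted/results.csv"
# Illustrative columns; the real schema is defined in framework-reference.md.
FIELDS = ["test_id", "run", "agent", "output_chars", "truncated", "chunks_fetched"]

def log_run(row: dict) -> None:
    """Append one interpreted-track finding, writing a header on first use."""
    os.makedirs(os.path.dirname(RESULTS), exist_ok=True)
    new_file = not os.path.exists(RESULTS)
    with open(RESULTS, "a", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=FIELDS)
        if new_file:
            writer.writeheader()
        writer.writerow(row)

# Example row using figures reported later in this document (EC-6, SWE).
log_run({"test_id": "EC-6", "run": 6, "agent": "SWE-1.6",
         "output_chars": 58947, "truncated": False, "chunks_fetched": 54})
```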
The cascade-implicit results document web fetch requests but no explicit calls to Cascade’s tools: the pipeline is a two-stage chunked architecture without a single-call full-page retrieval path, restricted to read_url_content → view_content_chunk; see Friction: Interpreted for analysis.
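The two-stage restriction can be made concrete with a mock. This is a sketch with hypothetical return shapes; the real tools are Cascade-internal and their signatures are assumptions here.

```python
from dataclasses import dataclass

@dataclass
class ChunkRef:
    position: int
    summary: str  # may be empty on CSS-heavy pages and SPAs

# Hypothetical stand-ins for Cascade's internal tools.
def read_url_content(url: str) -> list[ChunkRef]:
    """Stage 1: returns a positional index with summaries, never full content."""
    return [ChunkRef(i, f"summary of chunk {i}") for i in range(54)]

def view_content_chunk(url: str, position: int) -> str:
    """Stage 2: one call per position; no single-call full-page path exists."""
    return f"content of chunk {position}"

# Full retrieval therefore costs 1 + N tool calls, which is why agent
# chunk-selection depth, not a byte ceiling, drives output size.
url = "https://example.com/page"
index = read_url_content(url)
page = "".join(view_content_chunk(url, c.position) for c in index)
```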
Platform Limit Summary
| Limit | Observed |
|---|---|
| Hard Character Limit | None detected: read_url_content returns a chunked index, not raw content with a byte ceiling; output chars reflect agent chunk selection depth from a pipeline that has no full-page retrieval path |
| Hard Token Limit | None detected: estimates ranged from ~82 to ~65,000 tokens; no run hit a fixed ceiling |
| Output Consistency | Agent-dependent: same URL and prompt produces 0–106,000 chars depending on agent and chunk selection |
| Content Selection Behavior | Two-stage chunked retrieval: read_url_content returns a positional index with summaries; content requires sequential view_content_chunk calls per position |
| Truncation Pattern | Two independent truncation layers: agent chunk selection (most large-page content is never fetched) and a per-chunk display ceiling (~2K chars visible per chunk, remainder hidden behind a byte-count notice) |
| Redirect Chains | Consistent: tested 5-level redirect chain; returned inline without triggering chunked pipeline |
| Self-reported Completeness | Inconsistent: agents with identical content report contradictory truncation assessments; disagreement tracks chunk selection depth, not actual content loss |
| Chunk Summary Population | URL-dependent: well-structured pages return populated summaries providing navigational signal; CSS-heavy pages or SPAs may return empty summaries, collapsing skimming into blind sampling |
| SPA extraction | Lossy by design: Go Colly static scraper delivers ~25–30% of raw HTML as extracted text; scripts, styles, and metadata discarded before delivery |
| Prompt Injection Sensitivity | Agent-dependent: Claude Sonnet 4.6 triggered safety heuristics twice, refusing tool visibility reporting in one run and full prompt execution in another |
Results Details
| Metric | Value |
|---|---|
| Agent Selector | Hybrid Arena - 5 slots per run; one single-agent retry (EC-6 run 6) |
| Agents Observed | Claude Sonnet 4.6, Claude Opus 4.6, GPT-5.3-Codex, GPT-5.4, Kimi K2.5, SWE-1.5, SWE-1.6 |
| Total Runs | 61 |
| Distinct URLs | 11 |
| Input Size Range | ~2 KB – 256 KB |
| Truncation Events | 27 / 61 |
| Average Output Size | 37,600 chars |
| Average Token Count | 13,745 tokens |
| Approval-gated Fetch | 49 / 61 runs prompted for approval |
| Auto-pagination | 33 runs auto-paginated; 2 runs paginated when prompted |
| Complete Retrieval Failure | SC-2 URL rewriting bug |
Agentic Pagination Depth
Agents consistently use read_url_content to fetch URLs, but then reason, based on the state of the chunk index, about whether individual view_content_chunk calls are worth making. The number of chunks fetched determines both output size and the truncation self-report, making it the primary behavioral variable in this dataset.
The tractability threshold is visible across chunk counts: agents tend toward full retrieval at chunk counts ≤14 and toward sparse sampling at ≥50, with 33–38 chunks as the transition zone where model families diverge. Opus 4.6 and SWE-1.6 show the most consistent full-retrieval behavior, while GPT-5.3-Codex and Kimi K2.5 default to sparse sampling regardless of chunk count.
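The observed thresholds can be stated as a simple classifier. This is a descriptive sketch of the dataset's pattern, not the agents' actual decision logic; the boundaries (14, 33–38, 50) are empirical, and the ranges between them were not pinned down by any run.

```python
def pagination_regime(chunk_count: int) -> str:
    """Classify likely retrieval depth from the empirical thresholds above."""
    if chunk_count <= 14:
        return "full-retrieval"    # most agents fetch every chunk
    if 33 <= chunk_count <= 38:
        return "transition"        # model families diverge here
    if chunk_count >= 50:
        return "sparse-sampling"   # most large-page content left unfetched
    return "unobserved"            # no runs characterized these ranges

# Example: BL-3's 53-chunk tutorial falls in the sparse-sampling regime
# for most agents, though Opus 4.6 retrieved all 53 chunks.
```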
Truncation Analysis
| # | Finding | Tests | Observed | Conclusion |
|---|---|---|---|---|
| 1 | read_url_content returns chunk index | All tests | Requires view_content_chunk × N; no single-call full-page retrieval path | Output chars reflect chunks fetched, not a retrieval ceiling; variance is behavioral, not architectural |
| 2 | No fixed character or token ceiling detected | BL-1, BL-3, EC-6 | BL-1 Opus full retrieval estimated ~55,000–65,000 tokens; EC-6 SWE measured 58,947 chars with no cutoff; BL-3 Opus retrieved ~106,000 visible chars from 53 chunks | If a ceiling exists, no test hit it; the constraint is chunks fetched, not a tool-imposed byte limit |
| 3 | Per-chunk display truncation is a second independent layer | BL-1, SC-4, EC-6 | view_content_chunk hides the middle portion of large chunks with an explicit byte-count notice; SC-4 showed 3,736 bytes hidden across 4 positions; BL-1 Opus found 132 KB hidden across 51 of 54 chunks | Full chunk retrieval doesn’t guarantee full content delivery; internal truncation is invisible |
| 4 | Truncation self-report tracks chunks fetched, not content loss | SC-4, SC-3, BL-3 | SC-4: agents sampling 3 chunks reported no truncation; agents retrieving all 33 reported byte-level notices at 4 positions; same pattern repeated across tests | Self-reported truncation is accurate for chunks seen but not for the document; agents conflate retrieval completeness with content fidelity |
| 5 | Chunk summary population determines retrieval strategy quality | SC-1, SC-3, BL-3, OP-4 | SC-1: populated summaries enabled selective exclusion; BL-3 and OP-4: empty summaries collapsed skimming into blind sampling; SC-3: populated summaries present but unused above ~50 chunks | “Human skim” behavior requires populated summaries to function; empty summaries are URL-dependent, not a universal failure; populated summaries provide signal but don’t guarantee targeted retrieval |
| 6 | SPA sources produce an extraction ratio gap, not a truncation event | EC-1 | Go Colly static scraper delivers ~25–30% of raw HTML; the gap is architectural and consistent, not stochastic; H1 untestable on SPAs | 3 untestable conditions confirmed: BL-1 chunks fetched, EC-1 extraction ratio, EC-3 payload size |
| 7 | Routing bypasses chunked pipeline for small payloads | EC-3 | read_url_content returned the 5-redirect-chain terminal response inline (~306–424 chars); view_content_chunk not called in any run | The chunked architecture has at least two modes; small payloads return inline without triggering the two-fetch process |
| 8 | search_web called once across 61 runs | SC-2, all others | SWE-1.6 invoked search_web once in SC-2 after two failures; zero calls elsewhere, including runs where agents expressed explicit uncertainty | URL provision alone doesn’t activate search_web; the single call was a fallback, not retrieval enrichment; H4 mostly untested |
| 9 | Prompt injection sensitivity produced two refusals | OP-4, EC-6 | OP-4: Sonnet declined tool reporting; EC-6: Sonnet refused full prompt execution, flagging tool names, URL, and framing as injection signals; a single retry succeeded | Safety heuristic sensitivity is prompt-dependent and not consistently reproducible |
| 10 | URL rewriting is a tool-layer bug, not agent behavior | SC-2 | read_url_content silently rewrites docs.anthropic.com/en/api/messages to llms-full.txt, redirecting to a dead endpoint; all runs failed | SC-2 hypotheses untestable until rewriting is resolved |
| 11 | URL fragment targeting is behavioral, not architectural | OP-1 | 4 of 5 runs treated the fragment URL as a generic retrieval target; SWE-1.5 was the only agent to confirm the #History chunk position; the chunk index supported targeting, but agents largely didn’t use it | Fragment targeting is achievable via the chunk index but absent by default; the miss rate is behavioral, not a tool limitation |
| 12 | Selective semantic processing applies to content, not shell | EC-1, BL-3, OP-4 | Tool strips HTML, converts prose to Markdown, and summarizes chunk index entries, but passes nav chrome, responsive breakpoint duplicates, and pre/post-render DOM states through verbatim without de-duplication | Selective content transformation; page structure extracted raw, with CSS noise, nav duplication, and SPA extraction artifacts across multiple tests |
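Finding 3's per-chunk display cap can be mimicked in a few lines. This is a sketch of the observed behavior only: the exact head/tail split and notice wording used by view_content_chunk are assumptions, and the ~2,000-char ceiling is the approximate figure reported above.

```python
def display_chunk(raw: str, ceiling: int = 2000) -> str:
    """Mimic the per-chunk display cap: keep the head and tail of a large
    chunk and hide the middle behind an explicit byte-count notice."""
    if len(raw) <= ceiling:
        return raw
    half = ceiling // 2
    hidden = len(raw) - ceiling
    return raw[:half] + f"\n[... {hidden} bytes hidden ...]\n" + raw[-half:]

# A 10,000-char chunk still surfaces only ~2,000 visible payload chars,
# which is why full chunk retrieval does not imply full content delivery.
shown = display_chunk("x" * 10_000)
```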
Perception Gap
| Test | Expected | Received | Retrieval rate | Agent characterization |
|---|---|---|---|---|
| EC-6 Raw Markdown | ~61 KB | 58,947 chars, SWE full retrieval | ~97% | “No truncation, structurally complete — tool transforms content before delivery, exact char count unverifiable” |
| SC-4 Markdown Guide | ~31 KB | ~24,100–29,000 chars; full retrieval runs | ~78–94% | “Substantially complete but not byte-for-byte faithful — code examples flattened, tables stripped” |
| EC-1 SPA | ~100 KB | ~22,500–53,000 chars extracted | ~22–53% | “Extraction ratio, not truncation — tool delivers ~25–30% of raw HTML by design” |
| SC-3 Wikipedia | ~102 KB | Kimi ~6,777 chars received to Sonnet ~150,000 chars extrapolated | Varies by method | “No truncation, index complete” vs. “yes truncation, content withheld” |
| BL-3 Tutorial | ~256 KB | Opus ~106,000 chars visible, 53 chunks | ~41% visible; ~56% layer-2 loss | “Double-truncated — chunked then per-chunk display-capped; tutorial content inaccessible” |
| EC-3 Redirect JSON | ~2 KB | 306–424 chars | ~15–21% of expected | “Complete — JSON payload is the full response; size gap reflects per-request header variance” |
Implication: output chars aren’t an appropriate truncation-ceiling metric for Cascade; they reflect chunk count and content transformation (how much the tools discard before delivery), and neither factor is observable from the interpreted track alone.
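The "Retrieval rate" column above is a simple ratio, and its limitation is the point of the implication: a low value can mean sparse chunk selection, extraction loss, or both. A minimal sketch, using figures from the table:

```python
def retrieval_rate(received_chars: int, expected_bytes: int) -> float:
    """Chars delivered vs. source size. This is a visibility metric, not a
    truncation-ceiling metric: it conflates chunk-selection depth with
    content transformation loss, and the two are not separable from the
    interpreted track alone."""
    return received_chars / expected_bytes

# EC-6: 58,947 chars against a ~61 KB source -> roughly 0.97 (near-complete)
# EC-1: ~22,500 chars against ~100 KB raw HTML -> roughly 0.23 (extraction, not truncation)
rate = retrieval_rate(58_947, 61_000)
```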
Agent Ecosystem Testing