Key Findings for Cascade’s Web Search Behavior, Explicit @web
The explicit track confirms that @web doesn’t meaningfully change the retrieval behavior the interpreted track
identified. Core findings hold: chunked architecture, no fixed ceiling, index-size suppression threshold,
CSS extraction failure, and self-reporting fidelity gaps. Extensions:
- @web is redundant with a URL
- Wider agent pool: Gemini 3.1, GLM-5.1, GPT-5.4, o3
- SC-2 chunk sampling data
- More precise fidelity failure characterization
- Colly in toolchain
- Tool wrapper preamble inflates character counts
Test Workflow
- Run python web_search_testing_framework.py --test {test ID} --track explicit
- Review terminal output
- Copy the provided prompt asking agent to report on fetch results: character count, token estimate, truncation status, content completeness, Markdown formatting integrity, and tool visibility
- Open a new Cascade session in Windsurf, paste the prompt into the chat window
- Approve web fetch calls, but skip requests for runs of local scripts
- Capture the agent’s full response and observations as the explicit finding; the gap between the agent’s self-report and actual fetch behavior is itself a finding
- Log structured metadata as described in framework-reference.md
- Ensure log results are saved to /results/cascade-explicit/results.csv
@web mapped to read_url_content in all runs; search_web called once; analysis in Friction: Explicit.
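
The logging step in the workflow above could be sketched as follows. This is a minimal illustration, not the framework's own code: the column names in `FIELDS` and the helper `append_result` are hypothetical, and the real schema is whatever framework-reference.md defines.

```python
import csv
from pathlib import Path

# Hypothetical column names for illustration only; the actual schema is
# defined in framework-reference.md and may differ.
FIELDS = [
    "test_id",
    "track",
    "agent",
    "char_count",
    "token_estimate",
    "truncated",
    "tools_called",
]


def append_result(csv_path, row):
    """Append one run to the results CSV, writing a header if the file is new."""
    path = Path(csv_path)
    path.parent.mkdir(parents=True, exist_ok=True)
    is_new = not path.exists() or path.stat().st_size == 0
    with path.open("a", newline="", encoding="utf-8") as handle:
        writer = csv.DictWriter(handle, fieldnames=FIELDS)
        if is_new:
            writer.writeheader()
        # Missing keys are written as empty cells; unknown keys raise ValueError.
        writer.writerow(row)
```

Calling `append_result("results/cascade-explicit/results.csv", {...})` once per agent response keeps each arena slot on its own row, which makes later cross-referencing against thought panels easier.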
Platform Limit Summary
| Limit | Observed |
|---|---|
| Hard Character Limit | None detected: read_url_content returns a chunked index, not raw content with a byte ceiling; output chars reflect agent chunk selection depth from a pipeline that has no full-page retrieval path |
| Hard Token Limit | None detected: estimates ranged from ~91-85,000 tokens; no run hit a fixed ceiling |
| Output Consistency | Agent-dependent, self-reported: same URL and prompt produce ~365–350,000 chars depending on agent and chunk selection; figures lack verification-script cross-reference; some values are retrieved content, others full-doc extrapolations |
| Content Selection Behavior | Two-stage chunked retrieval: read_url_content returns a positional index with summaries; content requires sequential view_content_chunk calls per position |
| Truncation Pattern | Two independent truncation layers: agent chunk selection, most large page content never fetched; per-chunk display ceiling variable by chunk, remainder hidden with a byte-count notice |
| Redirect Chains | Consistent: tested 5-level redirect chain; returned inline without triggering chunked pipeline |
| Self-reported Completeness | Inconsistent: agents with identical content report contradictory truncation assessments; disagreement tracks chunk selection depth, not actual content loss |
| Chunk Summary Population | URL-dependent: well-structured pages return populated summaries providing navigational signal; CSS-heavy or SPAs may return empty summaries collapsing skimming into blind sampling |
| SPA extraction | Lossy by design: Go Colly static scraper delivers ~20–35% of expected rendered page size as extracted text; EC-1 runs ~20,000–35,500 chars from ~100 KB source; HTML stripped, JavaScript not executed before delivery; gap invisible to agents evaluating completeness within the tool’s output frame |
| @web directive | Redundant for URL fetch: @web maps to read_url_content across all agents, all runs; search_web called once for SC-2’s GLM-5.1 run as verification attempt; didn’t return usable content |
| Agent Self-Reporting Fidelity | Unreliable: thought panels display collapsed passes and/or batch reads, re-reads not disclosed in output; fidelity failures documented across BL-3, OP-4, SC-1, SC-2, SC-4 |
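
The two truncation layers in the table can be illustrated with a toy model. This is a sketch under stated assumptions, not Cascade's implementation: `Chunk`, `displayed`, `simulate_fetch`, and `display_cap` are hypothetical names, and a single fixed cap is a simplification, since the observed per-chunk ceiling varied by chunk.

```python
from dataclasses import dataclass


@dataclass
class Chunk:
    position: int
    text: str


def displayed(chunk, display_cap):
    """Layer 2: keep the head and tail of an oversized chunk, hide the middle.

    Returns the visible text and the number of hidden characters.
    """
    if len(chunk.text) <= display_cap:
        return chunk.text, 0
    half = display_cap // 2
    head = chunk.text[:half]
    tail = chunk.text[len(chunk.text) - half:]
    return head + tail, len(chunk.text) - 2 * half


def simulate_fetch(chunks, selected_positions, display_cap):
    """Layer 1: only chunks the agent chooses to view contribute any output."""
    output_chars = 0
    hidden_in_fetched = 0
    never_fetched = 0
    for chunk in chunks:
        if chunk.position not in selected_positions:
            never_fetched += len(chunk.text)
            continue
        shown, hidden = displayed(chunk, display_cap)
        output_chars += len(shown)
        hidden_in_fetched += hidden
    return {
        "output_chars": output_chars,
        "hidden_in_fetched": hidden_in_fetched,
        "never_fetched": never_fetched,
    }
```

In this model an agent that views 3 of 10 chunks reports an output size that says nothing about the page's true length, which is why output chars are treated here as a measure of chunk selection depth rather than a retrieval ceiling.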
Results Details
| Agent Selector | Hybrid Arena - 5 slots per run; 10 BL-1 runs for prompt variant testing; 1 single-agent retry - EC-1 run 6 |
| Agents Observed | Claude Opus 4.7, Claude Sonnet 4.6, Gemini 3.1, GLM-5.1, GPT-5.3-Codex, GPT-5.4, Kimi K2.5, o3, SWE-1.6 |
| Total Runs | 66 |
| Distinct URLs | 11 |
| Input Size Range | ~2 KB - 256 KB |
| Truncation Events | 35 / 66 |
| Average Output Size | 43,441 chars |
| Average Token Count | 13,320 tokens |
| Approval-gated Fetch | 58 / 66 runs prompted for approval |
| Auto-pagination | 35 runs auto-paginated; 1 run paginated when prompted |
| Complete Retrieval Failure | EC-1 run 5 Claude Sonnet 4.6: infrastructure error; no tool call completed, no output; rerun succeeded |
| Content Targeting Failure | SC-2 all followed redirect to llms-full.txt, delivering all Anthropic docs instead of Messages API page, analysis in Friction: Explicit |
| URL Fragment Handling | OP-1 #History fragment not consistently honored; 3 of 5 agents reached targeted section |
Agentic Pagination Depth
As observed in the interpreted track, agents consistently use read_url_content to fetch URLs, but depending on
the state of the chunk index, they reason about whether individual calls to view_content_chunk are worth making.
Because it determines both output size and truncation self-report, chunks fetched remains the primary behavioral
variable in this dataset.
The tractability threshold is visible across tests: agents tend toward full retrieval on
chunk counts ≤14 and toward sparse sampling on larger ones ≥50, with 33–38 chunks as the transition
zone where model families diverge. SWE shows the most consistent full-retrieval behavior while
GLM, GPT and Kimi use sparse sampling more than any other technique.
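
The thresholds above can be written down as a small lookup. This is only a sketch of the observed tendencies: the function name `retrieval_regime` and its labels are hypothetical, and chunk counts between the reported bands (15–32 and 39–49) are marked uncharacterized rather than guessed at.

```python
def retrieval_regime(chunk_count: int) -> str:
    """Map a chunk index size to the retrieval tendency observed in this dataset.

    Bands come from the prose above; anything outside them is left unlabeled
    because the tests did not characterize it.
    """
    if chunk_count < 0:
        raise ValueError("chunk_count must be non-negative")
    if chunk_count <= 14:
        return "full retrieval likely"
    if 33 <= chunk_count <= 38:
        return "transition zone"
    if chunk_count >= 50:
        return "sparse sampling likely"
    return "uncharacterized"
```

The labels describe tendencies, not guarantees: agents in the same band still diverged by model family and by session.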
BL-3 and OP-4 use the same source URL across different test sessions, making them the only direct
same-source comparison in the dataset. Retrieval depth isn’t consistent across sessions: SWE retrieved
all chunks in OP-4 but only 1 chunk in BL-3; GPT-5.3-Codex retrieved 1 chunk in OP-4 and stopped
at the index in BL-3. GLM and Kimi were more stable across sessions, but no agent produced identical
retrieval depth in both runs. Retrieval depth may reflect a mix of verified and unverified session conditions,
such as context window state, tool response latency, arena slot position, Windsurf version, and chunk summary
content, as much as individual agent capability.
@web: Routing Hint, not Retrieval Modifier
Across all runs, no agent said the obvious thing: @web is redundant with a URL. Agents exhibited a wide range
of architectural understanding, from non-recognition to mechanical familiarity with the underlying parsing service,
without mentioning that in this context, calling it would produce no behavioral difference. This absence is the
H4 finding; analysis in Friction: Explicit.
SWE-1.6
No Product Knowledge: "I don't have a tool called @web. Was there a different tool you intended for me to use?"
GPT-5.3-Codex
Tool Mapping Only: "@web maps conceptually to web retrieval tooling - read_url_content under the hood."
Kimi K2.5
Tool Mapping Only: "@web is a user-facing alias that routes to read_url_content as the fetch mechanism."
Claude Sonnet 4.6
UI Awareness: "Not a distinct tool or named symbol - a user-facing shorthand in the Windsurf chat UI."
Claude Opus 4.6-7
Conditional Routing: "@web resolves to read_url_content for a direct URL fetch; would resolve to search_web for a query-based task."
GPT-5.4
Conditional Routing: "Direct URL → read_url_content; search query → search_web. No literal tool named @web."
GLM-5.1
Pipeline Depth: "A directive that triggers the chunking pipeline, not a separate tool, a user-facing shorthand."
o3
Pipeline Depth: "@web maps to read_url_content followed by view_content_chunk to stream the page content."
Gemini 3.1
Implementation Detail: "A macro proxying to read_url_content - runs the page through a parsing service that breaks the document into an AST-like structure chunked by headings."
Truncation Analysis
| # | Finding | Tests | Observed | Conclusion |
|---|---|---|---|---|
| 1 | read_url_content returns chunk index | All tests | Requires view_content_chunk × N; no single-call full-page retrieval path | Output chars reflect chunks fetched, not retrieval ceiling; variance behavioral, not architectural |
| 2 | No fixed character or token ceiling detected | BL-1, EC-6, SC-4 | BL-1 Opus estimated ~120,000–200,000 chars across 54 chunks; EC-6 SWE measured 61,921 chars with no cutoff; SC-4 o3 summed 34,200 chars across 33 chunks | If ceiling exists, no test hit it; constraint is chunks fetched, not a tool-imposed byte limit |
| 3 | Per-chunk display truncation is a second independent layer | BL-1, SC-4, OP-4 | view_content_chunk hides middle portion of large chunks with explicit byte-count notice; SC-4 SWE found 3,766 bytes hidden across 4 positions; OP-4 SWE found truncation warnings on all 53 chunks ranging 367–24,204 bytes | Full chunk retrieval doesn’t guarantee full content delivery; internal truncation invisible |
| 4 | Truncation self-report tracks chunks fetched, not content loss | SC-4, BL-3, SC-3 | Agents sampling 3 chunks reported no truncation; agents retrieving all 33 found byte-level notices at 4 positions; SWE and o3 full-retrieval contradiction on identical source | Self-reported truncation accurate for chunks seen, not accurate for doc; agents conflate retrieval completeness with content fidelity |
| 5 | Chunk summary population determines retrieval strategy quality | SC-1, SC-3, BL-3, OP-4 | SC-1 populated summaries enabled chrome exclusion before fetching; BL-3 and OP-4 empty summaries "/" collapsed skimming to blind sampling; SC-3 populated summaries present, but unused above ~50 chunks | Index-guided targeting requires populated summaries; populated summaries provide signal but don’t guarantee targeted retrieval |
| 6 | SPA sources produce an extraction ratio gap, not a truncation event | EC-1 | Go Colly static scraper delivers ~20–35% of raw HTML as extracted text; ~70 KB gap on a ~100 KB page, suggesting gap is architectural | Agents evaluate completeness within tool output frame, characterize gap as pipeline transformation, not content loss |
| 7 | Routing bypasses chunked pipeline for small payloads | EC-3 | read_url_content returned 5 redirect-chain terminal JSON response inline ~353–367 chars body; view_content_chunk not called in any run | Chunked architecture has at least two modes; small payloads return inline without triggering the two-fetch process |
| 8 | @web redundant with URLs | All tests | Most agents used toolchain identical to interpreted track: read_url_content → view_content_chunk | @web produced no behavior change; H4 confirmed redundant |
| 9 | @web conditional routing described consistently | SC-1, SC-2, SC-4, EC-6 | @web + URL → read_url_content; @web + query → search_web; GLM-5.1 invoked search_web once during SC-2 as an independent verification, but returned near-empty results | @web is a routing hint; search_web verification call distinct from @web-driven routing, didn’t produce usable output |
| 10 | Agent self-reporting fidelity is a systematic confound | SC-2, OP-4, BL-3, SC-1, SC-4 | Under-reporting; partial reporting; parallel execution opacity | Don’t treat agent self-report as complete record, add thought panel cross-reference; analysis in Friction: Explicit |
| 11 | Index size suppresses auto-pagination above ~50 chunks | SC-3, OP-1, BL-3, OP-4 | Maximum chunks retrieved: SC-3: 6/60; OP-1: 5/91; BL-3: 19/53; OP-4, SWE only: 53/53 | Tractability threshold is agent-dependent, index-size-sensitive; 33–38 chunks is transition zone where agents diverge |
| 12 | CSS-heavy sources produce content extraction failure, not truncation | BL-1, BL-3, OP-4 | MongoDB LeafyGreen CSS dominated chunk content across all runs on three distinct MongoDB URLs; tutorial body content absent across all 53 chunks in all BL-3 runs; “Structurally complete, semantically incomplete” | Page navigation and chrome recovered; article content inaccessible regardless of retrieval depth |
| 13 | Tool wrapper preamble inflates character counts | EC-3 | Claude Opus 4.7 identified and quoted the preamble string "Here is the content of the article at [URL]" prepended by read_url_content; explains cross-run variance on identical content | Variance between runs on identical content reflects tool wrapper inclusion rules, not retrieval differences |
| 14 | Colly identified as fetch backend | EC-3 | GLM-5.1 and Claude Opus 4.7 independently identified User-Agent: colly - https://github.com/gocolly/colly from httpbin’s echoed request headers | Windsurf uses scraping library; possibly explains CSS and/or SPA extraction gap |
| 15 | Per-chunk byte ceiling may reflect server-side rate limiting, not a tool gate | SC-2 | SWE, GLM hit 17,993-byte truncation at chunk 1008, mid-identifier inside BetaManagedAgentsModelRateLimitedError; likely HTTP response complete, but agent abstracted | Unresolvable from agent self-report, raw track required |
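
Findings 13 and 14 suggest a normalization step before comparing character counts across runs. The sketch below assumes the wrapper preamble occupies the first line of the tool output and that the body is an httpbin-style JSON echo with a `headers` object; both are assumptions for illustration, and `split_preamble` and `fetcher_user_agent` are hypothetical helper names.

```python
import json

# Prefix quoted in finding 13; the exact wrapper layout is an assumption.
PREAMBLE_PREFIX = "Here is the content of the article at"


def split_preamble(tool_output: str):
    """Return (preamble, body), treating a matching first line as the wrapper."""
    first_line, _, rest = tool_output.partition("\n")
    if first_line.startswith(PREAMBLE_PREFIX):
        return first_line, rest.lstrip("\n")
    return "", tool_output


def fetcher_user_agent(body: str):
    """Read the echoed User-Agent from an httpbin-style JSON body, if present."""
    payload = json.loads(body)
    return payload.get("headers", {}).get("User-Agent")


# Illustrative input only, not captured tool output.
sample = (
    "Here is the content of the article at https://example.com/get\n"
    '{"headers": {"User-Agent": "colly - https://github.com/gocolly/colly"}}'
)
preamble, body = split_preamble(sample)
body_chars = len(body)  # count the payload without the wrapper line
agent_string = fetcher_user_agent(body)
```

Counting `len(body)` instead of the raw tool output removes one source of cross-run variance on identical content, though it cannot correct for agents that count differently in the first place.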
Perception Gap
Output chars aren’t an appropriate truncation ceiling metric; they reflect chunk count, content transformation, and tool wrapper inclusion rules. None is observable from agent self-report alone.
| Test | Expected | Received | Delivery Ratio | Agent Characterization |
|---|---|---|---|---|
| EC-6 Raw Markdown | ~61 KB | 61,921 chars; SWE full retrieval | ~97% | “No truncation, structurally complete; tool transforms content before delivery” |
| SC-4 Markdown Guide | ~30 KB | ~15,500–34,200 chars; full retrieval runs | ~52–114%* | “Complete but contradicted; SWE found truncation at 4 positions; o3 found none on identical content” |
| EC-1 SPA | ~100 KB | ~20,100–35,500 chars extracted | ~20–36% | “Extraction ratio, not truncation, HTML stripped and JavaScript not executed before delivery” |
| SC-3 Wikipedia | ~100 KB | ~4,900 chars index to ~150,000 chars extrapolated | varies by method | “No truncation, index complete vs yes, 57/60 chunks never fetched” |
| BL-3 CSS Tutorial | ~256 KB | ~2,598–350,000 chars across runs | indeterminate | “Structurally complete, semantically incomplete; tutorial body absent across all chunks” |
| EC-3 Redirect JSON | ~2 KB | ~353–367 chars body | ~15–18% of expected | “Complete; JSON payload is the full response; size gap reflects redirect chain delivering terminal response only” |
\* SC-4 figures above 100% reflect counting method differences, not over-retrieval. SWE and o3 both retrieved all 33 chunks and reported estimates differing by ~18,700 chars; the largest same-source, same-depth variance in the dataset.
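
The delivery ratio column is received characters divided by expected size. A minimal sketch of that arithmetic follows, assuming 1 KB is treated as 1,000 characters; the function names are hypothetical. The SC-4 and EC-1 rows reproduce under that assumption, while rows whose expected size is only approximate won't necessarily match to the percent.

```python
def delivery_ratio(received_chars, expected_kb, chars_per_kb=1000):
    """Return received size as a percentage of the expected size."""
    expected_chars = expected_kb * chars_per_kb
    if expected_chars <= 0:
        raise ValueError("expected size must be positive")
    return 100.0 * received_chars / expected_chars


def ratio_range(low_chars, high_chars, expected_kb):
    """Round the low and high ends of a received-size range to whole percents."""
    return (
        round(delivery_ratio(low_chars, expected_kb)),
        round(delivery_ratio(high_chars, expected_kb)),
    )


sc4 = ratio_range(15_500, 34_200, 30)    # SC-4 Markdown Guide row
ec1 = ratio_range(20_100, 35_500, 100)   # EC-1 SPA row
```

A ratio above 100%, as at the top of the SC-4 range, signals a counting-method difference rather than extra content, so the ratio is best read alongside the agent characterization column and not on its own.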