Key Findings for Cascade’s Web Search Behavior - Explicit `@web`

The explicit track confirms that @web doesn’t meaningfully change the retrieval behavior the interpreted track identified. Core findings hold: chunked architecture, no fixed ceiling, index-size suppression threshold, CSS extraction failure, and self-reporting fidelity gaps. Extensions:

@web is redundant with a URL
Wider agent pool: Gemini 3.1, GLM-5.1, GPT-5.4, o3
SC-2 chunk sampling data
More precise fidelity failure characterization
Colly in toolchain
Tool wrapper preamble inflates character counts

Test Workflow

Run `python web_search_testing_framework.py --test {test ID} --track explicit`
Review terminal output
Copy the provided prompt asking the agent to report on fetch results: character count, token estimate, truncation status, content completeness, Markdown formatting integrity, and tool visibility
Open a new Cascade session in Windsurf, paste the prompt into the chat window
Approve web fetch calls, but skip requests to run local scripts
Capture the agent’s full response and observations as the explicit finding; the gap between the agent’s self-report and actual fetch behavior is itself a finding
Log structured metadata as described in framework-reference.md
Ensure log results are saved to /results/cascade-explicit/results.csv

`@web` mapped to `read_url_content` in all runs; `search_web` was called once. See Friction: Explicit for analysis.
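The logging step in the workflow above can be sketched as a small CSV appender. This is a minimal illustration, not the framework's own logger: the column names in `FIELDS` and the `append_result` helper are hypothetical, and the real schema is whatever framework-reference.md defines.

```python
import csv
from pathlib import Path

# Hypothetical column names; the real schema lives in framework-reference.md.
FIELDS = ["test_id", "track", "agent", "output_chars", "token_estimate", "truncated"]


def append_result(csv_path, row):
    """Append one run's metadata, writing the header if the file is new or empty."""
    path = Path(csv_path)
    path.parent.mkdir(parents=True, exist_ok=True)
    write_header = not path.exists() or path.stat().st_size == 0
    with path.open("a", newline="", encoding="utf-8") as handle:
        # Keys missing from `row` are written as empty cells (DictWriter default).
        writer = csv.DictWriter(handle, fieldnames=FIELDS)
        if write_header:
            writer.writeheader()
        writer.writerow(row)
```

Calling `append_result` with the results path from the workflow and a row such as `{"test_id": "EC-6", "track": "explicit", "agent": "SWE-1.6", "output_chars": 61921, "truncated": False}` would leave `token_estimate` blank rather than guessing a value, which keeps self-reported and unreported fields distinguishable.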

Platform Limit Summary

Limit	Observed
Hard Character Limit	None detected: `read_url_content` returns a chunked index, not raw content with a byte ceiling; output chars reflect agent chunk selection depth from a pipeline that has no full-page retrieval path
Hard Token Limit	None detected: estimates ranged from ~91 to ~85,000 tokens; no run hit a fixed ceiling
Output Consistency	Agent-dependent, self-reported: the same URL and prompt produce ~365–350,000 chars depending on agent and chunk selection; figures were not cross-referenced against a verification script; some values measure retrieved content, others are full-doc extrapolations
Content Selection Behavior	Two-stage chunked retrieval: `read_url_content` returns a positional index with summaries; content requires sequential `view_content_chunk` calls per position
Truncation Pattern	Two independent truncation layers: agent chunk selection, where most large-page content is never fetched; and a per-chunk display ceiling that varies by chunk, with the remainder hidden behind a byte-count notice
Redirect Chains	Consistent: tested 5-level redirect chain; returned inline without triggering chunked pipeline
Self-reported Completeness	Inconsistent: agents with identical content report contradictory truncation assessments; disagreement tracks chunk selection depth, not actual content loss
Chunk Summary Population	URL-dependent: well-structured pages return populated summaries providing navigational signal; CSS-heavy pages or SPAs may return empty summaries, collapsing skimming into blind sampling
SPA extraction	Lossy by design: Go Colly static scraper delivers ~20–35% of expected rendered page size as extracted text; `EC-1` runs ~20,000–35,500 chars from ~100 KB source; HTML stripped, JavaScript not executed before delivery; gap invisible to agents evaluating completeness within the tool’s output frame
`@web` directive	Redundant for URL fetch: `@web` maps to `read_url_content` across all agents, all runs; `search_web` called once for `SC-2`’s `GLM-5.1` run as verification attempt; didn’t return usable content
Agent Self-Reporting Fidelity	Unreliable: thought panels display collapsed passes and/or batch reads, re-reads not disclosed in output; fidelity failures documented across `BL-3`, `OP-4`, `SC-1`, `SC-2`, `SC-4`
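The two-stage retrieval and the two truncation layers summarized in this table can be modeled in a few lines. This is a toy simulation under stated assumptions: `read_url_content` and `view_content_chunk` are agent tool calls inside Cascade, not Python functions, so the stand-ins below, the 40-character summaries, the fixed `display_limit`, and the wording of the hidden-bytes notice are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class ChunkIndexEntry:
    position: int
    summary: str  # may be empty on CSS-heavy pages or SPAs


def read_url_content(page_chunks):
    """Stage 1 (stand-in): return a positional index with summaries, no body text."""
    return [ChunkIndexEntry(i, text[:40]) for i, text in enumerate(page_chunks)]


def view_content_chunk(page_chunks, position, display_limit=200):
    """Stage 2 (stand-in): return one chunk, hiding the middle of oversized ones."""
    text = page_chunks[position]
    if len(text) <= display_limit:
        return text
    half = display_limit // 2
    hidden = len(text) - 2 * half
    # Illustrative notice format; the real notice wording is not reproduced here.
    return f"{text[:half]}[{hidden} bytes hidden]{text[-half:]}"


def retrieve(page_chunks, positions):
    """Fetch the index, then only the chunk positions the agent chose."""
    index = read_url_content(page_chunks)
    fetched = [view_content_chunk(page_chunks, p) for p in positions]
    return index, fetched
```

In this model, output size depends on which `positions` are requested (layer one) and on how much of each chunk survives the display limit (layer two), which is why neither figure says anything about a fixed retrieval ceiling.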

Results Snapshot


Agent Selector	Hybrid Arena - 5 slots per run; 10 `BL-1` runs for prompt variant testing; 1 single-agent retry - `EC-1` run 6
Agents Observed	`Claude Opus 4.7`, `Claude Sonnet 4.6`, `Gemini 3.1`, `GLM-5.1`, `GPT-5.3-Codex`, `GPT-5.4`, `Kimi K2.5`, `o3`, `SWE-1.6`
Total Runs	66
Distinct URLs	11
Input Size Range	~2 KB - 256 KB
Truncation Events	35 / 66
Average Output Size	43,441 chars
Average Token Count	13,320 tokens
Approval-gated Fetch	58 / 66 runs prompted for approval
Auto-pagination	35 runs auto-paginated; 1 run paginated when prompted
Complete Retrieval Failure	`EC-1` run 5 `Claude Sonnet 4.6`: infrastructure error; no tool call completed, no output; rerun succeeded
Content Targeting Failure	All `SC-2` runs followed the redirect to `llms-full.txt`, delivering all Anthropic docs instead of the Messages API page; analysis in Friction: Explicit
URL Fragment Handling	`OP-1` `#History` fragment not consistently honored; 3 of 5 agents reached targeted section
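The snapshot figures above reduce to a few ratios. The sketch below only restates the table's own numbers; the `pct` helper is an illustrative name, and the chars-per-token value is a ratio of two averages, not an average of per-run ratios.

```python
# Figures copied from the Results Snapshot table.
TOTAL_RUNS = 66
TRUNCATION_EVENTS = 35
APPROVAL_GATED = 58
AVG_OUTPUT_CHARS = 43_441
AVG_TOKENS = 13_320


def pct(part, whole):
    """Percentage of `whole` represented by `part`, to one decimal place."""
    return round(100 * part / whole, 1)


truncation_rate = pct(TRUNCATION_EVENTS, TOTAL_RUNS)        # 53.0
approval_rate = pct(APPROVAL_GATED, TOTAL_RUNS)             # 87.9
chars_per_token = round(AVG_OUTPUT_CHARS / AVG_TOKENS, 2)   # 3.26
```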

Agentic Pagination Depth

As observed in the interpreted track, agents consistently use read_url_content to fetch URLs, then decide, based on the state of the chunk index, whether individual view_content_chunk calls are worth making. Because it determines both output size and the truncation self-report, the number of chunks fetched remains the primary behavioral variable in this dataset.

The tractability threshold is visible across tests: agents tend toward full retrieval on chunk counts ≤14 and toward sparse sampling on counts ≥50, with 33–38 chunks as the transition zone where model families diverge. SWE shows the most consistent full-retrieval behavior, while GLM, GPT, and Kimi use sparse sampling more than any other technique.
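The bands described above can be written down as a lookup. This is a descriptive sketch of the observed tendencies, not a predictor: the function name is hypothetical, and chunk counts that fall between the reported bands are labeled as uncharacterized rather than interpolated.

```python
def expected_strategy(chunk_count):
    """Map an index size to the retrieval tendency observed in this dataset."""
    if chunk_count <= 14:
        return "full retrieval"
    if 33 <= chunk_count <= 38:
        return "transition zone"
    if chunk_count >= 50:
        return "sparse sampling"
    # Counts of 15-32 and 39-49 fall outside the bands the tests characterize.
    return "not characterized"
```

For example, the 33-chunk `SC-4` index lands in the transition zone, while the 60-chunk `SC-3` and 91-chunk `OP-1` indexes land in the sparse-sampling band.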

BL-3 and OP-4 use the same source URL across different test sessions, making them the only direct same-source comparison in the dataset. Retrieval depth isn’t consistent across sessions: SWE retrieved all chunks in OP-4 but only 1 chunk in BL-3; GPT-5.3-Codex retrieved 1 chunk in OP-4 and stopped at the index in BL-3. GLM and Kimi were more stable across sessions, but no agent produced identical retrieval depth in both runs. Retrieval depth may reflect a mix of verified and unverified session conditions such as context window state, tool response latency, arena slot position, Windsurf version, and chunk summary content - as much as individual agent capability.

`@web`: Routing Hint, not Retrieval Modifier

Across all runs, no agent said the obvious thing: @web is redundant with a URL. Agents exhibited a wide range of architectural understanding, from non-recognition to mechanical familiarity with the underlying parsing service, but none mentioned that in this context invoking @web would produce no behavioral difference. This absence is the H4 finding; analysis in Friction: Explicit.
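The conditional routing that some agents described (a URL goes to `read_url_content`, a query goes to `search_web`) can be modeled as a one-branch dispatcher. This is a sketch of the agents' self-described behavior, not Windsurf's implementation; the `route_web_directive` name and the URL test are assumptions.

```python
from urllib.parse import urlparse


def route_web_directive(argument):
    """Return the tool name that agents described @web resolving to."""
    parsed = urlparse(argument.strip())
    # Assumption: anything with an http(s) scheme and a host counts as a URL.
    if parsed.scheme in ("http", "https") and parsed.netloc:
        return "read_url_content"
    return "search_web"
```

Under this model, every test in the explicit track takes the first branch because each prompt already carries a URL, which is the sense in which the directive is redundant here.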

SWE-1.6

No Product Knowledge

"I don't have a tool called @web. Was there a different tool you intended for me to use?"

GPT-5.3-Codex

Tool Mapping Only

"@web maps conceptually to web retrieval tooling - read_url_content under the hood."

Kimi K2.5

Tool Mapping Only

"@web is a user-facing alias that routes to read_url_content as the fetch mechanism."

Claude Sonnet 4.6

UI Awareness

"Not a distinct tool or named symbol - a user-facing shorthand in the Windsurf chat UI."

Claude Opus 4.6-7

Conditional Routing

"@web resolves to read_url_content for a direct URL fetch; would resolve to search_web for a query-based task."

GPT-5.4

Conditional Routing

"Direct URL → read_url_content; search query → search_web. No literal tool named @web."

GLM-5.1

Pipeline Depth

"A directive that triggers the chunking pipeline, not a separate tool, a user-facing shorthand."

Pipeline Depth

"@web maps to read_url_content followed by view_content_chunk to stream the page content."

Gemini 3.1

Implementation Detail

"A macro proxying to read_url_content - runs the page through a parsing service that breaks the document into an AST-like structure chunked by headings."

Truncation Analysis

#	Finding	Tests	Observed	Conclusion
1	`read_url_content` returns chunk index	All tests	Requires `view_content_chunk` × N; no single-call full-page retrieval path	Output chars reflect chunks fetched, not retrieval ceiling; variance behavioral, not architectural
2	No fixed character or token ceiling detected	`BL-1` `EC-6` `SC-4`	`BL-1` `Opus` estimated ~120,000–200,000 chars across 54 chunks; `EC-6` `SWE` measured 61,921 chars with no cutoff; `SC-4` `o3` summed 34,200 chars across 33 chunks	If ceiling exists, no test hit it; constraint is chunks fetched, not a tool-imposed byte limit
3	Per-chunk display truncation is a second independent layer	`BL-1` `SC-4` `OP-4`	`view_content_chunk` hides middle portion of large chunks with explicit byte-count notice; `SC-4` `SWE` found 3,766 bytes hidden across 4 positions; `OP-4` `SWE` found truncation warnings on all 53 chunks ranging 367–24,204 bytes	Full chunk retrieval doesn’t guarantee full content delivery; internal truncation invisible
4	Truncation self-report tracks chunks fetched, not content loss	`SC-4` `BL-3` `SC-3`	Agents sampling 3 chunks reported no truncation; agents retrieving all 33 found byte-level notices at 4 positions; `SWE` and `o3` full-retrieval contradiction on identical source	Self-reported truncation accurate for chunks seen, not accurate for doc; agents conflate retrieval completeness with content fidelity
5	Chunk summary population determines retrieval strategy quality	`SC-1` `SC-3` `BL-3` `OP-4`	`SC-1` populated summaries enabled chrome exclusion before fetching; `BL-3` and `OP-4` empty summaries `"/"` collapsed skimming to blind sampling; `SC-3` populated summaries present, but unused above ~50 chunks	Index-guided targeting requires populated summaries; populated summaries provide signal but don’t guarantee targeted retrieval
6	SPA sources produce an extraction ratio gap, not a truncation event	`EC-1`	Go Colly static scraper delivers ~20–35% of raw HTML as extracted text; ~70 KB gap on a ~100 KB page, suggesting gap is architectural	Agents evaluate completeness within tool output frame, characterize gap as pipeline transformation, not content loss
7	Routing bypasses chunked pipeline for small payloads	`EC-3`	`read_url_content` returned the terminal JSON response of the 5-level redirect chain inline, ~353–367 chars of body; `view_content_chunk` not called in any run	Chunked architecture has at least two modes; small payloads return inline without triggering the two-fetch process
8	`@web` redundant with URLs	All tests	Most agents used toolchain identical to interpreted track: `read_url_content` → `view_content_chunk`	`@web` produced no behavior change; `H4` confirmed redundant
9	`@web` conditional routing described consistently	`SC-1` `SC-2` `SC-4` `EC-6`	`@web` + URL → `read_url_content`; `@web` + query → `search_web`; `GLM-5.1` invoked `search_web` once during `SC-2` as an independent verification, but returned near-empty results	`@web` is a routing hint; `search_web` verification call distinct from `@web`-driven routing, didn’t produce usable output
10	Agent self-reporting fidelity is a systematic confound	`SC-2` `OP-4` `BL-3` `SC-1` `SC-4`	Under-reporting; partial reporting; parallel execution opacity	Don’t treat agent self-report as complete record, add thought panel cross-reference; analysis in Friction: Explicit
11	Index size suppresses auto-pagination above ~50 chunks	`SC-3` `OP-1` `BL-3` `OP-4`	Maximum chunks retrieved: `SC-3`: 6/60, `OP-1`: 5/91, `BL-3`: 19/53, `OP-4`, `SWE` only: 53/53	Tractability threshold is agent-dependent, index-size-sensitive; 33–38 chunks is transition zone where agents diverge
12	CSS-heavy sources produce content extraction failure, not truncation	`BL-1` `BL-3` `OP-4`	MongoDB LeafyGreen CSS dominated chunk content across all runs on three distinct MongoDB URLs; tutorial body content absent across all 53 chunks in all `BL-3` runs; “Structurally complete, semantically incomplete”	Page navigation and chrome recovered; article content inaccessible regardless of retrieval depth
13	Tool wrapper preamble inflates character counts	`EC-3`	`Claude Opus 4.7` identified and quoted the preamble string `"Here is the content of the article at [URL]"` prepended by `read_url_content`; explains cross-run variance on identical content	Variance between runs on identical content reflects tool wrapper inclusion rules, not retrieval differences
14	Colly identified as fetch backend	`EC-3`	`GLM-5.1` and `Claude Opus 4.7` independently identified `User-Agent: colly — https://github.com/gocolly/colly` from `httpbin`’s echoed request headers	Windsurf uses scraping library; possibly explains CSS and/or SPA extraction gap
15	Per-chunk byte ceiling may reflect server-side rate limiting, not a tool gate	`SC-2`	`SWE`, `GLM` hit 17,993-byte truncation at chunk 1008, mid-identifier inside `BetaManagedAgentsModelRateLimitedError`; likely HTTP response complete, but agent abstracted	Unresolvable from agent self-report, raw track required
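Finding 11's chunk counts read more easily as coverage percentages. The sketch below uses only the maxima reported in that row; the dictionary and helper names are illustrative.

```python
# (max chunks retrieved, chunks in index), copied from finding 11.
MAX_CHUNKS_RETRIEVED = {
    "SC-3": (6, 60),
    "OP-1": (5, 91),
    "BL-3": (19, 53),
    "OP-4": (53, 53),  # SWE only
}


def coverage(fetched, total):
    """Percentage of the chunk index that was actually fetched."""
    return round(100 * fetched / total, 1)


coverage_by_test = {
    test: coverage(fetched, total)
    for test, (fetched, total) in MAX_CHUNKS_RETRIEVED.items()
}
# {'SC-3': 10.0, 'OP-1': 5.5, 'BL-3': 35.8, 'OP-4': 100.0}
```

Even the best-case runs on the larger indexes leave most of the page unfetched, with the single `OP-4` full retrieval as the exception.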

Perception Gap

Output chars aren’t an appropriate truncation ceiling metric; they reflect chunk count, content transformation, and tool wrapper inclusion rules. None is observable from agent self-report alone.
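One of those confounds, the tool wrapper preamble, can be normalized before counting. This is a minimal sketch that assumes the preamble quoted in finding 13 occupies the first line of the output and is followed by a newline; the helper name is hypothetical.

```python
# Prefix of the preamble string quoted by Claude Opus 4.7 in EC-3.
PREAMBLE_PREFIX = "Here is the content of the article at "


def body_char_count(tool_output):
    """Count characters after dropping a leading preamble line, if present."""
    if tool_output.startswith(PREAMBLE_PREFIX):
        # Assumption: the preamble ends at the first newline.
        _, _, rest = tool_output.partition("\n")
        return len(rest.lstrip("\n"))
    return len(tool_output)
```

Counting with and without the wrapper on the same payload makes it possible to tell whether two runs differ in retrieved content or only in what the agent chose to include in its count.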

Test	Expected	Received	Delivery Ratio	Agent Characterization
`EC-6` Raw Markdown	~61 KB	61,921 chars `SWE` full retrieval	~97%	“No truncation, structurally complete; tool transforms content before delivery”
`SC-4` Markdown Guide	~30 KB	~15,500–34,200 chars; full retrieval runs	~52–114%*	“Complete but contradicted; `SWE` found truncation at 4 positions; `o3` found none on identical content”
`EC-1` SPA	~100 KB	~20,100–35,500 chars extracted	~20–36%	“Extraction ratio, not truncation, HTML stripped and JavaScript not executed before delivery”
`SC-3` Wikipedia	~100 KB	~4,900 chars index to ~150,000 chars extrapolated	varies by method	“No truncation, index complete vs yes, 57/60 chunks never fetched”
`BL-3` CSS Tutorial	~256 KB	~2,598–350,000 chars across runs	indeterminate	“Structurally complete, semantically incomplete; tutorial body absent across all chunks”
`EC-3` Redirect JSON	~2 KB	~353–367 chars body	~15–18% of expected	“Complete; JSON payload is the full response; size gap reflects redirect chain delivering terminal response only”

* SC-4 figures above 100% reflect counting method differences, not over-retrieval. SWE and o3 both retrieved all 33 chunks and reported estimates differing by ~18,700 chars; the largest same-source, same-depth variance in the dataset.
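The delivery ratios in the table are received characters divided by expected size. The sketch below reproduces the `SC-4` and `EC-1` rows under the assumption that 1 KB is treated as roughly 1,000 characters; that convention and the helper name are not stated in the source.

```python
def delivery_ratio(received_chars, expected_kb):
    """Received size as a whole-number percentage of the expected size."""
    expected_chars = expected_kb * 1000  # assumption: 1 KB ~ 1,000 chars
    return round(100 * received_chars / expected_chars)


sc4_range = (delivery_ratio(15_500, 30), delivery_ratio(34_200, 30))    # (52, 114)
ec1_range = (delivery_ratio(20_100, 100), delivery_ratio(35_500, 100))  # (20, 36)
```

A ratio above 100%, as in the upper `SC-4` bound, is a signal that the numerator and denominator were measured differently, not that more content arrived than exists.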
