Agent Ecosystem Testing

Key Findings for Cascade’s Web Search Behavior, Explicit @web

The explicit track confirms that @web doesn’t meaningfully change the retrieval behavior the interpreted track identified. Core findings hold: chunked architecture, no fixed ceiling, index-size suppression threshold, CSS extraction failure, and self-reporting fidelity gaps. Extensions:

  • @web is redundant with a URL
  • Wider agent pool: Gemini 3.1, GLM-5.1, GPT-5.4, o3
  • SC-2 chunk sampling data
  • More precise fidelity failure characterization
  • Colly in toolchain
  • Tool wrapper preamble inflates character counts

Test Workflow

  1. Run python web_search_testing_framework.py --test {test ID} --track explicit
  2. Review terminal output
  3. Copy the provided prompt asking the agent to report on fetch results: character count, token estimate, truncation status, content completeness, Markdown formatting integrity, and tool visibility
  4. Open a new Cascade session in Windsurf, paste the prompt into the chat window
  5. Approve web fetch calls, but skip requests to run local scripts
  6. Capture the agent’s full response and observations as the explicit finding; any gap between the agent’s self-report and actual fetch behavior is itself a finding
  7. Log structured metadata as described in framework-reference.md
  8. Ensure log results are saved to /results/cascade-explicit/results.csv
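
The logging step can be sketched as a small helper that appends one row per run. This is a minimal illustration only: the column names in `FIELDS` are hypothetical placeholders, since the actual schema is defined in framework-reference.md, and `append_result` is not a function of the testing framework.

```python
import csv
from pathlib import Path

# Hypothetical column names for illustration; the real schema is the one
# described in framework-reference.md.
FIELDS = [
    "test_id",
    "track",
    "agent",
    "output_chars",
    "token_estimate",
    "truncated",
    "chunks_fetched",
    "chunks_total",
]


def append_result(csv_path, row):
    """Append one run to the results CSV, writing the header on first use."""
    path = Path(csv_path)
    path.parent.mkdir(parents=True, exist_ok=True)
    write_header = not path.exists() or path.stat().st_size == 0
    with path.open("a", newline="", encoding="utf-8") as handle:
        # Missing keys are written as empty cells; unknown keys raise ValueError.
        writer = csv.DictWriter(handle, fieldnames=FIELDS)
        if write_header:
            writer.writeheader()
        writer.writerow(row)
```

A call such as `append_result("results/cascade-explicit/results.csv", {"test_id": "SC-4", "track": "explicit", "agent": "o3", "output_chars": 34200, "chunks_fetched": 33, "chunks_total": 33})` would record the SC-4 o3 run, assuming the path is relative to the repository root. Leaving unmeasured fields empty, rather than filling them with agent extrapolations, keeps self-reported and verified values distinguishable.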

@web mapped to read_url_content in all runs, and search_web was called once; see the analysis in Friction: Explicit.


Platform Limit Summary

Limit | Observed
Hard Character Limit | None detected: read_url_content returns a chunked index, not raw content with a byte ceiling; output chars reflect agent chunk selection depth from a pipeline that has no full-page retrieval path
Hard Token Limit | None detected: estimates ranged from ~91–85,000 tokens; no run hit a fixed ceiling
Output Consistency | Agent-dependent, self-reported: same URL and prompt produces ~365–350,000 chars depending on agent and chunk selection; figures without verification script cross-reference; some values retrieved content, others full-doc extrapolations
Content Selection Behavior | Two-stage chunked retrieval: read_url_content returns a positional index with summaries; content requires sequential view_content_chunk calls per position
Truncation Pattern | Two independent truncation layers: agent chunk selection, most large page content never fetched; per-chunk display ceiling variable by chunk, remainder hidden with a byte-count notice
Redirect Chains | Consistent: tested 5-level redirect chain; returned inline without triggering chunked pipeline
Self-reported Completeness | Inconsistent: agents with identical content report contradictory truncation assessments; disagreement tracks chunk selection depth, not actual content loss
Chunk Summary Population | URL-dependent: well-structured pages return populated summaries providing navigational signal; CSS-heavy or SPAs may return empty summaries collapsing skimming into blind sampling
SPA extraction | Lossy by design: Go Colly static scraper delivers ~20–35% of expected rendered page size as extracted text; EC-1 runs ~20,000–35,500 chars from ~100 KB source; HTML stripped, JavaScript not executed before delivery; gap invisible to agents evaluating completeness within the tool’s output frame
@web directive | Redundant for URL fetch: @web maps to read_url_content across all agents, all runs; search_web called once for SC-2’s GLM-5.1 run as verification attempt; didn’t return usable content
Agent Self-Reporting Fidelity | Unreliable: thought panels display collapsed passes and/or batch reads, re-reads not disclosed in output; fidelity failures documented across BL-3, OP-4, SC-1, SC-2, SC-4

Results Details

   
Agent Selector | Hybrid Arena - 5 slots per run; 10 BL-1 runs for prompt variant testing; 1 single-agent retry - EC-1 run 6
Agents Observed | Claude Opus 4.7, Claude Sonnet 4.6, Gemini 3.1, GLM-5.1, GPT-5.3-Codex, GPT-5.4, Kimi K2.5, o3, SWE-1.6
Total Runs | 66
Distinct URLs | 11
Input Size Range | ~2 KB–256 KB
Truncation Events | 35 / 66
Average Output Size | 43,441 chars
Average Token Count | 13,320 tokens
Approval-gated Fetch | 58 / 66 runs prompted for approval
Auto-pagination | 35 runs auto-paginated; 1 run paginated when prompted
Complete Retrieval Failure | EC-1 run 5 Claude Sonnet 4.6: infrastructure error; no tool call completed, no output; rerun succeeded
Content Targeting Failure | SC-2 all followed redirect to llms-full.txt, delivering all Anthropic docs instead of Messages API page, analysis in Friction: Explicit
URL Fragment Handling | OP-1 #History fragment not consistently honored; 3 of 5 agents reached targeted section
Agentic Pagination Depth

As observed in the interpreted track, agents consistently use read_url_content to fetch URLs but then, depending on the state of the chunk index, reason about whether individual view_content_chunk calls are worth making. Because it determines both output size and the truncation self-report, the number of chunks fetched remains the primary behavioral variable in this dataset.
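
The two-stage flow can be modeled with a toy simulation to show why reported output size follows the number of chunks fetched. The functions below are hypothetical stand-ins that only borrow the tool names; they are not Cascade's implementation, and the page data is synthetic.

```python
from dataclasses import dataclass


@dataclass
class ChunkIndex:
    """Stand-in for a stage-one result: positions and summaries, no body text."""
    url: str
    summaries: list


def read_url_content(url, pages):
    """Hypothetical stub: return only the positional index for a URL."""
    return ChunkIndex(url=url, summaries=[chunk[:40] for chunk in pages[url]])


def view_content_chunk(url, position, pages):
    """Hypothetical stub: return the body of a single chunk."""
    return pages[url][position]


def retrieve(url, positions, pages):
    """Fetch the index, then only the chunk positions the agent chose."""
    index = read_url_content(url, pages)
    valid = [p for p in positions if 0 <= p < len(index.summaries)]
    body = "".join(view_content_chunk(url, p, pages) for p in valid)
    return {
        "chunks_total": len(index.summaries),
        "chunks_fetched": len(valid),
        "output_chars": len(body),
    }


# Synthetic page: 33 chunks of 1,000 characters each.
pages = {"https://example.com/doc": ["x" * 1000] * 33}
print(retrieve("https://example.com/doc", range(33), pages))    # full retrieval
print(retrieve("https://example.com/doc", [0, 16, 32], pages))  # sparse sampling
```

On the same synthetic source, the full pass reports 33,000 characters and the three-chunk sample reports 3,000, with no ceiling involved in either case.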

The tractability threshold is visible across tests: agents tend toward full retrieval on indexes of ≤14 chunks and toward sparse sampling on indexes of ≥50 chunks, with 33–38 chunks as the transition zone where model families diverge. SWE shows the most consistent full-retrieval behavior, while GLM, GPT, and Kimi use sparse sampling more than any other technique.
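
The observed bands can be expressed as a small lookup, which makes the uncharacterized ranges explicit. This is a sketch of the reported tendencies, not a rule any agent is known to implement; the function name and the "not characterized" label for the 15–32 and 39–49 ranges are assumptions, since the text reports no behavior for those index sizes.

```python
def expected_retrieval_regime(chunk_count):
    """Map a chunk-index size to the retrieval tendency reported for it."""
    if chunk_count < 0:
        raise ValueError("chunk_count must be non-negative")
    if chunk_count <= 14:
        return "full retrieval"
    if 33 <= chunk_count <= 38:
        return "transition zone"
    if chunk_count >= 50:
        return "sparse sampling"
    # 15-32 and 39-49 chunks: no tendency is reported for these sizes.
    return "not characterized"


for size in (14, 33, 53, 60, 91):
    print(size, expected_retrieval_regime(size))
```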

BL-3 and OP-4 use the same source URL across different test sessions, making them the only direct same-source comparison in the dataset. Retrieval depth isn’t consistent across sessions: SWE retrieved all chunks in OP-4 but only 1 chunk in BL-3; GPT-5.3-Codex retrieved 1 chunk in OP-4 and stopped at the index in BL-3. GLM and Kimi were more stable across sessions, but no agent produced identical retrieval depth in both runs. Retrieval depth may therefore reflect a mix of verified and unverified session conditions, such as context window state, tool response latency, arena slot position, Windsurf version, and chunk summary content, as much as it reflects individual agent capability.


@web: Routing Hint, not Retrieval Modifier

Across all runs, no agent said the obvious thing: @web is redundant with a URL. Agents exhibited a wide range of architectural understanding, from non-recognition to mechanical familiarity with the underlying parsing service, but none mentioned that in this context, invoking @web would produce no behavioral difference. This absence is the H4 finding; analysis in Friction: Explicit.

SWE-1.6

No Product Knowledge

"I don't have a tool called @web. Was there a different tool you intended for me to use?"

GPT-5.3-Codex

Tool Mapping Only

"@web maps conceptually to web retrieval tooling - read_url_content under the hood."

Kimi K2.5

Tool Mapping Only

"@web is a user-facing alias that routes to read_url_content as the fetch mechanism."

Claude Sonnet 4.6

UI Awareness

"Not a distinct tool or named symbol - a user-facing shorthand in the Windsurf chat UI."

Claude Opus 4.6-7

Conditional Routing

"@web resolves to read_url_content for a direct URL fetch; would resolve to search_web for a query-based task."

GPT-5.4

Conditional Routing

"Direct URL → read_url_content; search query → search_web. No literal tool named @web."

GLM-5.1

Pipeline Depth

"A directive that triggers the chunking pipeline, not a separate tool, a user-facing shorthand."

o3

Pipeline Depth

"@web maps to read_url_content followed by view_content_chunk to stream the page content."

Gemini 3.1

Implementation Detail

"A macro proxying to read_url_content - runs the page through a parsing service that breaks the document into an AST-like structure chunked by headings."

Truncation Analysis

# | Finding | Tests | Observed | Conclusion
1 | read_url_content returns chunk index | All tests | Requires view_content_chunk × N; no single-call full-page retrieval path | Output chars reflect chunks fetched, not retrieval ceiling; variance behavioral, not architectural
2 | No fixed character or token ceiling detected | BL-1, EC-6, SC-4 | BL-1 Opus estimated ~120,000–200,000 chars across 54 chunks; EC-6 SWE measured 61,921 chars with no cutoff; SC-4 o3 summed 34,200 chars across 33 chunks | If ceiling exists, no test hit it; constraint is chunks fetched, not a tool-imposed byte limit
3 | Per-chunk display truncation is a second independent layer | BL-1, SC-4, OP-4 | view_content_chunk hides middle portion of large chunks with explicit byte-count notice; SC-4 SWE found 3,766 bytes hidden across 4 positions; OP-4 SWE found truncation warnings on all 53 chunks ranging 367–24,204 bytes | Full chunk retrieval doesn’t guarantee full content delivery; internal truncation invisible
4 | Truncation self-report tracks chunks fetched, not content loss | SC-4, BL-3, SC-3 | Agents sampling 3 chunks reported no truncation; agents retrieving all 33 found byte-level notices at 4 positions; SWE and o3 full-retrieval contradiction on identical source | Self-reported truncation accurate for chunks seen, not accurate for doc; agents conflate retrieval completeness with content fidelity
5 | Chunk summary population determines retrieval strategy quality | SC-1, SC-3, BL-3, OP-4 | SC-1 populated summaries enabled chrome exclusion before fetching; BL-3 and OP-4 empty summaries "/" collapsed skimming to blind sampling; SC-3 populated summaries present, but unused above ~50 chunks | Index-guided targeting requires populated summaries; populated summaries provide signal but don’t guarantee targeted retrieval
6 | SPA sources produce an extraction ratio gap, not a truncation event | EC-1 | Go Colly static scraper delivers ~20–35% of raw HTML as extracted text; ~70 KB gap on a ~100 KB page, suggesting gap is architectural | Agents evaluate completeness within tool output frame, characterize gap as pipeline transformation, not content loss
7 | Routing bypasses chunked pipeline for small payloads | EC-3 | read_url_content returned 5 redirect-chain terminal JSON response inline ~353–367 chars body; view_content_chunk not called in any run | Chunked architecture has at least two modes; small payloads return inline without triggering the two-fetch process
8 | @web redundant with URLs | All tests | Most agents used toolchain identical to interpreted track: read_url_content → view_content_chunk | @web produced no behavior change; H4 confirmed redundant
9 | @web conditional routing described consistently | SC-1, SC-2, SC-4, EC-6 | @web + URL → read_url_content; @web + query → search_web; GLM-5.1 invoked search_web once during SC-2 as an independent verification, but returned near-empty results | @web is a routing hint; search_web verification call distinct from @web-driven routing, didn’t produce usable output
10 | Agent self-reporting fidelity is a systematic confound | SC-2, OP-4, BL-3, SC-1, SC-4 | Under-reporting; partial reporting; parallel execution opacity | Don’t treat agent self-report as complete record, add thought panel cross-reference; analysis in Friction: Explicit
11 | Index size suppresses auto-pagination above ~50 chunks | SC-3, OP-1, BL-3, OP-4 | Maximum chunks retrieved: SC-3: 6/60, OP-1: 5/91, BL-3: 19/53, OP-4, SWE only: 53/53 | Tractability threshold is agent-dependent, index-size-sensitive; 33–38 chunks is transition zone where agents diverge
12 | CSS-heavy sources produce content extraction failure, not truncation | BL-1, BL-3, OP-4 | MongoDB LeafyGreen CSS dominated chunk content across all runs on three distinct MongoDB URLs; tutorial body content absent across all 53 chunks in all BL-3 runs; “Structurally complete, semantically incomplete” | Page navigation and chrome recovered; article content inaccessible regardless of retrieval depth
13 | Tool wrapper preamble inflates character counts | EC-3 | Claude Opus 4.7 identified and quoted the preamble string "Here is the content of the article at [URL]" prepended by read_url_content; explains cross-run variance on identical content | Variance between runs on identical content reflects tool wrapper inclusion rules, not retrieval differences
14 | Colly identified as fetch backend | EC-3 | GLM-5.1 and Claude Opus 4.7 independently identified User-Agent: colly - https://github.com/gocolly/colly from httpbin’s echoed request headers | Windsurf uses scraping library; possibly explains CSS and/or SPA extraction gap
15 | Per-chunk byte ceiling may reflect server-side rate limiting, not a tool gate | SC-2 | SWE, GLM hit 17,993-byte truncation at chunk 1008, mid-identifier inside BetaManagedAgentsModelRateLimitedError; likely HTTP response complete, but agent abstracted | Unresolvable from agent self-report, raw track required
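
One way to control for the wrapper preamble in finding 13 is to strip it before counting characters. The sketch below assumes the preamble appears at the very start of the tool output, that the URL contains no whitespace, and that it may be followed by a colon and whitespace; only the quoted phrase itself comes from the test runs, and the exact delimiter is not documented. The function names are hypothetical.

```python
import re

# Phrase quoted in EC-3; the trailing ":" and whitespace handling are assumptions.
PREAMBLE = re.compile(r"^Here is the content of the article at \S+:?\s*")


def strip_tool_preamble(text):
    """Remove the read_url_content wrapper preamble if it leads the text."""
    return PREAMBLE.sub("", text, count=1)


def body_char_count(text):
    """Character count of the payload with the wrapper preamble excluded."""
    return len(strip_tool_preamble(text))


wrapped = 'Here is the content of the article at https://httpbin.org/get:\n{"ok": true}'
bare = '{"ok": true}'
print(body_char_count(wrapped), body_char_count(bare))  # 12 12
```

Counting the body this way makes runs comparable regardless of whether an agent included the wrapper text in its self-reported total.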

Perception Gap

Output chars aren’t an appropriate truncation ceiling metric; they reflect chunk count, content transformation, and tool wrapper inclusion rules. None of these is observable from agent self-report alone.

Test | Expected | Received | Delivery Ratio | Agent Characterization
EC-6 Raw Markdown | ~61 KB | 61,921 chars, SWE full retrieval | ~97% | “No truncation, structurally complete; tool transforms content before delivery”
SC-4 Markdown Guide | ~30 KB | ~15,500–34,200 chars; full retrieval runs | ~52–114%* | “Complete but contradicted; SWE found truncation at 4 positions; o3 found none on identical content”
EC-1 SPA | ~100 KB | ~20,100–35,500 chars extracted | ~20–36% | “Extraction ratio, not truncation, HTML stripped and JavaScript not executed before delivery”
SC-3 Wikipedia | ~100 KB | ~4,900 chars index to ~150,000 chars extrapolated | varies by method | “No truncation, index complete vs yes, 57/60 chunks never fetched”
BL-3 CSS Tutorial | ~256 KB | ~2,598–350,000 chars across runs | indeterminate | “Structurally complete, semantically incomplete; tutorial body absent across all chunks”
EC-3 Redirect JSON | ~2 KB | ~353–367 chars body | ~15–18% of expected | “Complete; JSON payload is the full response; size gap reflects redirect chain delivering terminal response only”

* SC-4 figures above 100% reflect counting method differences, not over-retrieval. SWE and o3 both retrieved all 33 chunks and reported estimates differing by ~18,700 chars; the largest same-source, same-depth variance in the dataset.
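
The delivery ratios can be reproduced with simple arithmetic. The sketch below assumes 1 KB = 1,000 bytes and one character per byte, neither of which is stated in the test data; under those assumptions the SC-4 and EC-1 figures come out as published, while rows with rounder expected sizes may differ by a few points.

```python
def delivery_ratio_pct(received_chars, expected_kb, bytes_per_kb=1000):
    """Received characters as a percentage of the expected source size."""
    return 100 * received_chars / (expected_kb * bytes_per_kb)


# SC-4: ~30 KB expected, full-retrieval runs reported ~15,500 and ~34,200 chars.
sc4_low = round(delivery_ratio_pct(15500, 30), 1)
sc4_high = round(delivery_ratio_pct(34200, 30), 1)
sc4_spread = 34200 - 15500  # gap between the two full-retrieval estimates

# EC-1: ~100 KB expected, ~20,100 to ~35,500 chars extracted.
ec1_low = round(delivery_ratio_pct(20100, 100), 1)
ec1_high = round(delivery_ratio_pct(35500, 100), 1)

print(sc4_low, sc4_high, sc4_spread)  # 51.7 114.0 18700
print(ec1_low, ec1_high)              # 20.1 35.5
```

A ratio above 100% here only means the received character count exceeded the assumed expected size, which is consistent with the footnote's point that counting methods differ between agents.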