Agent Ecosystem Testing

Friction: this note documents roadblocks encountered while refining the testing methodology


Topic Guide - Explicit Track


Agent as Unreliable Methodology Validator

The explicit track’s built-in conflict, instructing agents to “use @web” while providing a specific URL to fetch, mirrors a failure mode documented during Cursor testing, but with an important behavioral difference.

During Cursor testing, @Web was invoked under the false premise that it triggered web fetching and content retrieval. Cursor never flagged the misuse. It executed tests, generated reports, and logged @Web in tool usage output, actively reinforcing the misconception rather than correcting it. The methodology was built on a misunderstanding of the mechanism being tested, and the agent’s behavior made that misunderstanding invisible until external review.

Cascade’s behavior diverges. All BL-1 explicit track agents used read_url_content and, when followed up with “why aren’t you using @web like the prompt requests?”, each offered an explanation of the directive-task distinction. SWE-1.6’s response is representative:

“I don’t have a tool called @web in my available toolset. Since you provided a specific URL and asked to fetch it directly, I used read_url_content. search_web would have been used if you had asked something like ‘find MongoDB documentation about create events’ without providing the URL.”

The correction required a follow-up prompt. No agent flagged the conflict proactively during the initial test run — they silently resolved it by tool-appropriateness reasoning and reported their behavior without noting the discrepancy with the prompt instruction. This is better than Cursor’s behavior, which reinforced the misuse, but it still required human follow-up to identify the correction. A user who didn’t ask would have received seemingly compliant output.

The structural problem is the same across both platforms: agents don’t reliably flag when prompts conflict with tool semantics. Cascade’s agents corrected when asked; Cursor’s agents reinforced when not asked. Neither volunteered the correction proactively during the run where the misuse was present. The irony in the Cascade case is sharpest with SWE, Cognition’s own model: its architectural knowledge is present, but its product knowledge appears absent.

Methodology Implication

The follow-up probe — “why aren’t you using @web like the prompt requests?” — should be treated as a required methodology step for the explicit track, not an optional clarification. It surfaces correction behavior that the initial run conceals, and the variance in how agents explain the conflict is itself a data point about tool visibility across models.

The broader implication from the Cursor parallel holds here: testing frameworks built with agents require external validation. An agent that silently resolves a directive-task conflict by choosing the correct tool looks identical, in its output, to an agent that followed the prompt correctly. Without the follow-up, the distinction is invisible — and the tool reporting from the initial run is unreliable as a record of what the prompt intended to measure.


Agent Self-Reporting Fidelity

During SC-2, both SWE-1.6 and Kimi K2.5 reported reading a handful of chunk positions (four and five, respectively) while their thought panels revealed different behavior: both were collapsing multiple chunks per call, up to 12 at a time, without disclosing this in their output.

SWE reported reading positions 0, 1, 500, and 1008. Kimi reported reading positions 0, 100, 500, 1000, and 1008. In both cases, the thought panel showed batch reads of up to 12 chunks being collapsed into single reported reads. Neither agent noted the discrepancy between stated and actual retrieval behavior.

The inverse pattern appeared in OP-4. GLM-5.1 retrieved 14 chunks nonlinearly across the index, all visible in the thought panel, but didn’t describe why. Kimi also sampled nonlinearly across roughly 10 positions and didn’t report a chunk count or position list at all. In both cases the sampling pattern was only recoverable from the thought panel, not the output. These represent distinct fidelity failures operating in opposite directions:

| | SC-2 | OP-4 | BL-3 | SC-1 | SC-4 |
| --- | --- | --- | --- | --- | --- |
| Direction | Under-reporting | Partial reporting | Under-reporting | Execution opacity | Under-reporting |
| What’s Hidden | Batch reads collapsed into single stated positions | Nonlinear position sequence, sometimes chunk count | Exact call count; GLM ~32% undercount | Parallel execution mechanism | Re-reads collapsed within named entries; true call count exceeds reported |
| Report Appearance | Looks like minimal, precise sampling | Looks like complete tool visibility | Looks like accurate sequential sampling | Looks like sequential retrieval | Looks like complete full retrieval |

SWE-1.6’s OP-4 run called view_content_chunk 53 times, the only run across both tracks to disclose full retrieval depth accurately. This variation creates a specific problem for tool visibility data. The tool usage tables and position lists that appear in most agent reports, and that the testing framework uses as behavioral records, may not accurately reflect what the agent actually retrieved. An agent that reads 12 chunks and reports 1 looks identical in its output to an agent that read 1. Without access to the thought panel, the distinction is invisible.

BL-3 results also displayed this type of discrepancy. GLM reported 13 view_content_chunk calls while the thought panel showed 19 passes, a ~32% undercount. Kimi again omitted chunk count and position list entirely from its output, consistent with its OP-4 behavior. Claude Sonnet 4.6 was the exception: its report accurately reflected 6 sampled chunks with explicit position labels and reasoning about the sampling strategy, matching thought panel behavior. Alongside SWE’s OP-4 full call disclosure, these two accurate cases used opposite strategies: exhaustive retrieval and deliberate sparse sampling. What they share is that the sampling rationale was made explicit in output, and not left to the thought panel.
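The ~32% figure is just the gap between observed and reported call counts; a minimal sketch of the calculation (the function name is ours, not part of any testing framework):

```python
def undercount_rate(reported_calls: int, observed_calls: int) -> float:
    """Fraction of actual tool calls missing from the agent's self-report.

    reported_calls: call count stated in the agent's output
    observed_calls: call count visible in the thought panel
    """
    if observed_calls == 0:
        return 0.0
    return (observed_calls - reported_calls) / observed_calls

# GLM's BL-3 run: 13 reported calls vs. 19 passes in the thought panel
rate = undercount_rate(13, 19)
print(f"{rate:.0%}")  # prints 32%
```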

SC-1 introduced parallel execution opacity. All runs showed named chunk labels collapsing into unlabeled passes mid-sequence in the thought panel, batching calls in a way that loses per-position granularity. GPT-5.3-Codex is the only agent to name the parallel wrapper in its output, functions.multi_tool_use.parallel, while Claude Opus 4.7, GLM, Kimi, and SWE all displayed the same collapsing without disclosing the mechanism. Whether parallel execution is occurring in all runs or only in GPT’s is unresolvable from output alone. GPT’s tendency to expose implementation details that other agents abstract is itself a fidelity signal. The same underlying operation may produce different levels of visibility depending on the agentic output’s abstraction level.

During SC-4, o3’s thought panel showed 33 Analyzed content entries, consistent with its reported full retrieval of all 33 chunks. However, several of those entries contained collapsed multi-chunk reads (“2 chunks,” “3 chunks,” “4 chunks”), implying more reading than 33 discrete calls. This differs from SC-2’s pattern, where the stated positions were clearly insufficient. Here the under-reporting is visible only because the collapsed entries suggest re-reads or re-analyses were omitted from the self-report.

Methodology Implication

Don’t treat agent reports as complete records of retrieval behavior. Examine thought panels if accessible and cross-reference stated positions against call sequences. The gap takes different forms (collapsed batch reads, omitted position sequences, missing chunk counts), but the effect is the same: self-reported tool visibility is an unreliable record of retrieval.
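The cross-check described above can be sketched as follows, assuming thought-panel call logs can be transcribed into per-call position lists (the data shapes and function name are hypothetical, not part of any existing tooling):

```python
def reconcile_report(stated_positions, panel_calls):
    """Compare positions an agent claims to have read against the
    per-call position lists recovered from the thought panel.

    stated_positions: chunk positions listed in the agent's output
    panel_calls: one list of chunk positions per observed tool call
                 (batch reads appear as multi-position calls)
    """
    actual = {pos for call in panel_calls for pos in call}
    stated = set(stated_positions)
    return {
        "stated_calls": len(stated_positions),
        "actual_calls": len(panel_calls),
        "unreported_positions": sorted(actual - stated),
        "batched_calls": sum(1 for call in panel_calls if len(call) > 1),
    }

# SC-2-style pattern: an agent reports four positions while the panel
# shows a batch read of 12 chunks collapsed into one entry.
report = reconcile_report(
    stated_positions=[0, 1, 500, 1008],
    panel_calls=[[0], list(range(1, 13)), [500], [1008]],
)
print(report["unreported_positions"])  # positions read but never stated
```

Any non-empty `unreported_positions` list marks a run whose self-report cannot be trusted as a retrieval record.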


Agentic Inaction

During BL-1 and BL-2 runs, agents recognized conditions that warranted an available tool call, didn’t use it, and didn’t explain why. The resulting reports read as complete, yet they contained unresolved questions the agents themselves identified but didn’t pursue.

All BL-1 runs used a prompt with @web and a URL. No agent invoked search_web. Each agent completed the test and reported results as if the instruction had been satisfied. Only with a follow-up question did agents explain the inaction; a user who didn’t ask would have no indication that the directive was ignored.

search_web was called only once across 61 runs on the interpreted track. SWE-1.6, testing SC-2, used it as a fallback after two failed fetch attempts; this suggests the tool is available but not used proactively, even when agents express uncertainty. Throughout baseline testing on the explicit track, agents explained that when used with a URL, @web maps to the chunking pipeline of read_url_content. If using @web with a URL is technically unnecessary, shouldn’t the same agent that can explain the directive’s details also explain its misuse? During multiple platform testing tracks, agents have over-delivered confident-looking output (tool tables, measurement breakdowns, architectural explanations) while silently skipping the one thing that might actually help a user: “you’re using this wrong.”

Across all BL-2 runs, agents flagged uncertainty about source completeness, noting that the ~3.5–6.5 KB of retrieved content was well below the ~20 KB expectation, or that the mixed HTML/Markdown format was “likely a scraping/rendering artifact.” GLM-5.1 stated that “the original page likely contains more sections and that content was either not fetched or not chunked.” No agent used search_web to verify. No agent noted that it was choosing not to verify.

GLM-5.1 during SC-2 is the only explicit track agent to invoke search_web, and the result illustrates a boundary case for the tool’s utility: the call returned near-empty results for the Anthropic docs, with summaries reading "Loading... Loading...", consistent with search_web being unable to render JavaScript-heavy pages. The agent correctly identified this as a limitation rather than a content finding. This is useful negative data — search_web and read_url_content have non-overlapping failure modes, and GLM inadvertently demonstrated both in the same session; but it also means the one agent that called search_web as a verification tool received nothing verifiable from it. The tool was available, used, and still didn’t close the uncertainty the agent had already flagged. The inaction finding from BL-1 and BL-2 holds: having access to search_web and calling search_web are not the same as getting useful output from it.
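The “Loading... Loading...” failure mode could be flagged mechanically, assuming search-tool summaries can be collected as plain strings (the function and the 50% threshold are illustrative, not part of search_web):

```python
def looks_like_render_failure(snippets, markers=("Loading...",)):
    """Flag search results whose summaries look like unrendered
    JavaScript placeholders rather than real page content.

    snippets: summary strings returned by a search tool
    markers:  substrings that indicate an unrendered page
    """
    if not snippets:
        return True  # no results at all is also a failure signal
    placeholder_hits = sum(
        1 for s in snippets
        if not s.strip() or any(m in s for m in markers)
    )
    return placeholder_hits / len(snippets) >= 0.5

# GLM's SC-2 result: summaries reading "Loading... Loading..."
print(looks_like_render_failure(["Loading... Loading..."]))            # True
print(looks_like_render_failure(["Create events fire on inserts..."]))  # False
```

A check like this would let a run distinguish “search_web returned nothing usable” from “search_web returned a content finding,” which GLM did correctly by hand.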

Methodology Implication

Don’t read completeness language in agent output as a source-content finding. “Likely a scraping artifact” isn’t a confirmed diagnosis; it’s an unverified hypothesis. The agent had tools available to test it and didn’t use them. Cross-referencing against the raw source, as with BL-2’s mixed-format misidentification, remains the only reliable method for distinguishing tool pipeline artifacts from source document properties.
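One way to make that cross-reference concrete is to compare the retrieved byte count against the raw source’s size, fetched out-of-band. A minimal sketch (the helper is hypothetical; the figures are BL-2’s):

```python
def completeness_ratio(retrieved_bytes: int, source_bytes: int) -> float:
    """Fraction of the raw source that the tool pipeline surfaced."""
    return retrieved_bytes / source_bytes if source_bytes else 0.0

# BL-2-style check: ~3.5-6.5 KB retrieved against a ~20 KB raw page.
# source_bytes would come from an out-of-band fetch of the unchunked
# page, e.g. len(urllib.request.urlopen(url).read()).
low = completeness_ratio(3_500, 20_000)
high = completeness_ratio(6_500, 20_000)
print(f"retrieved {low:.1%} to {high:.1%} of the expected source")
```

A ratio this far below 1.0 is what separates “the pipeline dropped content” from “the source was short to begin with,” which no agent output alone can establish.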


SC-2 URL Redirect Behavior

The interpreted track documented an agent-framed account of read_url_content behavior in which the tool appeared to rewrite https://docs.anthropic.com/en/api/messages to https://platform.claude.com/docs/llms-full.txt before executing the fetch. Agents interpreted this as a tool-level bug: a requested-path substitution made before the network call. SWE-1.6’s output included constructed reasoning, but no hard error codes or instrumentation output:

“`read_url_content` appears to have an internal URL rewriting issue that transforms `https://docs.anthropic.com/en/api/messages` into `https://docs.anthropic.com/llms-full.txt`, which then redirects to a non-existent endpoint”

The explicit track’s runs reproduced the same behavior. No agent received the target content; all were redirected to llms-full.txt. But there’s reason to question the “rewriting” characterization. llms-full.txt is a format deliberately designed for LLM consumption. A server-side 301/302 redirect from docs.anthropic.com/en/api/messages to the LLM-optimized docs set is likely intentional design, not a bug. The agents received a redirect instruction from the error response and followed it; whether the redirect originated inside read_url_content or at Anthropic’s server isn’t clear from agent output alone. Two competing explanations remain open:

| | Tool-layer Rewriting | Server-side Redirect |
| --- | --- | --- |
| Origin | read_url_content intercepts before network call | Anthropic’s server returns 301/302 |
| Requested path fetched? | No — substituted before request is made | No — redirect followed as specified |
| Implication | Cascade bug | Intentional routing for LLM consumption |
| Affects | Cascade only | Any automated fetch against docs.anthropic.com |
| Agent Diagnostic Source | No — error codes absent, redirect metadata not visible in thought panel | Inferred from redirect response text |
| Resolvable by raw track? | Assumed | Assumed |

Methodology Implication

Treat the interpreted track’s “URL rewriting” as an agent-generated hypothesis, not a confirmed finding. The raw track may resolve this if the prompts reveal any additional HTTP GET details. Until those tests are run, “redirect behavior” is the neutral characterization; which layer is responsible remains open.
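A raw-track probe could distinguish the two layers by fetching the URL outside Cascade with redirect-following disabled: a server-side 301/302 with a Location header supports the intentional-routing reading, while a 200 for the requested path would push responsibility back to the tool layer. A sketch using Python’s standard library (this is an assumed test design, not part of the existing methodology):

```python
import urllib.error
import urllib.request

class NoRedirect(urllib.request.HTTPRedirectHandler):
    """Suppress automatic redirect handling so 3xx responses surface."""
    def redirect_request(self, req, fp, code, msg, headers, newurl):
        return None  # returning None makes urllib raise HTTPError for 3xx

def probe_redirect(url: str, timeout: float = 10.0):
    """Fetch url without following redirects; return (status, Location).

    A 301/302 with a Location header here means the redirect is
    server-side; a 200 would mean any path rewriting happened in the
    tool layer before the network call.
    """
    opener = urllib.request.build_opener(NoRedirect)
    try:
        resp = opener.open(url, timeout=timeout)
        return resp.status, resp.headers.get("Location")
    except urllib.error.HTTPError as err:
        return err.code, err.headers.get("Location")

# Example (network access required):
# status, location = probe_redirect("https://docs.anthropic.com/en/api/messages")
```

A confirmed 3xx from this probe would settle the question in favor of server-side routing without relying on any agent’s self-report.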


@web Semantics: Prompt-Tool Misalignment

The explicit track exists to answer a specific question: does prefixing @web change retrieval behavior — ceiling, tool chain, chunking — relative to the interpreted track’s autonomous agent behavior? The first BL-1 runs surfaced a prior question that must be resolved before that comparison is meaningful: what does @web actually map to?

According to Windsurf’s documentation, @web is a directive to “force a docs search.” The docs distinguish this explicitly from “reading pages,” describing read_url_content as a separate capability that “happens locally on the user’s machine” and is unaffected by the “Enable Web Search” admin setting. These are architecturally distinct operations.

The original prompt instructs agents to “use the @web directive to fetch this URL directly.” This is a contradictory instruction: @web is a search directive; the task is a URL fetch. Agents can’t satisfy both simultaneously. Across BL-1 runs, no agent invoked search_web. All five used read_url_content, then offered varying explanations for the discrepancy:

| Agent | Reasoning |
| --- | --- |
| Claude Opus 4.6 | @web is the user-facing way to call search_web, but it serves a different purpose; read_url_content is more appropriate |
| Claude Sonnet 4.6 | @web is a user-facing IDE directive, not a separately accessible tool; the underlying tool called was read_url_content |
| GPT-5.3-Codex | Did use @web — the tool maps to read_url_content; didn’t call search_web |
| Kimi K2.5 | @web is UI shorthand, but used read_url_content as the appropriate fetch mechanism |
| SWE-1.6 | No tool called @web; read_url_content makes HTTP GET requests and returns a chunk index, search_web queries a search engine and returns snippets |

Cognition’s SWE eventually offered the most technically precise account of the directive-task distinction, yet the agent appeared unaware that @web is a Cascade UI feature. Its detailed knowledge of the pipeline architecture isn’t product-aware:

I don't have a tool called @web... Was there a different tool you intended for
me to use, or is "@web" a directive that maps to one of these tools?

Generally, the agents aren’t wrong. Given a specific URL and a fetch task, read_url_content is the correct tool. The prompt created a conflict that agents resolved by prioritizing task appropriateness over instruction-following. This is behaviorally reasonable, but methodologically it means the explicit track as currently designed doesn’t test what it was designed to test: no run triggered search_web. The explicit track prompt needs to work in “what’s your understanding of @web?” to test the intended hypothesis; the goal is to capture interpretation data in the same response as the behavioral data without steering agents toward a forced-choice answer before they report what they actually did.

The directive-task conflict is not an anomaly to suppress. Methodology refinement could replace the test URL with a keyword query, but that would break cross-platform comparability with the interpreted track. Keeping the prompt mostly as-is adds a conflict-resolution dimension to the findings without losing the core behavioral comparison.