Field Notes from a Yelper: Guerrilla Testing Agents
Start anywhere, be wrong. Catching a false premise is when the real testing begins.

What happens when the data pulls the rug out from under the hypotheses? Adapt. Then adapt again. And again. Eventually, the methodology itself becomes the finding.
AET, the Agent Ecosystem Testing project, began as a programmatic extension of Dachary Carey’s Agent Web Fetch Spelunking in support of the Agent-Friendly Documentation Spec. AET started with a basic question: what actually happens between “agent fetches a URL” and “user sees output?”
It’s like Yelping for agents: use a product, note what’s there. The value isn’t entirely in controlled conditions or peer-reviewed reproducibility. It’s in the honest record: here’s what was expected and here’s what the data said instead.
Premise Collapse
The methodology was initially a two-track approach: interpreted captures model self-reports, while raw produces citable metrics. Some testing frameworks inspect API responses directly, while others rely on intentionally unautomated, chat-based measurement through interaction, with no direct code instrumentation. Testing closed consumer applications exposes all sorts of complications that make it difficult to extract meaningful data, and output variance is just the tip of the iceberg.
While reproducibility is a design goal, it’s not guaranteed. Running the same test twice and getting meaningfully different results is expected - retrieval depth, content transformation, and self-reported behavior may all vary across sessions. This isn’t necessarily a testing flaw, but it’s a finding worth documenting.
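The two-track split and the run-to-run variance can be sketched in a few lines. This is a hypothetical illustration, not AET's actual schema: the field names (`content_sha`, `retrieval_depth`, `self_report`) and the `capture`/`diverging_fields` helpers are invented for the example.

```python
# Hypothetical sketch: capture the same test twice and see which fields drift.
# The raw track holds verifiable data (a content hash, a depth count); the
# interpreted track holds the model's self-report. Field names are illustrative.
import hashlib

def capture(raw_content: str, self_report: str, retrieval_depth: int) -> dict:
    """Record one session on both tracks: raw (verifiable) and interpreted."""
    return {
        "content_sha": hashlib.sha256(raw_content.encode()).hexdigest(),
        "retrieval_depth": retrieval_depth,
        "self_report": self_report,
    }

def diverging_fields(a: dict, b: dict) -> list[str]:
    """Return the fields that changed between two runs of the same test."""
    return sorted(k for k in a if a[k] != b[k])

run1 = capture("<html>full page</html>", "I fetched the full page.", 3)
run2 = capture("<html>partial</html>", "I fetched the full page.", 2)
print(diverging_fields(run1, run2))  # ['content_sha', 'retrieval_depth']
```

Note what the toy example surfaces: the raw track diverged between sessions while the self-report stayed word-for-word identical. That gap is exactly the finding worth documenting.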
Cursor testing posed a particular challenge. The hypotheses were built on the assumption that attaching @Web to a context window had some meaningful impact on whether the agent successfully retrieved content from the web - possibly making the prompting more precise and effective. This turned out to be completely false. The feature had been automated into the background: @Web remained an acceptable invocation, but it no longer had any functional impact. Calling it or not calling it made no practical difference - the agent was fetching from the web either way. Maybe the signs were there; @Web has no dedicated documentation outside of some old community forums. On the other hand, no model called out the unnecessary invocation.
This meant reframing the Cursor findings - framework references, tool chain data, all of it had to be re-evaluated and re-contextualized. It wasn't a failure but a correction: the data wasn't wrong, only the premise. The revised Cursor Friction Note is better for it.
No Docs, No Anchor
Gemini API URL context testing caught documentation gaps. Some hard limits were reinforced, while other assertions proved less accurate. The expectation was that Gemini wouldn’t be able to describe a YouTube video, but it very confidently reported Rick Astley’s “Never Gonna Give You Up.”
Copilot testing posed a different challenge: not a false premise, but the absence of a premise. There's no documented web fetch behavior to anchor hypotheses against, which forced the framework to expand into something more exploratory and less hypothesis-driven. The framework instead instrumented around this absence: more logging, more inference, and a stricter separation between what the model reports and what the data actually demonstrates. Two retrieval mechanisms surfaced during testing: an agent-routed fetch_webpage that returns relevance-ranked excerpts, and a curl path that retrieves byte-perfect full content. Very different behaviors. Neither documented. When the platform doesn't document how it works, the testing framework has to do more of the talking.
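One way to instrument around that gap is to classify each captured payload against ground truth already retrieved by the raw track. A hedged sketch, assuming the full document was captured via curl beforehand; `classify_retrieval` and its labels are invented for illustration, not Copilot internals:

```python
# Hypothetical sketch: label a retrieved payload relative to the full source.
# A curl-style path returns the document byte-for-byte; an agent-routed path
# may return a verbatim excerpt or a transformed (summarized/reflowed) version.
import hashlib

def classify_retrieval(payload: bytes, ground_truth: bytes) -> str:
    """Label a capture: byte-perfect full content, verbatim excerpt, or transformed."""
    if hashlib.sha256(payload).digest() == hashlib.sha256(ground_truth).digest():
        return "byte-perfect"        # full, untransformed retrieval
    if payload and payload in ground_truth:
        return "verbatim-excerpt"    # a ranked excerpt, still untransformed
    return "transformed"             # summarized, reflowed, or otherwise altered

doc = b"<h1>Spec</h1><p>Agents MUST send Accept headers.</p>"
print(classify_retrieval(doc, doc))                           # byte-perfect
print(classify_retrieval(b"Agents MUST send", doc))           # verbatim-excerpt
print(classify_retrieval(b"The spec requires headers", doc))  # transformed
```

The point of the three-way label is that "the agent fetched the page" collapses very different behaviors; separating them keeps the data citable even when the platform stays silent.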
Natural Methodology Evolution
The methodology didn’t start with a roadmap; it accumulated one. The Claude API web fetch testing established a baseline and the two-track structure. Cursor introduced the problem of opaque, undocumented functionality and false-premise correction. Copilot forced the framework to expand when tool visibility was absent. Unlike Cursor and Copilot, testing Windsurf’s Cascade required another framework expansion in the form of a new track in which @web is called directly. That brings the testing framework to three tracks:
- Cascade-interpreted: model self-reports
- Raw: programmatically verified metrics
- Explicit: @web directive called directly, behavior isolated
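Keeping three tracks straight mostly means tagging every observation at capture time. A minimal sketch of that bookkeeping, with an invented `Observation` schema (the track names mirror the list above; everything else is illustrative):

```python
# Hypothetical sketch: tag each observation with its track so the three
# data streams never mix. Schema is illustrative, not AET's actual format.
from dataclasses import dataclass
from enum import Enum

class Track(Enum):
    INTERPRETED = "cascade-interpreted"  # model self-reports
    RAW = "raw"                          # programmatically verified metrics
    EXPLICIT = "explicit"                # @web called directly, behavior isolated

@dataclass(frozen=True)
class Observation:
    track: Track
    prompt: str
    note: str

log = [
    Observation(Track.EXPLICIT, "@web summarize https://example.com", "fetch confirmed"),
    Observation(Track.RAW, "summarize https://example.com", "response bytes logged"),
    Observation(Track.INTERPRETED, "summarize https://example.com", "claims full fetch"),
]
explicit_only = [o for o in log if o.track is Track.EXPLICIT]
print(len(explicit_only))  # 1
```

Filtering by track is what makes the later comparison possible: self-reports and verified metrics can be laid side by side only if they were never merged in the first place.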
The platform testing order was also not planned; it grew organically out of each session’s findings. Each platform tested is a response to the last one, revealing what users can and can’t observe and what instrumentation is required to surface it.
Start Guerrilla Testing Agents
Start anywhere and iterate honestly. The data will do the rest.
Output variation is a given, platforms may remain opaque, and documentation can be sparse. The goal isn’t a tidy conclusion, but getting comfortable moving through uncertainty and taking notes along the way. No lab coat required. Some tips:
- Test something specific: “how does this agent fetch URLs?” is specific, while “is this agent good?” is, well, less good
- Wrong hypothesis beats no hypothesis: false premises can be corrected; no premise may require expensive framework expansion
- Keep model self-reporting and raw output separate: they diverge more often than one might expect
- Let constraints shape methodology: no docs, no problem - instrument around it
- Over-document: log more than seems necessary, especially for chat-based interactions
Agent Ecosystem Testing