Agent Ecosystem Testing

Testing Index

Empirical measurement of what happens between “agent fetches URL” and “user sees output”: retrieval mechanism behavior, content transformation, and architectural constraints. The tests observe the URL-to-response pipeline through layers that platforms don’t disclose. Output variation is documented with a two-track approach: the interpreted track captures agent self-perception, and the raw track produces citable data for the Agent-Friendly Documentation Spec.


Blogs

| Post | Focus |
| --- | --- |
| Field Notes from a Yelper: Guerrilla Testing Agents | Methodology evolution: what broke, what changed, and letting data lead |

Documentation Structure

| Section | Purpose |
| --- | --- |
| Methodology | Testing approach details, track design, constraints |
| Interpreted vs Raw | Observations, implications for agent devs, docs teams |
| Findings: Interpreted | Agentic retrieval, reasoning, reporting |
| Findings: Raw | Agentic write behavior, programmatically extracted metrics |
| Friction Note | Known issues, gaps, or edge cases encountered during testing |

Results Summary

Further analysis is in Platform Comparisons. Each platform link leads to that platform's testing methodology.

| Platform | Key Finding | Focus |
| --- | --- | --- |
| Anthropic Claude API | Char-based truncation at ~100 KB of rendered content | Baseline reference; establishing two-track methodology |
| Anysphere Cursor | Agent-routed fetch with undocumented truncation 28 KB–240 KB+; high cross-session variance | Reverse-engineering opaque, closed consumer tools |
| Cognition Windsurf Cascade | Two-stage chunking-pipeline, no fixed ceiling; retrieval completeness agent and source size-dependent; read-write asymmetry; @web redundant with a URL | Three-track design; truncation testing partially documented lossy architecture |
| Google Gemini API | Hard limit: 20 URLs per request; supports PDF and JSON | Identifying architectural constraints, format support |
| Microsoft GitHub Copilot | Agent-routed fetch_webpage → relevance-ranked excerpts, no fixed ceiling detected vs curl byte-perfect full retrieval | Separating retrieval mechanism from retrieval quality through toolchain visibility |
| OpenAI Codex | Testing in progress | Surface Comparison |
| OpenAI Web Search | Tool invocation conditional, agent-dependent; differs by API surface | Comparing behavior across different APIs |
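
Several findings above are truncation ceilings. As a rough illustration of how a raw-track metric of that kind could be computed, the sketch below compares a reference copy of a page with the text a platform returned and reports how much of the leading content survived. This is not the project's actual tooling: the function names `common_prefix_len` and `truncation_report` are hypothetical, the input is synthetic, and sizes are counted in characters rather than bytes. It assumes simple head truncation, so it would not describe platforms that return relevance-ranked excerpts.

```python
def common_prefix_len(a: str, b: str) -> int:
    """Return the number of leading characters shared by a and b."""
    n = min(len(a), len(b))
    for i in range(n):
        if a[i] != b[i]:
            return i
    return n


def truncation_report(reference: str, retrieved: str) -> dict:
    """Summarize how much of `reference` survives at the start of `retrieved`.

    Hypothetical helper: assumes the platform cuts content off at the end
    (head truncation) and does not reorder or rewrite it.
    """
    prefix = common_prefix_len(reference, retrieved)
    ratio = round(prefix / len(reference), 3) if reference else 1.0
    return {
        "reference_chars": len(reference),
        "retrieved_chars": len(retrieved),
        "shared_prefix_chars": prefix,
        "retained_ratio": ratio,
        "truncated": prefix < len(reference),
    }


# Synthetic data: a 250,000-character page cut off at 100,000 characters.
# Each line is 10 characters (9 digits plus a newline), so every position
# in the page is distinguishable from its neighbors.
reference = "".join(f"{i:09d}\n" for i in range(25_000))
retrieved = reference[:100_000]

report = truncation_report(reference, retrieved)
print(report)
```

Running the same comparison across repeated sessions and source sizes is one way a ceiling, or the absence of a fixed one, might show up in the numbers.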