# Agent Ecosystem Testing

## Testing Index

Empirical measurement of what happens between "agent fetches URL" and "user sees output": retrieval-mechanism behavior, content transformation, and architectural constraints, on platforms where the fetch-to-output pipeline may pass through multiple opaque layers and none of it is documented. The work uses a two-track approach: the interpreted track captures model self-perception and output variance, and the raw track produces citable data for the Agent Docs Spec.
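As a minimal sketch of how the two tracks line up per run (the function and field names here are illustrative, not the project's actual tooling):

```python
def compare_tracks(reported_chars: int, raw_bytes: bytes) -> dict:
    """Pair the interpreted measurement (what the model says it received)
    with the raw measurement (the bytes actually delivered)."""
    raw_chars = len(raw_bytes.decode("utf-8", errors="replace"))
    return {
        "reported_chars": reported_chars,  # interpreted track
        "raw_chars": raw_chars,            # raw track
        "truncated": reported_chars < raw_chars,
        "delta": raw_chars - reported_chars,
    }
```

For example, `compare_tracks(90_000, b"x" * 120_000)` flags truncation with a 30,000-character gap between what the model reports and what was served.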


## Blogs

| Post | Focus |
| --- | --- |
| Field Notes from a Yelper: Guerrilla Testing Agents | Methodology evolution: what broke, what changed, and letting data lead |

## Testing Documentation Structure

| Section | Purpose |
| --- | --- |
| Methodology | Testing approach details and constraints |
| Interpreted vs Raw | Two-track values and measurements |
| Findings: Interpreted | What the model reports vs what it received; run variation |
| Findings: Raw | Metrics extracted programmatically; reproducible, spec-ready |
| Friction Note | Known issues, gaps, or edge cases encountered during testing |

## Results Summary

| Platform | Key Finding | Focus |
| --- | --- | --- |
| Anthropic Claude API | Character-based truncation at ~100KB of rendered content | Baseline reference; establishing the two-track methodology |
| Anysphere Cursor | Agent-routed fetch with undocumented truncation (28KB–240KB+); high cross-session variance | Reverse-engineering opaque, closed consumer tools |
| Cognition Windsurf Cascade | Testing in progress | Reverse-engineering partially documented, closed consumer tools |
| Google Gemini API | Hard limit: 20 URLs per request; supports PDF and JSON | Identifying architectural constraints and format support |
| Microsoft GitHub Copilot | Agent-routed `fetch_webpage` (relevance-ranked excerpts, no fixed ceiling detected) and/or `curl` (byte-perfect full retrieval) | Separating retrieval mechanism from retrieval quality through tool-call visibility |
| OpenAI Web Search | Tool invocation is conditional and model-dependent; differs by API surface | Comparing behavior across API endpoints |
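Truncation ceilings like the ones above can be estimated without platform cooperation by serving a payload seeded with numbered sentinel markers, then checking which marker survives the fetch. A hypothetical sketch of that probe (the marker format and 1 KB granularity are assumptions, not any platform's documented behavior):

```python
import re

MARKER_EVERY = 1024  # assumed probe granularity: one sentinel per KB

def make_payload(total_kb: int) -> str:
    """Build a test payload with a numbered sentinel at the start of each KB."""
    chunks = []
    for i in range(total_kb):
        marker = f"<<MARK:{i:05d}>>"
        chunks.append(marker + "x" * (MARKER_EVERY - len(marker)))
    return "".join(chunks)

def last_marker(text: str) -> int:
    """Highest surviving marker index, or -1 if none made it through."""
    hits = re.findall(r"<<MARK:(\d{5})>>", text)
    return int(hits[-1]) if hits else -1
```

Running `last_marker` on whatever the agent echoes back bounds the cutoff to within `MARKER_EVERY` bytes; repeating the probe across sessions is one way to surface variance like Cursor's 28KB–240KB+ spread.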