Agent Ecosystem Testing

Methodology


  1. Chat-based measurement through interaction, without direct code instrumentation

    The Cascade testing framework is the third in a series of chat-based agent testing frameworks in this collection, following Cursor and Copilot. Each platform has surfaced a different relationship between user-facing fetch syntax and actual agent behavior — a pattern that directly shapes this framework’s design.

  2. An evolving relationship between fetch syntax and agent behavior

    Across three platforms, the role of explicit web fetch directives has shifted in a consistent direction. This progression, from explicit syntax to autonomous behavior to documented-but-effect-unknown, frames the central methodological question Cascade testing inherits from the two prior frameworks. The three-track design below exists specifically to isolate @web as a variable, rather than assume its effect in either direction.

    | Platform | User-Facing Syntax | What Testing Revealed |
    |---|---|---|
    | Cursor | @Web context attachment | Direct invocation was unnecessary as capability had become autonomous by the time testing began; backend mechanisms WebFetch, mcp_web_fetch invoke regardless of @Web syntax |
    | Copilot | None documented | No user-invocable syntax exists; testing surfaced the undocumented fetch_webpage tool from agent output |
    | Cascade | @web directive, documented | Does invoking @web change retrieval behavior (ceiling, tool chain, chunking)? |

  3. Testing a closed consumer application vs an open API

    Rather than targeting specific API endpoints with documented interfaces, Cascade testing targets a consumer application with proprietary chat behavior and a partially documented tool layer. Cascade’s web fetch implementation surfaces three named tools, each reported by Cascade itself during runs: read_url_content for direct URL fetch, view_content_chunk for paginating large documents via DocumentId, and search_web for query-based lookup. The documentation references these tools but gives few details about their behavior. By contrast, in Claude API testing, fetch behavior is directly inspectable via tool_result.
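
    Because the tool names surface only in Cascade's own output, one way to record the tool chain for a run is to scan the saved transcript for them. The following is a minimal sketch under that assumption; the helper name observed_tools and the idea of a plain-text transcript are hypothetical and not part of the framework as described.

```python
import re

# Tool names as reported by Cascade during runs (taken from the text above).
KNOWN_TOOLS = ("read_url_content", "view_content_chunk", "search_web")

# Match any known tool name as a whole word.
_TOOL_PATTERN = re.compile(
    r"\b(" + "|".join(re.escape(name) for name in KNOWN_TOOLS) + r")\b"
)


def observed_tools(transcript: str) -> list[str]:
    """Return the known tool names in order of first appearance.

    Hypothetical helper: it only detects names that Cascade chose to
    mention, so an empty result does not prove that no tool ran.
    """
    seen: list[str] = []
    for match in _TOOL_PATTERN.finditer(transcript):
        name = match.group(1)
        if name not in seen:
            seen.append(name)
    return seen
```

    A scan like this records what the model reports, not what actually executed, which is the same observability gap the comparison with tool_result points to.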

    | Aspect | Cursor | Copilot | Cascade |
    |---|---|---|---|
    | User Fetch Syntax | @Web context attachment | None | @web directive |
    | Tools Observed | WebFetch, mcp_web_fetch | fetch_webpage and/or curl | read_url_content, search_web, view_content_chunk |
    | Repeatability | Medium | Low: model routing variance | Low: approval interaction may affect routing |
    | Questions | Does MCP override @Web? Does agent auto-chunk? | Does fetch_webpage have a consistent ceiling? Does it vary by model? | Does @web change the ceiling, tool chain, or chunking behavior? |

  4. Measuring with three complementary tracks

    |  | Interpreted Track | Raw Track | Explicit Track |
    |---|---|---|---|
    | Question | What does Cascade report back without steering? Does it accurately perceive truncation? | What does read_url_content actually return? Where exactly does truncation occur? | Does adding @web change truncation limits, tool chain, or chunking behavior? |
    | Method | Fetch URL, report measurements; no @web | Fetch URL, return output verbatim; no @web; verification script extracts measurements | Identical to interpreted track, prefixed with @web |
    | Measurements | Model estimates: “appears truncated at ~X chars,” “Markdown seems complete” | Character count via len(), token count via tiktoken, exact truncation point, last 50 characters | Same as interpreted track; compared against implicit baseline |
    | Repeatability | Low: approval interaction may affect routing | Medium: same URL should yield consistent read_url_content output | Medium: @web may stabilize tool selection |
    | Best For | Understanding DX; surfacing approval-gated fetch: does approval gating affect routing consistency? | Auto-pagination behavior: does view_content_chunk paginate automatically, or only when prompted? | @web effect on ceiling: does @web change the ceiling, or does Cascade repeat the Cursor finding that the directive is redundant? |
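
    The raw-track measurements listed above (character count via len(), token count via tiktoken, truncation point, last 50 characters) could be extracted along the following lines. This is a sketch, not the framework's actual verification script: the function names, the cl100k_base encoding, and the use of a known-complete reference text to locate the truncation point are all assumptions made for illustration.

```python
from typing import Optional


def count_tokens(text: str, encoding_name: str = "cl100k_base") -> Optional[int]:
    """Token count via tiktoken, or None if tiktoken cannot be used.

    The encoding name is an assumption; the source only says "tiktoken".
    """
    try:
        import tiktoken  # third-party; may be missing or unable to load its data

        encoding = tiktoken.get_encoding(encoding_name)
        # disallowed_special=() treats special-token text as ordinary text.
        return len(encoding.encode(text, disallowed_special=()))
    except Exception:
        return None


def common_prefix_len(a: str, b: str) -> int:
    """Length of the longest shared prefix of a and b."""
    limit = min(len(a), len(b))
    for i in range(limit):
        if a[i] != b[i]:
            return i
    return limit


def measure_raw_output(output: str, reference: Optional[str] = None) -> dict:
    """Measure verbatim fetch output pasted from a raw-track run.

    `reference` is a hypothetical known-complete copy of the same document
    in the same format. A truncation point is reported only when the output
    is a strict prefix of that reference.
    """
    result = {
        "char_count": len(output),
        "token_count": count_tokens(output),
        "last_50_chars": output[-50:],
        "matches_reference_prefix": None,
        "truncation_point": None,
    }
    if reference is not None:
        prefix = common_prefix_len(output, reference)
        is_prefix = prefix == len(output)
        result["matches_reference_prefix"] = is_prefix
        if is_prefix and len(output) < len(reference):
            result["truncation_point"] = prefix
    return result
```

    If the fetched text is reformatted on the way back (for example, HTML converted to Markdown), the prefix comparison will not hold, and the last 50 characters become the more reliable marker of where the output stopped.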

    Known limitations: the interpreted and explicit tracks vary between runs; read_url_content requires approval before the fetch executes, and because the approval interaction itself may influence routing, it is logged per run; view_content_chunk pagination via DocumentId is only partially observable through model output.
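
    Since run-to-run variance and the approval interaction are both tracked per run, a small log record plus a per-track summary is one way to make repeatability visible. The sketch below is illustrative only: the RunRecord fields and the summarize_runs helper are hypothetical names, not part of the framework as described.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class RunRecord:
    """One logged run (hypothetical schema)."""

    track: str               # "interpreted", "raw", or "explicit"
    url: str
    approval_prompted: bool  # whether an approval prompt appeared before the fetch
    tools: tuple[str, ...]   # tool chain as reported in the model output
    char_count: int          # measured (raw) or model-estimated (other tracks)


def summarize_runs(runs: list[RunRecord]) -> dict[str, dict]:
    """Group runs by track and report the spread in tool chain and size."""
    by_track: dict[str, list[RunRecord]] = {}
    for run in runs:
        by_track.setdefault(run.track, []).append(run)

    summary: dict[str, dict] = {}
    for track, group in by_track.items():
        counts = [run.char_count for run in group]
        summary[track] = {
            "runs": len(group),
            "approvals": sum(run.approval_prompted for run in group),
            "tool_chains": Counter(run.tools for run in group),
            "char_count_min": min(counts),
            "char_count_max": max(counts),
        }
    return summary
```

    More than one distinct tool chain within a track, or a wide gap between the minimum and maximum character counts, would point to the routing variance described above.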