Agent Ecosystem Testing

Methodology


Turn-by-Turn

Chat-based measurement through interaction, without direct code instrumentation

The Codex testing framework is the fourth in a series of chat-based agent testing frameworks in this collection, following Cursor, Copilot, and Cascade. Each platform has uncovered a different relationship between deployment surface, fetch syntax, and observable agent behavior, a pattern that directly shapes this framework’s design. Where Cascade introduced a third track to isolate @web as a variable, Codex introduces a fourth track to isolate deployment surface itself.


Surface Comparison

Does deployment context change retrieval behavior?

Across all prior frameworks, testing targeted a single surface per platform and added tracks to isolate variables within that surface. Cascade’s three-track design isolated the @web directive, but the finding was that the directive was redundant. Codex inherits that logic and extends it to a higher-order variable: the same underlying agent accessible through two different deployment surfaces. The four-track design tests whether surface context itself, as distinct from other architectural constraints, drives differences in retrieval behavior.

| Platform | Tracks | Primary Variable | Finding |
| --- | --- | --- | --- |
| Cursor | 2 | @Web context attachment vs autonomous fetch | @Web redundant; WebFetch and mcp_web_fetch called regardless |
| Copilot | 2 | fetch_webpage autonomous fetch, Copilot-interpreted vs Raw | fetch_webpage agent-selected; curl substitution byte-perfect but unreadable; no detected ceiling |
| Cascade | 3 | @web directive impact on ceiling, toolchain, chunking | @web redundant with URL; two-stage chunking pipeline; read-write asymmetry |
| Codex | 4 | Standalone environment vs VS Code extension | Does surface context change tool selection, truncation ceiling, or retrieval behavior? |

Architecture Comparison

Testing surface variants of the same underlying LLM family

Codex and Codex-powered VS Code use GPT LLMs but present different execution environments, workspace contexts, and potentially different retrieval toolchains. While Codex didn’t name any specific tools, Codex-powered VS Code cited web and web.open during preliminary questioning. Cascade agents referenced read_url_content, view_content_chunk, and search_web when asked, while Copilot’s fetch_webpage initially appeared only in error messages.

| Aspect | Copilot | Codex | VS Code-Codex |
| --- | --- | --- | --- |
| Syntax | Undocumented | Undocumented | Undocumented |
| Tools | fetch_webpage, curl | Unknown | web, web.open, curl |
| Workspace | VS Code present | No local workspace | VS Code present |
| Repeatability | Low: Auto routing variance | High: user-selected LLM, intelligence level | High: user-selected LLM, intelligence level |
| Questions | Does fetch_webpage have an agent-dependent ceiling? | What tools does Codex expose? Does workspace isolation impact retrieval? | Does workspace context contaminate retrieval? |

Track Design

| | T1 | T2 | T3 | T4 |
| --- | --- | --- | --- | --- |
| Surface | Codex | VS Code-Codex | Codex | VS Code-Codex |
| Method | GPT-interpreted | GPT-interpreted | Raw | Raw |
| Question | What does Codex report in isolation? Does it perceive truncation? | What does VS Code-Codex report with a workspace present? Does surface change self-perception? | What does Codex’s retrieval mechanism return verbatim? | Does VS Code-Codex raw retrieval output differ from Codex? |
| Prompt | Fetch URL, report measurements; no workspace | Fetch URL, report measurements; workspace present | Fetch URL, return output verbatim; verification script extracts measurements | Fetch URL, return output verbatim; verification script extracts measurements |
| Measurements | Agent estimates: “appears truncated at ~X chars,” “Markdown seems complete” | Same as T1; compared against the Codex direct baseline | Character count via len(), token count via tiktoken, exact truncation point, last 50 characters (see the verification sketch below) | Same as T3; compared against the Codex direct raw baseline |
| Best For | Understanding DX and invocation patterns in an isolated environment | Understanding DX with workspace context | Retrieval ceiling in isolation | Surface-variant retrieval comparison |

VS Code-Codex workspace context is a confound that can’t be fully controlled; the analysis treats it as a variable of interest rather than noise.
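
The raw tracks (T3, T4) hand the agent’s verbatim output to a verification script rather than trusting agent self-reports. Below is a minimal sketch of that step, under assumptions not confirmed by the framework: the verbatim output is saved to a local file, the target URL serves the same plain text the retrieval tool returns (e.g. a raw Markdown endpoint), and o200k_base is the relevant GPT encoding.

```python
"""Sketch of the T3/T4 verification step. File handling, the direct
reference fetch, and the o200k_base encoding are illustrative assumptions,
not the framework's confirmed implementation."""
import os
import sys

import requests  # reference copy fetched outside the agent
import tiktoken  # GPT-family tokenizer for token counts


def measure(raw_path: str, url: str) -> dict:
    with open(raw_path, encoding="utf-8") as f:
        retrieved = f.read()

    # Full reference document, fetched directly for truncation comparison.
    reference = requests.get(url, timeout=30).text

    # Exact truncation point: index where the retrieved text first diverges
    # from the reference (only meaningful if the agent returned a prefix).
    common = os.path.commonprefix([retrieved, reference])
    truncated = len(retrieved) < len(reference)

    enc = tiktoken.get_encoding("o200k_base")
    return {
        "chars": len(retrieved),               # character count via len()
        "tokens": len(enc.encode(retrieved)),  # token count via tiktoken
        "truncation_point": len(common) if truncated else None,
        "last_50_chars": retrieved[-50:],
    }


if __name__ == "__main__":
    # e.g. python verify.py t3_run01.txt https://example.com/page.md
    print(measure(sys.argv[1], sys.argv[2]))
```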


Codex-Specific Unknowns

| Question | Details | Approach | Value |
| --- | --- | --- | --- |
| Unverified Retrieval Mechanism | VS Code-Codex agents named web, but it’s undocumented | Observe any tool names reported in output per run; compare across surfaces | Establishes whether Codex exposes a consistent, nameable retrieval mechanism |
| Surface-Driven Behavioral Divergence | Unknown whether workspace isolation changes retrieval, tool selection, or ceiling | Compare raw-track measurements across T3 and T4 | Determines whether surface context is a meaningful variable |
| Model Routing Stability | Copilot’s default Auto selection routed to models beyond GPT; Codex restricts to GPT and allows user-set stability | Log LLM and intelligence level per run (see the run-log sketch below) | Isolates whether findings are attributable to a specific GPT LLM or intelligence level |
| Workspace Context Contamination | Agents may autonomously substitute local code execution when workspace scripts are present | Log all runs where agents attempt substitution | Determines whether workspace isolation is required as a precondition for clean track execution |
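
To make the routing-stability and contamination checks auditable, each run can be captured as a fixed record. The sketch below shows one possible shape for that record; the field names and example values are assumptions for illustration, as the framework does not define a logging schema.

```python
"""Sketch of a per-run log record for the unknowns above. Field names and
values are illustrative assumptions; the framework defines no schema."""
import json
from dataclasses import asdict, dataclass, field


@dataclass
class RunRecord:
    track: str                        # "T1" through "T4"
    surface: str                      # "codex" or "vscode-codex"
    model: str                        # user-selected GPT LLM for this run
    intelligence_level: str           # user-selected level, logged per run
    tools_reported: list[str] = field(default_factory=list)  # e.g. ["web", "web.open"]
    local_substitution: bool = False  # agent ran workspace code instead of fetching


# One record per run; the logging approaches above map to tools_reported,
# model/intelligence_level, and local_substitution respectively.
record = RunRecord(track="T4", surface="vscode-codex",
                   model="gpt-example", intelligence_level="high",
                   tools_reported=["web.open"])
print(json.dumps(asdict(record), indent=2))
```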