Agent Ecosystem Testing

ChatGPT-interpreted vs Raw


Track Design

The ChatGPT-interpreted track captures what gpt-4o-mini-search-preview believes it retrieved: how many sources it consulted, whether results felt current, how it characterizes search depth. This is the model’s self-report. The raw track captures what the API actually returned: exact tool_invoked flags, exact source lists from web_search_call.action.sources, exact token counts from response.usage. These are Python len() calls and dictionary lookups, not model estimates.
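The raw-track measurements described above can be sketched as plain dictionary lookups. This is a minimal illustration, not the code in web_search_test_raw.py: it assumes the response has already been converted to a plain dict, and the function name, the usage keys input_tokens and output_tokens, and the returned key names other than tool_invoked, source_count, and search_queries_issued are assumptions.

```python
from typing import Any


def extract_raw_metadata(response: dict[str, Any]) -> dict[str, Any]:
    """Read search metadata from a Responses-style payload without asking the model."""
    # Keep only the output items that represent a web search call.
    calls = [
        item for item in response.get("output", [])
        if item.get("type") == "web_search_call"
    ]
    sources: list[dict[str, Any]] = []
    queries: list[str] = []
    for call in calls:
        action = call.get("action") or {}
        # Sources are only present when they were requested via `include`.
        sources.extend(action.get("sources") or [])
        if action.get("query"):
            queries.append(action["query"])
    usage = response.get("usage") or {}
    return {
        "tool_invoked": bool(calls),       # True if any web_search_call item exists
        "source_count": len(sources),      # plain len(), not a model estimate
        "search_queries_issued": queries,  # exact internal query strings
        "input_tokens": usage.get("input_tokens"),    # assumed usage key
        "output_tokens": usage.get("output_tokens"),  # assumed usage key
    }
```

Because every value comes from a lookup or a len() call, two runs that return the same payload will always yield the same measurements.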

The gap between these two tracks is itself a finding. If the interpreted track reports “12 distinct sources” but the raw source_count is 1, that discrepancy belongs in the spec.
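One way to quantify that gap is to pull the claimed number out of the self-report text and subtract the measured count. The sketch below is illustrative only: the function names and the regular expression are assumptions, and a real self-report may phrase its count in ways this pattern misses.

```python
import re
from typing import Optional


def claimed_source_count(self_report: str) -> Optional[int]:
    """Return the first number the model attaches to the word 'sources', if any."""
    # Matches "10 sources" as well as "12 distinct sources" (one optional word between).
    match = re.search(r"(\d+)\s+(?:\w+\s+)?sources?\b", self_report, re.IGNORECASE)
    return int(match.group(1)) if match else None


def source_count_gap(self_report: str, raw_source_count: int) -> Optional[int]:
    """Claimed minus measured source count; None when the report states no number."""
    claimed = claimed_source_count(self_report)
    return None if claimed is None else claimed - raw_source_count
```

For the case described above, source_count_gap("I consulted 12 distinct sources.", 1) returns 11.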

|                    | web_search_test.py | web_search_test_raw.py |
|--------------------|--------------------|------------------------|
| API                | Chat Completions | Responses |
| Measures           | Model’s interpretation of what it retrieved | Raw metadata extracted directly from response object |
| Search Invocation  | Implicit - model always searches, no visibility | Explicit web_search_call item in response.output |
| Source Counts      | Model self-report, frequently overstates | Python len() on web_search_call.action.sources |
| Citation Counts    | url_citation annotations in message.annotations | Not applicable, raw track doesn’t count inline citations |
| Internal Query     | Not exposed | action.query string from web_search_call item - exact |
| max_output_tokens  | Not set, model writes full assessments | Set to 256 - minimal output, metadata is the signal |
| Token Cost per Run | Higher, model writes long self-assessments | Lower, minimal prompt, capped output |
| Domain Filtering   | Not available, Chat Completions API only | Available on web_search tool, non-functional as tested |
| Sources List       | Not available, Chat Completions API only | Available via include=["web_search_call.action.sources"] |
| Best For           | Understanding what the model perceives it retrieved | Citable measurements for the spec |
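The raw-track settings in the right-hand column can be collected into a single set of request arguments. The sketch below only builds the argument dict and makes no network call. The helper name, the default model, and placing search_context_size on the tool entry are assumptions rather than details confirmed by the test scripts.

```python
from typing import Any, Optional


def build_raw_track_request(
    query: str,
    model: str = "gpt-4o",  # assumed default; the raw runs also mention gpt-5
    search_context_size: Optional[str] = None,  # "low", "medium", or "high"
) -> dict[str, Any]:
    """Assemble keyword arguments for a raw-track Responses API call."""
    tool: dict[str, Any] = {"type": "web_search"}
    if search_context_size is not None:
        # Assumption: the context size is set on the tool entry itself.
        tool["search_context_size"] = search_context_size
    return {
        "model": model,
        "input": query,
        "tools": [tool],
        # Ask the API to return the full source list with each search call.
        "include": ["web_search_call.action.sources"],
        # Cap the output: the metadata, not the prose, is the signal.
        "max_output_tokens": 256,
    }
```

With the OpenAI Python SDK these arguments would presumably be passed as client.responses.create(**build_raw_track_request("...")); that call is not exercised here.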

Agent-Friendly Docs Spec

The following are appropriate additions to the spec’s Known Platform Limits table:

Tool Invocation

  • Tool invocation is conditional and deterministic for unambiguous query types: static facts and trivial math never triggered a search in any raw track run; live data and research queries always invoked the tool. Behavior was consistent across all 3 complete raw runs.
  • In the interpreted track, tool invocation is implicit and not observable. citation_count == 0 doesn’t mean search wasn’t performed - test_1_live_data returned 0 citations in 2/3 runs while still producing accurate live BTC prices.

Citation and Source Counts

  • Citation counts in the interpreted track are highly nondeterministic: test_8_multi_hop ranged 8–20 across 3 runs; no test produced identical counts across all runs.
  • Self-reported source counts are unreliable: test_6 run 2 claimed “10 sources” but produced 1 inline citation; test_4 run 3 claimed “12 distinct sources” but produced 1 citation.
  • Raw source counts were stable at 12 across all invoked tests and all context sizes. The only exceptions were domain filter tests, which errored, and no-search tests, which returned 0 sources.
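The stability claims above reduce to a simple per-test summary over runs. This is a minimal sketch with an assumed function name and return shape; it takes one count per run and reports the range and whether all runs agreed.

```python
from typing import Sequence


def summarize_counts(counts: Sequence[int]) -> dict[str, object]:
    """Summarize one test's citation or source counts across runs."""
    if not counts:
        raise ValueError("need at least one run")
    low, high = min(counts), max(counts)
    return {
        "min": low,
        "max": high,
        "spread": high - low,               # 0 means perfectly stable
        "identical": len(set(counts)) == 1, # True only if every run agreed
    }
```

A raw-track test that returned 12 sources in each of 3 runs would summarize to a spread of 0 with identical set to True, while the 8–20 range reported for test_8_multi_hop corresponds to a spread of 12.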

search_context_size

  • Latency impact is consistent in the interpreted track, where high was ~1.5–1.7× slower than low, and inconsistent in the raw track, where high was faster than low in run 2 (r2).
  • Citation impact is inconsistent in both tracks.
  • Source count impact is zero in the raw track: source_count was 12 regardless of low, medium, or high across all 3 runs.

Internal Query Construction

  • search_queries_issued in the raw track contains stale date strings: internal queries appended training-era years (“2023” and “October 2023”) despite running in March 2026, for example "latest developments in EU AI regulation 2023". Query construction isn’t temporally aware: search_queries_issued reflects model bias, not wall-clock time.
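A check for this kind of stale date string can be sketched by comparing any four-digit year in the query against the wall-clock year. The function name and the year pattern are assumptions, not part of the test suite.

```python
import re
from datetime import date
from typing import Optional


def stale_years(query: str, today: Optional[date] = None) -> list[int]:
    """Return four-digit years in the query that are earlier than the current year."""
    current_year = (today or date.today()).year
    # Only consider plausible calendar years (1900-2099) as whole tokens.
    years = [int(y) for y in re.findall(r"\b(?:19|20)\d{2}\b", query)]
    return [y for y in years if y < current_year]
```

Calling stale_years("latest developments in EU AI regulation 2023", date(2026, 3, 1)) returns [2023]. The check is only a heuristic: a query that is legitimately about a past year would be flagged as well.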

Domain Filtering

  • Allow-list filtering (allowed_domains) worked once, on web_search_preview in run 2: filter_respected: true, with 2 “apnews.com” sources. After switching to web_search per docs guidance, both allow-list and block-list filtering returned "Unsupported parameter 'filters'" on every subsequent run across gpt-4o and gpt-5.
  • Block-list filtering never succeeded in any configuration across 6 runs, 2 tool types, and 2 models. Three parameter names were attempted (exclude_domains, excluded_domains, blocked_domains); all returned HTTP 400.
  • Contradiction: docs state filtering requires web_search; empirically it worked once on web_search_preview and never on web_search. Domain filtering is documented but was non-functional via the Python SDK as tested.
  • Domain filtering isn’t available in the interpreted track, which uses the Chat Completions API only.
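A filter_respected flag like the one reported above could be computed by checking every returned source URL against the allow-list. How the test suite actually derives the flag is not shown in this document, so the helper names and the subdomain-matching rule below are assumptions.

```python
from urllib.parse import urlparse


def host_matches(host: str, domain: str) -> bool:
    """True if host is the domain itself or one of its subdomains."""
    host, domain = host.lower(), domain.lower()
    return host == domain or host.endswith("." + domain)


def filter_respected(source_urls: list[str], allowed_domains: list[str]) -> bool:
    """True when every source URL belongs to one of the allowed domains.

    Note: an empty source list returns True, so callers should check
    the source count separately.
    """
    hosts = [urlparse(url).hostname or "" for url in source_urls]
    return all(
        any(host_matches(host, domain) for domain in allowed_domains)
        for host in hosts
    )
```

Matching on the parsed hostname rather than on a substring of the URL avoids counting a host such as notapnews.com as a match for apnews.com.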

Disambiguation

  • Both tracks resolved the ambiguous query "Python release" consistently to the programming language across all runs. Disambiguation behavior was the most stable finding in the suite.