<|°_°|> Agent Ecosystem Testing GitHub ↗

ChatGPT-interpreted vs Raw

Two Python scripts test the same web search behaviors:

ChatGPT-interpreted captures what gpt-4o-mini-search-preview believes it retrieved: how many sources it consulted, whether results felt current, how it characterizes search depth. This is the model’s self-report. The raw track captures what the API actually returned: exact tool_invoked flags, exact source lists from web_search_call.action.sources, exact token counts from response.usage. These are Python len() calls and dictionary lookups, not model estimates.

The gap between these two tracks is itself a finding. If the interpreted track reports “12 distinct sources” but the raw source_count is 1, that discrepancy belongs in the spec.

  web_search_test.py web_search_test_raw.py
API Chat Completions API Responses API
Measures Model’s interpretation of what it retrieved Raw metadata extracted directly from response object
Search invocation Implicit - model always searches, no visibility Explicit web_search_call item in response.output
Source counts Model self-report - frequently overstates Python len() on web_search_call.action.sources - exact
Citation counts url_citation annotations in message.annotations Not applicable, raw track doesn’t count inline citations
Internal query Not exposed action.query string from web_search_call item - exact
max_output_tokens Not set - model writes full assessments Set to 256 - minimal output, metadata is the signal
Token cost per run Higher - model writes long self-assessments Lower - minimal prompt, capped output
Domain filtering Not available - Chat Completions API only Available on web_search tool, non-functional as tested
Sources list Not available, Chat Completions API only Available via include=["web_search_call.action.sources"]
Best used for Understanding what the model perceives it retrieved Citable measurements for the spec

Agent Docs Spec - Known Platform Limits

Both tracks agree on the following and are the citable findings for the spec:

Tool Invocation

Citation and Source Counts

search_context_size

Internal Query Construction

Domain Filtering

Disambiguation