Key Findings for OpenAI Web Search, ChatGPT-interpreted
Test Workflow
- Call the Chat Completions API with
gpt-4o-mini-search-preview - Give the model a detailed prompt asking it to describe what it retrieved -
result quality, recency, completeness, any failures - The model “always searches” before generating a response; no tool
plumbing exposed to the caller - Capture model’s full text response as the interpreted finding
- Capture inline
url_citationannotations frommessage.annotations
for cross-referencing against the raw track - The gap between the model’s self-report and raw citation counts is itself
a finding, discrepancies belong in the spec - Ensure results saved to
open-ai-web-search/results/chatgpt-interpreted/
Platform Limit Summary
| Limit | Observation |
|---|---|
| Citation Coun per Response |
0–20 high variance, non-deterministic |
search_context_sizeLatency Impact |
Consistent, high ~1.5–1.7× slower than low |
search_context_sizeCitation Impact |
Inconsistent across runs |
| Static Fact Search Skip | Non-deterministic, skipped in 2/3 runs |
| Self-reported Source Count Accuracy |
Unreliable, frequently overstates inline citations |
| Sources List all URLs consulted |
Not available, Chat Completions API doesn’t expose a sources field |
| Domain Filtering | Not available, Chat Completions search models don’t support filters |
| Tool Invocation Visibility |
Not available, search is implicit, no web_search_call item |
Results Details
Model: gpt-4o-mini-search-preview · 3 runs
*5 runs total: first two runs ran without credits, errored out
Cross-run Citation Counts
| Test | Label | R1 | R2 | R3 |
|---|---|---|---|---|
test_1_live_data |
Live data | 4 | 0 | 0 |
test_2_recent_event |
Recent event | 6 | 6 | 3 |
test_3_static_fact |
Static fact | 0 | 1 | 0 |
test_4_open_research |
Open-ended research | 6 | 3 | 1 |
test_5_ambiguous_query |
Ambiguous query | 3 | 3 | 2 |
test_6_search_context_low |
context_size: low |
3 | 1 | 4 |
test_7_search_context_high |
context_size: high |
4 | 9 | 3 |
test_8_multi_hop |
Multi-hop research | 8 | 20 | 9 |
Truncation Analysis
| # | Finding | Tests | Observed | Spec Contribution |
|---|---|---|---|---|
| 1 | Citation count highly non-deterministic | test_1 test_8 |
test_8_multi_hop ranged 8–20 citations; test_1_live_data returned 4 in run 1, 0 in runs 2-3; no test produced identical citation counts |
Citation count isn’t reliable proxy for search depth or result quality in this track |
| 2 | “Always-search” model doesn’t always produce citations, doesn’t always search | test_1 |
test_1_live_data returned 0 citations in runs 2-3 yet produced accurate live BTC prices in a structured block with no url_citation annotations; model retrieved live data without citing |
citation_count == 0 doesn’t mean search wasn’t performed; citation count ≠ search invocation in this track |
| 3 | Static fact search behavior inconsistent | test_3 |
run 1: 0 citations, stated “answered from memory”; run 2: 1 citation - Britannica - stated “answered from memory,” but searched anyway; run 3: 0 citations, stated “answered from memory” | Model’s self-report of search behavior isn’t reliable indicator of search performance |
| 4 | Self-reported source counts diverge significantly from inline citation counts | test_4 test_6 |
test_6 run 2 reported “10 sources,” but produced 1 inline citation. test_4 run 3 reported “12 distinct sources,” but produced 1 citation, a YouTube video |
Self-reported counts aren’t verifiable from the response object; no sources field equivalent exists in Chat Completions |
| 5 | search_context_size latency impact consistent; citation impact isn’t |
test_6 test_7 |
high was consistently ~1.5–1.7× slower than low; citation counts didn’t follow same pattern in run 3, low - 4, outperformed high - 3; token count more reliably higher for high, see latency table below |
search_context_size reliable latency lever, but it’s not a reliable citation-depth lever |
| 6 | Multi-hop query produces highest variance | test_8 |
Citation range: 8–20. Latency range: 8406–9869 ms; token range: 916–1333; run 2 produced fully structured Markdown table; run 1, run 3 used inline prose citations only | Response format non-deterministic in addition to citation count for complex multi-source queries |
| 7 | Ambiguous query resolves consistently to programming language | test_5 |
Defaulted to Python programming language; all acknowledged the animal interpretation but deprioritized without prompting; no run searched for the animal first | Disambiguation behavior most stable finding, more consistent than citation count for any other test |
search_context_size Latency Detail
| R1 | R2 | R3 | |
|---|---|---|---|
Low Latency ms |
2,983 | 4,725 | 2,888 |
High Latency ms |
6,256 | 8,203 | 4,490 |
Low Citations |
3 | 1 | 4 |
High Citations |
4 | 9 | 3 |
Agent Ecosystem Testing