Key Findings for OpenAI Web Search, ChatGPT-interpreted

Test Workflow

Call the Chat Completions API with gpt-4o-mini-search-preview
Give the model a detailed prompt asking it to describe what it retrieved -
result quality, recency, completeness, any failures
The model “always searches” before generating a response; no tool
plumbing exposed to the caller
Capture model’s full text response as the interpreted finding
Capture inline url_citation annotations from message.annotations
for cross-referencing against the raw track
The gap between the model’s self-report and raw citation counts is itself
a finding, discrepancies belong in the spec
Ensure results saved to open-ai-web-search/results/chatgpt-interpreted/

Platform Limit Summary

Limit	Observation
Citation Coun per Response	0–20 high variance, non-deterministic
`search_context_size` Latency Impact	Consistent, `high` ~1.5–1.7× slower than `low`
`search_context_size` Citation Impact	Inconsistent across runs
Static Fact Search Skip	Non-deterministic, skipped in 2/3 runs
Self-reported Source Count Accuracy	Unreliable, frequently overstates inline citations
Sources List all URLs consulted	Not available, Chat Completions API doesn’t expose a `sources` field
Domain Filtering	Not available, Chat Completions search models don’t support `filters`
Tool Invocation Visibility	Not available, search is implicit, no `web_search_call` item

Results Details

Model: gpt-4o-mini-search-preview · 3 runs

*5 runs total: first two runs ran without credits, errored out

Cross-run Citation Counts

Test	Label	R1	R2	R3
`test_1_live_data`	Live data	4	0	0
`test_2_recent_event`	Recent event	6	6	3
`test_3_static_fact`	Static fact	0	1	0
`test_4_open_research`	Open-ended research	6	3	1
`test_5_ambiguous_query`	Ambiguous query	3	3	2
`test_6_search_context_low`	`context_size`: low	3	1	4
`test_7_search_context_high`	`context_size`: high	4	9	3
`test_8_multi_hop`	Multi-hop research	8	20	9

Truncation Analysis

#	Finding	Tests	Observed	Spec Contribution
1	Citation count highly non-deterministic	`test_1` `test_8`	`test_8_multi_hop` ranged 8–20 citations; `test_1_live_data` returned 4 in run 1, 0 in runs 2-3; no test produced identical citation counts	Citation count isn’t reliable proxy for search depth or result quality in this track
2	“Always-search” model doesn’t always produce citations, doesn’t always search	`test_1`	`test_1_live_data` returned 0 citations in runs 2-3 yet produced accurate live BTC prices in a structured block with no `url_citation` annotations; model retrieved live data without citing	`citation_count == 0` doesn’t mean search wasn’t performed; citation count ≠ search invocation in this track
3	Static fact search behavior inconsistent	`test_3`	run 1: 0 citations, stated “answered from memory”; run 2: 1 citation - Britannica - stated “answered from memory,” but searched anyway; run 3: 0 citations, stated “answered from memory”	Model’s self-report of search behavior isn’t reliable indicator of search performance
4	Self-reported source counts diverge significantly from inline citation counts	`test_4` `test_6`	`test_6` run 2 reported “10 sources,” but produced 1 inline citation. `test_4` run 3 reported “12 distinct sources,” but produced 1 citation, a YouTube video	Self-reported counts aren’t verifiable from the response object; no `sources` field equivalent exists in Chat Completions
5	`search_context_size` latency impact consistent; citation impact isn’t	`test_6` `test_7`	`high` was consistently ~1.5–1.7× slower than `low`; citation counts didn’t follow same pattern in run 3, `low` - 4, outperformed `high` - 3; token count more reliably higher for `high`, see latency table below	`search_context_size` reliable latency lever, but it’s not a reliable citation-depth lever
6	Multi-hop query produces highest variance	`test_8`	Citation range: 8–20. Latency range: 8406–9869 ms; token range: 916–1333; run 2 produced fully structured Markdown table; run 1, run 3 used inline prose citations only	*Response format* non-deterministic** in addition to citation count for complex multi-source queries
7	Ambiguous query resolves consistently to programming language	`test_5`	Defaulted to Python programming language; all acknowledged the animal interpretation but deprioritized without prompting; no run searched for the animal first	Disambiguation behavior most stable finding, more consistent than citation count for any other test

`search_context_size` Latency Detail

	R1	R2	R3
`Low` Latency ms	2,983	4,725	2,888
`High` Latency ms	6,256	8,203	4,490
`Low` Citations	3	1	4
`High` Citations	4	9	3

Key Findings for OpenAI Web Search, ChatGPT-interpreted

Test Workflow

Platform Limit Summary

Results Details

Cross-run Citation Counts

Truncation Analysis

search_context_size Latency Detail

`search_context_size` Latency Detail