| 1 | Citation count is highly non-deterministic | test_1, test_8, all 3 runs | test_8_multi_hop ranged from 8 to 20 citations; test_1_live_data returned 4 in r1 and 0 in r2 & r3; no test produced identical citation counts across all 3 runs | Citation count isn’t a reliable proxy for search depth or result quality in this track |
| 2 | “Always-search” model doesn’t always produce citations and doesn’t always search | test_1, r2 & r3 | test_1_live_data returned 0 citations in r2 & r3 yet produced accurate live BTC prices in a structured block with no url_citation annotations; the model retrieved live data without citing it | citation_count == 0 doesn’t mean search wasn’t performed; citation count ≠ search invocation in this track |
| 3 | Static fact search behavior is inconsistent across runs | test_3, all 3 runs | r1: 0 citations, stated “answered from memory”; r2: 1 citation (Britannica), stated “answered from memory” but searched anyway; r3: 0 citations, stated “answered from memory” | The model’s self-report of search behavior isn’t a reliable indicator of whether a search was actually performed |
| 4 | Self-reported source counts diverge significantly from inline citation counts | test_4, test_6, r2 & r3 | test_6 r2 reported “10 sources” but produced 1 inline citation; test_4 r3 reported “12 distinct sources” but produced 1 citation, a YouTube video | Self-reported counts aren’t verifiable from the response object; the Chat Completions API has no equivalent of a sources field |
| 5 | search_context_size latency impact is consistent; citation impact isn’t | test_6 vs test_7, all 3 runs | high was consistently ~1.5–1.7× slower than low; citation counts didn’t follow the same pattern: in r3, low (4) outperformed high (3); token count was more reliably higher for high (see latency table below) | search_context_size is a reliable latency lever, but it’s not a reliable citation-depth lever |
| 6 | Multi-hop query produces the highest variance overall | test_8, all 3 runs | Citation range: 8–20; latency range: 8406–9869 ms; token range: 916–1333; r2 produced a fully structured Markdown table, while r1 & r3 used inline prose citations only | For complex multi-source queries, response format is non-deterministic in addition to citation count |
| 7 | Ambiguous query resolves consistently to programming language | test_5, all 3 runs | All 3 runs defaulted to the Python programming language; all acknowledged the animal interpretation but deprioritized it without prompting; no run searched for the animal first | Disambiguation behavior was the most stable finding across runs, more consistent than citation count for any other test |
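
Findings 1, 2 and 4 all rest on counting url_citation annotations in the response and comparing that count against what the model says about itself. The sketch below shows one way this could be done. It assumes the response has been converted to a plain dict with annotations under `choices[0].message.annotations`; the helper names, the regex heuristic for self-reported counts, and the sample response text are all hypothetical and not part of the original test harness.

```python
import re
from typing import Any, Optional


def count_url_citations(response: dict[str, Any]) -> int:
    """Count annotations of type 'url_citation' on the first choice.

    Assumes a Chat Completions-style dict; a missing or null
    'annotations' field is treated as zero citations.
    """
    message = response["choices"][0]["message"]
    annotations = message.get("annotations") or []
    return sum(1 for a in annotations if a.get("type") == "url_citation")


# Hypothetical heuristic: matches phrases like "10 sources" or
# "12 distinct sources" in the response text.
_SOURCES_RE = re.compile(r"(\d+)\s+(?:distinct\s+)?sources?\b", re.IGNORECASE)


def self_reported_sources(text: str) -> Optional[int]:
    """Return the first self-reported source count, or None if absent."""
    match = _SOURCES_RE.search(text)
    return int(match.group(1)) if match else None


def citation_spread(counts: list[int]) -> int:
    """Max minus min citation count across runs of the same test."""
    return max(counts) - min(counts)


# Made-up response shaped after test_6 r2: claims 10 sources, cites 1.
sample = {
    "choices": [
        {
            "message": {
                "content": "I consulted 10 sources for this answer.",
                "annotations": [
                    {
                        "type": "url_citation",
                        "url_citation": {"url": "https://example.com", "title": "Example"},
                    }
                ],
            }
        }
    ]
}

cited = count_url_citations(sample)
claimed = self_reported_sources(sample["choices"][0]["message"]["content"])
print(f"claimed={claimed} cited={cited}")
```

Note that a zero from `count_url_citations` only says no citations were attached; as finding 2 shows, it can't confirm that no search ran.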