Key Findings for OpenAI Web Search, Raw
Raw Test Workflow
- Call the Responses API with `gpt-4o`, `web_search_preview` tool enabled
- Give the model a minimal prompt; just enough to trigger retrieval
- The model may or may not invoke `web_search_preview` depending on the query
- Extract raw outcomes directly from `response.output` items:
  - `web_search_call` items: type, `action.query` - the internal search query issued
  - message items: `output_text`
- Extract sources list from `response.sources` - all URLs consulted, not just cited
- Extract token accounting from `response.usage`
- Run all analysis in Python: tool invocation flag, source counts, latency
- The model never interprets or reflects on the retrieval results
- Ensure results are saved to `open-ai-web-search/results/raw/`
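The workflow above can be sketched roughly as follows. This is a minimal illustration, not the harness used for these runs: the function names `summarize_response` and `run_once` are hypothetical, the parsing works on a plain dict (for example from `response.model_dump()`), and reading consulted URLs from `action.sources` is an assumption about where they appear in the payload.

```python
import time
from typing import Any, Dict


def summarize_response(response: Dict[str, Any], latency_ms: float) -> Dict[str, Any]:
    """Pull raw outcomes out of a Responses API payload given as a plain dict."""
    tool_invoked = False
    queries, texts, sources = [], [], []
    for item in response.get("output") or []:
        if item.get("type") == "web_search_call":
            tool_invoked = True
            action = item.get("action") or {}
            if action.get("query"):
                queries.append(action["query"])
            # Assumption: consulted URLs sit under action.sources when present.
            sources += [s.get("url") for s in action.get("sources") or []]
        elif item.get("type") == "message":
            for part in item.get("content") or []:
                if part.get("type") == "output_text":
                    texts.append(part.get("text", ""))
    return {
        "tool_invoked": tool_invoked,
        "search_queries": queries,
        "output_text": "\n".join(texts),
        "source_count": len(sources),
        "usage": response.get("usage"),
        "latency_ms": latency_ms,
    }


def run_once(prompt: str) -> Dict[str, Any]:
    """Call the API once and time it; needs the openai package and an API key."""
    from openai import OpenAI  # lazy import so the parsing code runs without it

    client = OpenAI()
    start = time.perf_counter()
    response = client.responses.create(
        model="gpt-4o",
        tools=[{"type": "web_search_preview"}],
        input=prompt,
    )
    latency_ms = (time.perf_counter() - start) * 1000
    return summarize_response(response.model_dump(), latency_ms)
```

Keeping the parsing separate from the network call means the extraction logic can be checked against saved payloads without spending tokens.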
Platform Limit Summary
| Limit | Observation |
|---|---|
| Tool Invocation | Conditional: skipped for static facts and trivial math, consistently across runs |
| Tool Invocation Visibility | Available: explicit `web_search_call` item in `response.output` |
| `search_context_size` Latency Impact | Inconsistent: high was slower than low in runs 1 and 3, but faster in run 2 |
| `search_context_size` Source Count Impact | None observed: source count was 12 across all context sizes |
| Sources List, All URLs Consulted | Available via `include=["web_search_call.action.sources"]` |
| Domain Filtering, Allow List | Worked once on `web_search_preview` in run 2; broken on `web_search` across all subsequent runs |
| Domain Filtering, Block List | Never succeeded: `filters` parameter rejected in all configurations and models tested |
| `search_queries_issued` Date Accuracy | Unreliable: model appends training-era year to internal queries despite running in 2026 |
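The date accuracy finding can be checked mechanically by scanning each internal query for a four-digit year earlier than the run year. The sketch below is illustrative only: `stale_years` is a hypothetical helper, and the example queries in the usage note are invented, not taken from the recorded runs.

```python
import re
from datetime import date
from typing import List, Optional

# Matches standalone four-digit years from 1900 to 2099.
YEAR_PATTERN = re.compile(r"\b(?:19|20)\d{2}\b")


def stale_years(query: str, current_year: Optional[int] = None) -> List[int]:
    """Return every year mentioned in the query that is earlier than current_year."""
    if current_year is None:
        current_year = date.today().year
    years = [int(match.group()) for match in YEAR_PATTERN.finditer(query)]
    return [year for year in years if year < current_year]
```

For a run in 2026, `stale_years("latest AI news 2024", 2026)` returns `[2024]`, while a query with no year or with the current year returns an empty list.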
Results Details
Run 5 covered `test_8` and `test_9` only, as a targeted domain filter retry on `web_search_preview`. Run 5 used `gpt-5`; all other runs used `gpt-4o`.
Cross-run Tool Invocation
| Test | Label | R1 | R2 | R3 | R4 | R5 | R6 |
|---|---|---|---|---|---|---|---|
| `test_1_live_data` | Live Data | ✓ | ✓ | ✓ | ✓ | null | ✓ |
| `test_2_recent_event` | Recent Event | ✓ | ✓ | ✓ | ✓ | null | ✓ |
| `test_3_static_fact` | Static Fact | ✗ | ✗ | ✗ | ✗ | null | ✗ |
| `test_4_trivial_math` | Trivial Math | ✗ | ✗ | ✗ | ✗ | null | ✗ |
| `test_5_open_research` | Open-ended Research | ✓ | ✓ | ✓ | ✓ | null | ✓ |
| `test_6_context_size_low` | `context_size` Low | ✓ | ✓ | ✓ | ✓ | null | ✓ |
| `test_7_context_size_high` | `context_size` High | ✓ | ✓ | ✓ | ✓ | null | ✓ |
| `test_8_domain_filter_allowed` | Allow List Filter | ERR† | ✓* | ERR‡ | ERR‡ | ERR¶ | ERR‡ |
| `test_9_domain_filter_blocked` | Block List Filter | ERR† | ERR§ | ERR‡ | ERR‡ | ERR¶ | ERR‡ |
| `test_10_ambiguous_query` | Ambiguous Query | ✓ | ✓ | ✓ | ✓ | null | ✓ |
Domain Filter Error Progression
- †, Run 1: `"Unknown parameter: 'tools[0].filters.type'"`, initial schema with `type: "domain"` key
- *, Run 2 `test_8`: `filter_respected: true`, 2 “apnews.com” sources, `web_search_preview` + `allowed_domains`, only success across all runs
- §, Run 2 `test_9`: `"Unknown parameter: 'tools[0].filters.excluded_domains'"`, first block-list key attempt
- ‡, R3/R4/R6: `"Unsupported parameter 'filters'"` after switching to `web_search` per docs guidance
- ¶, Run 5 with `gpt-5`: `"Unsupported parameter 'filters'"`, model change produced identical error
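A `filter_respected` flag like the one reported for run 2 could be derived by comparing each source URL's hostname against the configured domains. The sketch below is one plausible way to compute it, not necessarily how these results were produced; `host_matches`, `filter_respected`, and `block_respected` are hypothetical names, and treating subdomains as matches is an assumption.

```python
from typing import Iterable, List
from urllib.parse import urlparse


def host_matches(url: str, domain: str) -> bool:
    """True if the URL's host is the domain itself or one of its subdomains."""
    host = (urlparse(url).hostname or "").lower()
    domain = domain.lower()
    return host == domain or host.endswith("." + domain)


def filter_respected(source_urls: List[str], allowed_domains: Iterable[str]) -> bool:
    """Allow list: at least one source, and every source is on an allowed domain."""
    allowed = list(allowed_domains)
    return bool(source_urls) and all(
        any(host_matches(url, domain) for domain in allowed) for url in source_urls
    )


def block_respected(source_urls: List[str], blocked_domains: Iterable[str]) -> bool:
    """Block list: no source comes from a blocked domain."""
    blocked = list(blocked_domains)
    return not any(
        host_matches(url, domain) for url in source_urls for domain in blocked
    )
```

Matching on the parsed hostname rather than a substring avoids counting a lookalike host such as `notapnews.com` as a hit for `apnews.com`.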
search_context_size Latency Detail
| | R1 | R2 | R3 |
|---|---|---|---|
| Low Latency (ms) | 10,867 | 10,251 | 9,531 |
| High Latency (ms) | 15,984 | 8,614 | 11,233 |
| Low Source Count | 12 | 12 | 12 |
| High Source Count | 12 | 12 | 12 |
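The inconsistency is easiest to see as a per-run difference between the high and low latencies. The short sketch below uses the figures from the table; the variable and function names are illustrative.

```python
from typing import Dict

# Latencies in milliseconds, copied from the table above.
low_ms = {"R1": 10867, "R2": 10251, "R3": 9531}
high_ms = {"R1": 15984, "R2": 8614, "R3": 11233}


def latency_deltas(low: Dict[str, int], high: Dict[str, int]) -> Dict[str, int]:
    """High minus low per run; a positive value means high was slower."""
    return {run: high[run] - low[run] for run in low}


deltas = latency_deltas(low_ms, high_ms)
# deltas == {"R1": 5117, "R2": -1637, "R3": 1702}

# The effect is consistent only if every run moves in the same direction.
same_direction = len({delta > 0 for delta in deltas.values()}) == 1
print(deltas, same_direction)
```

High was 5,117 ms slower in R1 and 1,702 ms slower in R3, but 1,637 ms faster in R2, so `same_direction` is `False`. With three runs and a sign flip, the data does not support a latency effect in either direction.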
Agent Ecosystem Testing