Agent Ecosystem Testing

Key Findings for OpenAI Web Search, Raw


Raw Test Workflow

  1. Call the Responses API with gpt-4o and the web_search_preview tool enabled
  2. Give the model a minimal prompt; just enough to trigger retrieval
  3. The model may or may not invoke web_search_preview depending on the query
  4. Extract raw outcomes directly from response.output items:
    • web_search_call items: type, action.query - the internal search query issued
    • message items: output_text
  5. Extract sources list from response.sources - all URLs consulted, not just cited
  6. Extract token accounting from response.usage
  7. Run all analysis in Python: tool invocation flag, source counts, latency
  8. The model never interprets or reflects on the retrieval results
  9. Save results to open-ai-web-search/results/raw/
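
The extraction in steps 4 through 7 can be sketched roughly as follows. This is a minimal illustration, not the project's actual script. It assumes the response has already been converted to a plain dict, that each source entry is a dict, and that sources sit under each web_search_call item's action.sources (following the include path named in the Platform Limit Summary). The function name summarize_response and the returned keys are hypothetical.

```python
from typing import Any


def summarize_response(response: dict[str, Any], latency_ms: float) -> dict[str, Any]:
    """Pull raw fields out of a response dict without asking the model to interpret them."""
    output = response.get("output", [])
    search_calls = [item for item in output if item.get("type") == "web_search_call"]

    # Step 4: internal search queries issued by the tool.
    queries = [call.get("action", {}).get("query") for call in search_calls]

    # Step 4: concatenate output_text parts from message items.
    text_parts = [
        part.get("text", "")
        for item in output
        if item.get("type") == "message"
        for part in item.get("content", [])
        if part.get("type") == "output_text"
    ]

    # Step 5: all sources consulted (location under action.sources is an assumption).
    sources = [
        source
        for call in search_calls
        for source in call.get("action", {}).get("sources", [])
    ]

    # Steps 6 and 7: token accounting plus the derived analysis fields.
    return {
        "tool_invoked": len(search_calls) > 0,
        "search_queries_issued": queries,
        "output_text": "".join(text_parts),
        "source_count": len(sources),
        "usage": response.get("usage", {}),
        "latency_ms": latency_ms,
    }
```

Keeping the analysis in a pure function over a dict means the same code can be re-run against saved raw results without calling the API again.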

Platform Limit Summary

| Limit | Observation |
|---|---|
| Tool Invocation | Conditional; consistently skipped for static facts and trivial math |
| Tool Invocation Visibility | Available; explicit web_search_call item in response.output |
| search_context_size Latency Impact | Inconsistent; high was slower than low in runs 1 and 3, but faster than low in run 2 |
| search_context_size Source Count Impact | None observed; source count was 12 across all context sizes |
| Sources List (All URLs Consulted) | Available via include=["web_search_call.action.sources"] |
| Domain Filtering: Allow List | Worked once, on web_search_preview in run 2; broken on web_search across all subsequent runs |
| Domain Filtering: Block List | Never succeeded; filters parameter rejected in all configurations and models tested |
| search_queries_issued Date Accuracy | Unreliable; model appends training-era year to internal queries despite running in 2026 |
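
One way to flag the date problem automatically is to scan each issued query for a four-digit year older than the year of the test run. The sketch below is an assumption about how such a check could look, not the check used to produce these findings. The helper name stale_years and the sample query are hypothetical.

```python
import re

# Matches standalone four-digit years in the 2000s, e.g. "2024".
YEAR_PATTERN = re.compile(r"\b(20\d{2})\b")


def stale_years(query: str, current_year: int) -> list[int]:
    """Return every year mentioned in the query that is older than current_year."""
    years = [int(match) for match in YEAR_PATTERN.findall(query)]
    return [year for year in years if year < current_year]


# Hypothetical query of the kind described above, checked against a 2026 run.
print(stale_years("latest AI regulation news 2024", current_year=2026))
```

A non-empty result marks the query as carrying a stale year. Queries that legitimately ask about a past year would also be flagged, so the output still needs a manual look.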

Results Details

Run 5 covered test_8 and test_9 only, as a targeted domain filter retry on web_search_preview.
Run 5 used gpt-5; all other runs used gpt-4o.

Cross-run Tool Invocation

| Test | Label | R1 | R2 | R3 | R4 | R5 | R6 |
|---|---|---|---|---|---|---|---|
| test_1_live_data | Live Data | | | | | null | |
| test_2_recent_event | Recent Event | | | | | null | |
| test_3_static_fact | Static Fact | | | | | null | |
| test_4_trivial_math | Trivial Math | | | | | null | |
| test_5_open_research | Open-ended Research | | | | | null | |
| test_6_context_size_low | context_size Low | | | | | null | |
| test_7_context_size_high | context_size High | | | | | null | |
| test_8_domain_filter_allowed | Allow List Filter | ERR | ✓* | ERR | ERR | ERR | ERR |
| test_9_domain_filter_blocked | Block List Filter | ERR | ERR§ | ERR | ERR | ERR | ERR |
| test_10_ambiguous_query | Ambiguous Query | | | | | null | |

Domain Filter Error Progression

  • † Run 1: "Unknown parameter: 'tools[0].filters.type'". The initial schema included a type: "domain" key.
  • * Run 2, test_8: filter_respected: true, with 2 "apnews.com" sources, using web_search_preview + allowed_domains. This was the only success across all runs.
  • § Run 2, test_9: "Unknown parameter: 'tools[0].filters.excluded_domains'". This was the first block-list key attempt.
  • ‡ R3/R4/R6: "Unsupported parameter 'filters'" after switching to web_search per docs guidance.
  • ¶ Run 5 with gpt-5: "Unsupported parameter 'filters'". The model change produced an identical error.
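
A filter_respected flag like the one reported for Run 2 could be computed by checking every returned source URL against the allow list. The following is a sketch under assumptions: the helper names are hypothetical, subdomains are treated as matching their parent domain, and an empty source list counts as not respected because there is nothing to verify.

```python
from urllib.parse import urlparse


def host_matches(host: str, domain: str) -> bool:
    """True if host is the domain itself or one of its subdomains."""
    return host == domain or host.endswith("." + domain)


def filter_respected(source_urls: list[str], allowed_domains: list[str]) -> bool:
    """True only if there is at least one source and every source is on the allow list."""
    hosts = [(urlparse(url).hostname or "").lower() for url in source_urls]
    if not hosts:
        return False
    return all(
        any(host_matches(host, domain.lower()) for domain in allowed_domains)
        for host in hosts
    )
```

For a block list, the same host_matches helper could be reused with the condition inverted, so that no source host may match any excluded domain.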

search_context_size Latency Detail

| Metric | R1 | R2 | R3 |
|---|---|---|---|
| Low latency (ms) | 10,867 | 10,251 | 9,531 |
| High latency (ms) | 15,984 | 8,614 | 11,233 |
| Low source count | 12 | 12 | 12 |
| High source count | 12 | 12 | 12 |
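
The inconsistency is easier to see as a per-run difference. The short sketch below recomputes high minus low latency from the figures above; the variable and function names are illustrative only.

```python
# Latency figures in milliseconds, copied from the detail table.
low_ms = {"R1": 10867, "R2": 10251, "R3": 9531}
high_ms = {"R1": 15984, "R2": 8614, "R3": 11233}


def latency_deltas(low: dict[str, int], high: dict[str, int]) -> dict[str, int]:
    """Return high minus low latency per run; positive means high was slower."""
    return {run: high[run] - low[run] for run in low}


deltas = latency_deltas(low_ms, high_ms)
for run, delta in deltas.items():
    print(f"{run}: {delta:+d} ms")
```

The differences come out to +5,117 ms for R1, -1,637 ms for R2, and +1,702 ms for R3. With one sample per setting per run and a sign flip in R2, these three data points do not support a directional latency effect for search_context_size.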