Key Findings for Codex’s Web Search Behavior, GPT-interpreted - Desktop
Test Workflow
- Run python framework.py --test {test ID} --track codex-interpreted
- Review terminal output
- Copy the provided prompt asking agent to report on fetch results: character count, token estimate, truncation status, content completeness, Markdown formatting integrity, and tool visibility
- Open a new session in the Codex desktop app, paste the prompt into the chat window
- Approve curl escalation and shell permission requests; skip requests for runs of local scripts
- Capture the agent’s full response; observe the gap between self-report and actual retrieval behavior as the interpreted finding
- Log structured metadata as described in framework-reference.md
- Ensure results saved to /results/codex-interpreted/results.csv
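The logging step could be sketched roughly as below. This is a minimal sketch under stated assumptions, not the framework's actual schema: the column names are hypothetical, derived from the fields the prompt asks the agent to report, and append_result is an illustrative helper. framework-reference.md remains the authority on the real metadata layout.

```python
import csv
from pathlib import Path

# Hypothetical column names, derived from the fields the prompt asks the
# agent to report; the real schema is defined in framework-reference.md.
FIELDS = [
    "test_id", "llm_version", "intelligence_level",
    "char_count", "token_estimate", "truncation_status",
    "content_completeness", "markdown_integrity", "tool_visibility",
]


def append_result(csv_path, row):
    """Append one run's metadata, writing the header if the file is new."""
    path = Path(csv_path)
    path.parent.mkdir(parents=True, exist_ok=True)
    is_new = not path.exists() or path.stat().st_size == 0
    with path.open("a", newline="", encoding="utf-8") as handle:
        writer = csv.DictWriter(handle, fieldnames=FIELDS)
        if is_new:
            writer.writeheader()
        # Missing keys become empty cells instead of raising an error.
        writer.writerow({name: row.get(name, "") for name in FIELDS})
```

Appending rather than rewriting keeps earlier runs intact, which matters when the same test ID is repeated across LLM versions and intelligence levels.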
Platform Limit Summary
| Limit | Observed |
|---|---|
| Hard Character Limit | None detected via curl path: successful curl fetches returned payloads from 660 chars to 3.1 MB with no ceiling hit; output chars on the web path reflect a wordlim: 200 window, not a byte ceiling |
| Hard Token Limit | None detected via curl path: token counts ranged from ~24 to ~835,000; display truncation confirmed at ~12,970 tokens in EC-6 tool output rendering, independent of HTTP retrieval |
| Output Consistency | LLM-version-stratified: same URL and intelligence level produced distinct output sizes and tool strategies across GPT-5.2 through GPT-5.5; intelligence level weaker predictor than LLM version |
| Content Selection Behavior | Two-tier retrieval: web returns rendered text extraction often with wordlim: 200; full content requires curl escalation with elevated network permissions; SC-1: GPT-5.3-Codex Extra High only agent to report response_length suggesting wordlim: 200 soft-cap, agent-adjustable parameter |
| Truncation Pattern | Three independent truncation layers: web line-indexed window, LLM/URL dependent - L237–L657, EC-6’s terminal display cap ~12,970 tokens, and underlying curl response Loading... placeholders |
| web Line-Indexed Window | LLM-version-URL-dependent: agent’s choice, varied across sessions, rarely started at L0 - BL-1: L140, BL-3: L453, EC-1: L479, OP-1: L305/L477/L552, OP-4: L237, SC-1: L362/L478, SC-3: L266/L309/L353, SC-4: L316/L657 |
| curl Escalation | LLM-version-dependent: GPT-5.2 requires Medium+ intelligence; GPT-5.3-Codex typically Medium+; GPT-5.4 escalates at Low; GPT-5.5 bypasses web pipeline at all levels without exposing reasoning |
| Session Contamination | Fresh fetch confound: prior sessions’ artifacts persist across runs in Documents/Codex while /private/tmp clears between sessions; filename reuse observed in 42 / 261 runs, while explicit artifact reuse reported less often; write-save location pattern nondeterministic |
| Post-Session Auto-Editing | Data integrity risk: continues processing sessions after chats in and out of archives - output editing, thought panel collapse with reasoning and/or command execution removed, timer drift and/or removal - GPT-5.2 timers removed completely; Auto-review, Full Access disabling has no impact on this behavior |
| JS-Rendered Pages | Structural retrieval failure: SC-2 - Next.js/Netlify and BL-3 - Next.js/Gatsby tutorial body absent from static extraction regardless of intelligence level; curl returns app shell only |
| Cache Miss Failure | Systematic: agents reported web: Cache Miss on EC-6 mutable, raw GitHub URL across all runs that attempted it; additional test ruled out a blanket host block |
| Self-reported Completeness | curl-anchored: agents conflate curl body completeness with overall retrieval completeness even if artifacts display otherwise; web truncation consistently underreported in summary assessments |
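One way to make the session contamination row checkable before a run is to list workspace files that predate the session. The sketch below is an illustration under assumptions, not part of the framework: stale_artifacts is a hypothetical helper, and the workspace path passed to it (for example a Documents/Codex folder under the home directory) is assumed rather than confirmed.

```python
from pathlib import Path


def stale_artifacts(workspace, session_start):
    """Return files under `workspace` last modified before `session_start`.

    `session_start` is a Unix timestamp in seconds. Anything older than it
    was written by an earlier session and could be reused instead of fetched.
    """
    root = Path(workspace).expanduser()
    if not root.is_dir():
        return []
    return sorted(
        path for path in root.rglob("*")
        if path.is_file() and path.stat().st_mtime < session_start
    )
```

Calling it with the current time just before opening a new session, for instance stale_artifacts("~/Documents/Codex", time.time()), yields the candidates an agent might silently reuse; an empty list suggests the run starts clean.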
Results Details
| Track | T1 GPT-interpreted, Codex Desktop App |
| Agents Observed | GPT-5.2, GPT-5.3-Codex, GPT-5.4-Mini, GPT-5.4, GPT-5.5 |
| Intelligence Levels | Low, Medium, High, Extra High |
| Total Runs | 261 |
| Distinct URLs | 13 |
| Input Size Range | EC-3: ~660 chars to BL-3: ~3.1 MB |
| Truncation Events | 195 / 261, ~75% of agents report truncation in some form; web-only path with limits reported explicitly: 42; web→curl path with web limits reported explicitly: 114; web→curl path with web limits implied in reasoning: 39; curl-only path and/or no truncation signal: 66 |
| Average Output Size | 351,961 chars |
| Output Size Range | 95 - 3,103,342 chars |
| Average Token Use | 88,489 tokens |
| Token Count Range | 24 - 835,000 tokens |
| Workspace Substitution | 2 / 261 runs confirmed, contamination risk flagged in ~40 additional runs |
| curl Escalation | Dominant retrieval path, present in 69% of track, ~180 / 261 runs |
| web Bypass | GPT-5.5 at all intelligence levels skipped web completely on at least one URL |
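The run counts in this table can be cross-checked with a few lines of arithmetic. The sketch below uses only figures from the rows above; the variable names and the share helper are illustrative.

```python
TOTAL_RUNS = 261

# Counts per retrieval path, as reported in the Truncation Events row.
truncated_paths = {
    "web-only, limits reported explicitly": 42,
    "web->curl, web limits reported explicitly": 114,
    "web->curl, web limits implied in reasoning": 39,
}
no_truncation_signal = 66  # curl-only path and/or no truncation signal
curl_escalations = 180     # approximate, per the curl Escalation row


def share(count, total=TOTAL_RUNS):
    """Return count as a percentage of total, rounded to a whole number."""
    return round(100 * count / total)


truncation_events = sum(truncated_paths.values())

# Every run should land in exactly one bucket.
assert truncation_events + no_truncation_signal == TOTAL_RUNS

# Average chars per token implied by the two averages in the table.
chars_per_token = 351_961 / 88_489

print(f"truncation events: {truncation_events} ({share(truncation_events)}%)")
print(f"curl escalation: {share(curl_escalations)}%")
print(f"chars per token: {chars_per_token:.2f}")
```

The chars-per-token figure is only as good as the agents' self-reported token estimates, so it is better read as a consistency check than as a tokenizer measurement.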
Content Access x Intelligence
Agentic task completion isn’t a useful signal for page readability. For Codex, retrieval strategy largely determines content accessibility; its web tool
returns a rendered text extraction window, but it’s up to the agent to use it, and most agents didn’t, at least not completely. Agents across this track most often
started with web, recognized its limits, and pivoted to curl to complete the task, but curl returns a raw HTTP body whose readability depends entirely on that
page’s architecture. For JS-rendered pages, curl delivers app shells with prose absent. Agents rarely distinguished between having fetched a URL and
having read it.
The heatmap below encodes retrieval strategy, not task outcome. Rows are reasoning/intelligence levels, with LLM version as a sub-grouping. Columns are URLs ordered
by content accessibility difficulty, left to right: static payloads → large static HTML → JS-rendered and/or SPAs where curl returns mostly scaffolding.
While curl is an appropriate choice to calculate metrics for some URLs, a prompt with context-specific questions - summarize a section, locate a specific value in
the documentation - may have produced a different signal. This track instead uncovers a proxy: agents that used web long enough to traverse page text completely
performed something closer to reading prose, as in, accessed semantic context, but agents that pivoted to curl may have retrieved code they never processed
as text.
The column grouping makes the practitioner-relevant question legible: agents working with pages in the left two groups had readable content to process regardless of
toolchain. Agents working with pages in the right group - EC-1’s SPA extraction at ~10% of raw, BL-3’s JS-rendered tutorial body absent from every fetch,
SC-2’s CSP-nonce-gated app shell - retrieved bytes but perhaps didn’t meaningfully read regardless of intelligence level or method. Depending on LLM-version,
intelligence level, and page architecture, the curl-only cells sometimes represent the highest task effort with the lowest content accessibility.
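The gap between fetching and reading can be approximated by comparing visible text to raw markup in a curl body. The following is a minimal sketch using Python's standard html.parser; visible_text_ratio is a hypothetical helper, and treating a low ratio as a sign of an app shell is a heuristic assumption, not something this track measured.

```python
from html.parser import HTMLParser


class _VisibleText(HTMLParser):
    """Collect text outside <script>, <style>, <noscript>, and <template>."""

    SKIP = {"script", "style", "noscript", "template"}

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self._skip_depth = 0
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep only non-empty text that is not inside a skipped element.
        if self._skip_depth == 0 and data.strip():
            self.chunks.append(data.strip())


def visible_text_ratio(html):
    """Return visible-text chars divided by raw HTML chars (0.0 if empty)."""
    if not html:
        return 0.0
    parser = _VisibleText()
    parser.feed(html)
    parser.close()
    visible = " ".join(parser.chunks)
    return len(visible) / len(html)
```

A body dominated by script bundles and a Loading... placeholder scores near zero, while a static article scores much higher. Where to draw the line is a judgment call that would need calibrating against known pages.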
Truncation Analysis
| # | Finding | Tests | Observed | Conclusion |
|---|---|---|---|---|
| 1 | web returns line-indexed rendered text extraction window, not full page | All tests | Returns a line-numbered, HTML-to-text-extracted viewport; wordlim: 200 in output across BL-1, OP-1, SC-3, SC-4; Total lines: N reported for most URLs | Output chars on web path reflect viewport depth, not retrieval ceiling; curl only path to raw HTTP body |
| 2 | No fixed character or token ceiling detected on curl path | BL-1, BL-3, OP-1, OP-4, SC-3 | BL-3: GPT-5.2 Medium largest valid fetch ~3.1M chars; OP-4: GPT-5.5 Low ~514,092 chars in 27 seconds with 8% context | Char/token constraint LLM-version-gated access, not architecturally defined |
| 3 | Three independent truncation layers disambiguated | BL-1, EC-6, OP-4 | BL-1: GPT-5.4 Low first isolated all three: web window, terminal display cap, underlying HTTP body; EC-6 confirmed ~12,970-token display cap independent of file size; OP-4: GPT-5.4 Extra High named all three layers | Self-reported truncation tool-dependent; agents frequently report “no truncation” for curl while web truncation noted in passing or implied |
| 4 | curl escalation capability LLM-version-gated, not intelligence-level-gated for newer versions | BL-1, BL-3, OP-4, SC-3 | GPT-5.2 requires High+ for curl; GPT-5.4 escalates at Low; GPT-5.5 skips web entirely at all levels; within GPT-5.4-Mini, DNS sandbox failures suppressed escalation | curl-first behavior LLM-version property; capability threshold collapsed from High to Low between GPT-5.2 and GPT-5.4 |
| 5 | Higher intelligence levels don’t produce better retrieval outcomes, Extra High shows cost/yield regression | BL-1, EC-1, OP-4, SC-2 | GPT-5.4-Mini Extra High spent 85 seconds on a 3-part fetch strategy matching Medium’s single-fetch result; EC-1: GPT-5.2 Extra High looped ~48 minutes on 113 web calls without escalating; OP-4: GPT-5.5 Low retrieved 514 KB in 27 seconds vs GPT-5.2 High looping ~14 minutes at 45% context | Intelligence level governs tool sophistication, not task success; Extra High consistently produces diminishing returns against web-focused prompt |
| 6 | Session contamination persistent confound | BL-1, BL-2, BL-3, EC-1, EC-6, SC-2, SC-4 | Documents/Codex persists across sessions; artifact filenames reused across runs confirmed in 20+ cases; BL-2: GPT-5.5 High likely read prior session artifact rather than fetching; BL-1: GPT-5.4 Extra High completed task in 42 seconds vs Low’s ~2 minutes due to reuse | Intelligence level not independent variable within shared sessions |
| 7 | JS-rendered pages produce a structural retrieval failure, not a truncation event | BL-3, SC-2 | SC-2: Next.js / Netlify - web returns a consistent 142-line pre-hydration shell; nonce-based CSP, no-store cache policy prevent JS execution on any path; BL-3 tutorial body absent from static extraction at a reproducible structural position L385-L389 | Neither web nor curl returns content for CSP-gated JS-rendered pages - fundamental retrieval barrier not addressable by escalation |
| 8 | Cache Miss is systematic for large, mutable payloads | EC-6 | 17 of 20 web-runs on raw GitHub URL received Cache Miss (no content retrieved); smaller raw.githubusercontent.com sized doc confirmed host isn’t fully blocked; no run investigated or diagnosed failure before pivoting to curl | Failure is URL-size-class-specific to raw GitHub payloads; agents report what succeeded, not what failed |
| 9 | web window LLM-version-correlated on same URL | OP-2, OP-4, SC-3 | OP-2: L317 dominant cutpoint for GPT-5.2-5.4; L590 for GPT-5.5; OP-4: L237 for GPT-5.2-5.4; L616 for GPT-5.5 Extra High; SC-3: L266 dominant for GPT-5.2/5.4-Mini; L353 for GPT-5.3-Codex/5.5 | Viewport window scales across LLM generations; same URL returns a larger first-fetch window in newer LLM versions |
| 10 | wordlim: 200 soft default, not hard cap | BL-1, OP-1, OP-4, SC-3, SC-4 | SC-1: GPT-5.3-Codex Extra High named response_length short vs long parameter distinction - short mode stopping ~L362, long mode ~L478; BL-3: GPT-5.4 Extra High re-issued web in “long response mode,” localized truncation boundary to L385-L389; SC-3: GPT-5.4 Extra High observed both L266, L353 in a session by varying response length settings; SC-4 shows two-stage L316→L657 pattern consistent with narrow-then-wider window sequence | wordlim: 200 pattern agent-dependent, not fixed infrastructure ceiling, not consistently named |
| 11 | multi_tool_use.parallel exclusive to GPT-5.4 Extra High-GPT-5.5 | Most tests | Not observed in GPT-5.2 or GPT-5.3-Codex at any intelligence level; first appeared in GPT-5.4 Extra High; consistent across all GPT-5.5 levels | Parallel tool invocation is LLM-version capability, not an intelligence-level default |
Retrieval Outcomes
Output chars on the web surface aren’t a retrieval ceiling metric, but reflect how far the agent traversed through a line-indexed renderer.
Agents wrote-saved a variety of artifacts unprompted in which curl body size was partially observable. Raw tracks are intended to document precise
artifact measurements. Rows below are organized by page architecture:
raw files → static HTML → reference/wiki → JS-rendered/SPA
| Test | Expected | Received | Content Accessibility | Agent Characterization |
|---|---|---|---|---|
| EC-3 Redirect JSON | ~2 KB | web: 660 chars; curl: 254 bytes | 100% | Complete: web pipeline likely pads response with wrapper text; curl returns raw body; neither represents truncation |
| BL-2 Raw Markdown | ~20 KB | 200: 6,024 char count; 400: 95 char count | 200: 100% | Complete, but misidentified: mixed format caused persistent false truncation reports across all LLMs; actual size consistently confirmed - 6,024 chars |
| EC-6 Raw GitHub Markdown | ~60 KB | web: Cache Miss; curl: 91,869 char count | ~100% body; display cap | No retrieval truncation: web Cache Miss error systematic while curl is complete; display truncation at ~12,970 tokens is a terminal rendering cap, not a fetch limit |
| SC-4 Markdown Guide | ~30 KB | web: L316/L657; curl: 64,527 char count | web: 50%; curl: 100% | Complete via curl: web delivers a pageable line-indexed window; L316/L657 cutpoints land mid-document at non-structural boundaries |
| SC-1 Gemini API Docs | ~40 KB | web: 18–33K char range; curl: ~121.4K char count | web: 15-27%; curl: 100% | Complete via curl: web L362 short-mode ceiling confirmed; second fetch recovered through L478; truncation lands on page-content notice, not a structural boundary |
| OP-2 MDN Reference | ~120 KB | web: L317/L590; curl: 240,370 char count | web: 13-25%; curl: 100% | Complete via curl: web line window LLM-version-correlated; both cutpoints land mid-sentence at non-structural positions |
| BL-1 MongoDB Reference | ~85 KB | web: 1.6–61K, ~19K–85K; curl: 505,339 char count | web: ~0.3–17%; curl: 100% | LLM-intelligence-tool-dependent: GPT-5.2-5.3-Codex lower range, 5.4-Mini upper range; web truncated at extraction’s line boundary, suggesting tool ceiling as content beyond L140/L477 not retrieved; cutpoint L477 consistent across 5.2 Medium, 5.3-Codex High-Extra High; GPT-5.4-5.5 use of curl returned full response body, disambiguated truncation layers |
| OP-4 CommonMark Spec | ~500 KB | web: L237-L616; curl: 514,092 char count | web: 2-3%; curl: 100% | Complete via curl: GPT-5.2-5.4 stopped ~L237 while 5.5 Extra High stopped ~L616; GPT-5.2 High looped 14m24s at 45% context; three truncation layers identified in a single run |
| OP-1 Wikipedia with URL Fragment | ~40 KB | web: L305/L552; curl: 693,475 char count | web: ~0.5-4%; curl: 100% | Complete via curl: #History silently dropped by both tools; full article retrieved without targeted section; web consistent cutpoint L552; content accessibility calculated by token estimates |
| SC-3 Wikipedia Table-Heavy | ~100 KB | web: L266/L309/L353; curl: 785,605 char count | web: 1-3%; curl: 100% | Complete via curl: web window varies across LLM versions; wordlim: 200 confirmed as soft default; three distinct cutoff points across 21 runs rule out an architecturally fixed ceiling |
| EC-1 Gemini API Docs | ~100 KB | web: 13K–13.4K char range; curl: 132,894 char count | web: 10%; curl: 100% | Extraction ratio gap: web consistently delivers 10% of HTML; GPT-5.2 Extra High called web 113 times for 48m10s, never pivoting to curl |
| SC-2 Anthropic API Docs | ~80 KB | web: L142; curl: ~511K–519K char range | Not accessible | Incomplete HTML shell, prose absent: reference prose is JS-hydrated, CSP nonce-gating prevents JS execution on any fetch path; curl delivers navigation scaffolding and/or data bundles, not documentation; artifacts include Loading... placeholders |
| BL-3 MongoDB Tutorial | ~250 KB | web: L453; curl: ~3.1 MB char estimate | Not accessible | Complete HTML shell, prose absent: tutorial walkthrough is client-side rendered and not represented in static payload regardless of fetch strategy; documentation body not examined in web L385–L389 extraction |
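For rows where both paths report character counts, the Content Accessibility percentages appear consistent with web chars divided by curl chars. The sketch below reproduces two rows on that assumption; accessibility_pct is an illustrative helper, and the exact method the track used is not confirmed here (the OP-1 row, for instance, notes token estimates instead).

```python
def accessibility_pct(web_chars, curl_chars):
    """Return web-path chars as a percentage of the curl body size."""
    if curl_chars <= 0:
        raise ValueError("curl_chars must be positive")
    return 100 * web_chars / curl_chars


# Figures taken from the SC-1 and EC-1 rows above.
sc1_low = accessibility_pct(18_000, 121_400)
sc1_high = accessibility_pct(33_000, 121_400)
ec1 = accessibility_pct(13_000, 132_894)

print(f"SC-1 web: {sc1_low:.0f}-{sc1_high:.0f}%")  # SC-1 web: 15-27%
print(f"EC-1 web: {ec1:.0f}%")                     # EC-1 web: 10%
```

Because the web path returns extracted text while curl returns raw markup, this ratio mixes two different units and is best read as a rough proxy for how much of the page an agent saw, not as a byte-for-byte coverage figure.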
Agent Ecosystem Testing