Key Findings for Codex’s Web Search Behavior, `GPT`-interpreted - Desktop

Test Workflow

Run `python scripts/framework.py --test {test ID} --track codex-interpreted`
Review terminal output
Copy the provided prompt asking the agent to report on fetch results: character count, token estimate, truncation status, content completeness, Markdown formatting integrity, and tool visibility
Open a new session in the Codex desktop app, paste the prompt into the chat window
Approve curl escalation and shell permission requests; skip requests to run local scripts
Capture the agent’s full response; observe the gap between self-report and actual retrieval behavior as the interpreted finding
Log structured metadata as described in framework-reference.md
Ensure results saved to /results/codex-interpreted/results.csv
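
The logging step can be sketched as follows. This is a minimal illustration, not the framework's own logger: the column names in `FIELDS` and the helper `append_result` are hypothetical, and the real schema is whatever framework-reference.md specifies.

```python
import csv
from pathlib import Path

# Hypothetical column names; the real schema lives in framework-reference.md.
FIELDS = [
    "test_id",
    "model",
    "intelligence",
    "tool_path",
    "output_chars",
    "truncation_reported",
]


def append_result(csv_path, row):
    """Append one run to the results CSV, writing a header if the file is new."""
    path = Path(csv_path)
    path.parent.mkdir(parents=True, exist_ok=True)
    is_new = not path.exists() or path.stat().st_size == 0
    with path.open("a", newline="", encoding="utf-8") as handle:
        writer = csv.DictWriter(handle, fieldnames=FIELDS)
        if is_new:
            writer.writeheader()
        writer.writerow(row)
```

Appending rather than rewriting keeps earlier runs intact, which matters when many sessions write to the same results file.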

Platform Limit Summary

Limit	Observed
Hard Character Limit	None detected via `curl` path: successful `curl` fetches returned payloads from 660 chars to 3.1 MB with no ceiling hit; output chars on the `web` path reflect a `wordlim: 200` window, not a byte ceiling
Hard Token Limit	None detected via `curl` path: token counts ranged from ~24 to ~835,000; display truncation confirmed at ~12,970 tokens in `EC-6` tool output rendering, independent of HTTP retrieval
Output Consistency	LLM-version-stratified: same URL and intelligence level produced distinct output sizes and tool strategies across `GPT-5.2` through `GPT-5.5`; intelligence level weaker predictor than LLM version
Content Selection Behavior	Two-tier retrieval: `web` returns rendered text extraction often with `wordlim: 200`; full content requires `curl` escalation with elevated network permissions; `SC-1`:`GPT-5.3-Codex Extra High` only agent to report `response_length` suggesting `wordlim: 200` soft-cap, agent-adjustable parameter
Truncation Pattern	Three independent truncation layers: `web` line-indexed window, LLM/URL dependent - `L237–L657`, `EC-6`’s terminal display cap ~12,970 tokens, and underlying `curl` response `Loading...` placeholders
`web` Line-Indexed Window	LLM-version-URL-dependent: agent’s choice, varied across sessions, rarely started at `L0` - `BL-1`:`L140`, `BL-3`:`L453`, `EC-1`:`L479`, `OP-1`:`L305/L477/L552`, `OP-4`:`L237`, `SC-1`:`L362/L478`, `SC-3`:`L266/L309/L353`, `SC-4`:`L316/L657`
`curl` Escalation	LLM-version-dependent: `GPT-5.2` requires `Medium`+ intelligence; `GPT-5.3-Codex` typically `Medium`+, `GPT-5.4` escalates at `Low`, `GPT-5.5` bypasses `web` pipeline at all levels without exposing reasoning
Session Contamination	Fresh fetch confound: prior sessions’ artifacts persist across runs in `Documents/Codex` while `/private/tmp` clears between sessions; filename reuse observed in 42 / 261 runs, while explicit artifact reuse reported less often; write-save location pattern nondeterministic
Post-Session Auto-Editing	Data integrity risk: continues processing sessions after chats in and out of archives - output editing, thought panel collapse with reasoning and/or command execution removed, timer drift and/or removal - `GPT-5.2` timers removed completely; `Auto-review`, `Full Access` disabling has no impact on this behavior
JS-Rendered Pages	Structural retrieval failure: `SC-2` - Next.js/Netlify and `BL-3` - Next.js/Gatsby tutorial body absent from static extraction regardless of intelligence level; `curl` returns app shell only
`Cache Miss` Failure	Systematic: agents reported `web`:`Cache Miss` on `EC-6` mutable, raw GitHub URL across all runs that attempted it; additional test ruled out a blanket host block
Self-reported Completeness	`curl`-anchored: agents conflate `curl` body completeness with overall retrieval completeness even if artifacts display otherwise; `web` truncation consistently underreported in summary assessments

Results Details


Track	`T1` GPT-interpreted, Codex Desktop App
Agents Observed	`GPT-5.2`, `GPT-5.3-Codex`, `GPT-5.4-Mini`, `GPT-5.4`, `GPT-5.5`
Intelligence Levels	`Low`, `Medium`, `High`, `Extra High`
Total Runs	261
Distinct URLs	13
Input Size Range	`EC-3`: ~660 chars to `BL-3`: ~3.1 MB
Truncation Events	195 / 261 ~75% of agents report truncation in some form - `web`-only path with limits reported explicitly: 42 - `web`→`curl` path with `web` limits reported explicitly: 114 - `web`→`curl` path with `web` limits implied in reasoning: 39 - `curl`-only path and/or no truncation signal: 66
Average Output Size	351,961 chars
Output Size Range	95 - 3,103,342 chars
Average Token Use	88,489 tokens
Token Count Range	24 - 835,000 tokens
Workspace Substitution	2 / 261 runs confirmed, contamination risk flagged in ~40 additional runs
`curl` Escalation	Dominant retrieval path, present 69% of track ~180 / 261 runs
`web` Bypass	`GPT-5.5` at all intelligence levels skipped `web` completely on at least one URL
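
The run counts above can be checked with a few lines of arithmetic. The sketch below only restates numbers from the table; the variable names are illustrative.

```python
TOTAL_RUNS = 261

# Run counts per retrieval path, copied from the Truncation Events row.
truncation_paths = {
    "web_only_explicit": 42,
    "web_to_curl_explicit": 114,
    "web_to_curl_implied": 39,
}
no_truncation_signal = 66

truncated = sum(truncation_paths.values())
assert truncated == 195
assert truncated + no_truncation_signal == TOTAL_RUNS

truncation_rate = truncated / TOTAL_RUNS
curl_escalation_rate = 180 / TOTAL_RUNS  # ~180 runs per the `curl` Escalation row

print(f"truncation reported: {truncated}/{TOTAL_RUNS} = {truncation_rate:.1%}")
print(f"curl escalation: 180/{TOTAL_RUNS} = {curl_escalation_rate:.1%}")
```

The three truncation paths sum to 195, and adding the 66 runs with no truncation signal accounts for all 261 runs.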

Content Access x Intelligence

Agentic task completion isn’t a useful signal for page readability. For Codex, retrieval strategy largely influences content accessibility; its web tool returns a rendered text extraction window, but it’s up to the agent to use it, and most agents didn’t, at least not completely. Agents across this track most often started with web, recognized its limits, and pivoted to curl to complete the task, but curl returns a raw HTTP body whose readability depends entirely on that page’s architecture. For JS-rendered pages, curl delivers app shells with prose absent. Agents rarely distinguished between having fetched a URL and having read it.

The heat map below encodes retrieval strategy, not task outcome. Rows are reasoning/intelligence levels, with LLM version as a sub-grouping. Columns are URLs ordered by content accessibility difficulty, left to right: static payloads → large static HTML → JS-rendered and/or SPAs where curl returns mostly scaffolding.

While curl is an appropriate choice to calculate metrics for some URLs, a prompt with context-specific questions - summarize a section, locate a specific value in the documentation - may have produced a different signal. This track instead uncovers a proxy: agents that used web long enough to traverse page text completely performed something closer to reading prose, in that they accessed semantic context, while agents that pivoted to curl may have retrieved code they never processed as text.

The column grouping makes the practitioner-relevant question legible: agents working with pages in the left two groups had readable content to process regardless of toolchain. Agents working with pages in the right group - EC-1’s SPA extraction at ~10% of raw, BL-3’s JS-rendered tutorial body absent from every fetch, SC-2’s CSP-nonce-gated app shell - retrieved bytes but perhaps didn’t meaningfully read regardless of intelligence level or method. Depending on LLM-version, intelligence level, and page architecture, the curl-only cells sometimes represent the highest task effort with the lowest content accessibility.
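
The difference between fetching bytes and having readable prose can be approximated by comparing visible text to raw body size. The sketch below is one possible proxy and not the measurement used in this track: `visible_text_ratio` is a hypothetical helper built on the standard library's `html.parser`, and it makes no attempt to execute JavaScript.

```python
from html.parser import HTMLParser


class _VisibleText(HTMLParser):
    """Collect text nodes, skipping script/style/noscript/template content."""

    SKIP = {"script", "style", "noscript", "template"}

    def __init__(self):
        super().__init__()
        self._skip_depth = 0
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        if self._skip_depth == 0 and data.strip():
            self.chunks.append(data.strip())


def visible_text_ratio(html):
    """Return visible-text characters divided by raw body characters."""
    if not html:
        return 0.0
    parser = _VisibleText()
    parser.feed(html)
    parser.close()
    visible = " ".join(parser.chunks)
    return len(visible) / len(html)
```

A static page made mostly of paragraphs scores close to 1, while an app shell whose body is a `Loading...` placeholder plus script bundles scores close to 0, even though both fetches succeed.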

Truncation Analysis

#	Finding	Tests	Observed	Conclusion
1	`web` returns line-indexed rendered text extraction window, not full page	All tests	Returns a line-numbered, HTML-to-text-extracted viewport; `wordlim: 200` in output across `BL-1`, `OP-1`, `SC-3`, `SC-4`; `Total lines: N` reported for most URLs	Output chars on `web` path reflect viewport depth, not retrieval ceiling; `curl` only path to raw HTTP body
2	No fixed character or token ceiling detected on `curl` path	`BL-1` `BL-3` `OP-1` `OP-4` `SC-3`	`BL-3`:`GPT-5.2 Medium` largest valid fetch ~3.1M chars; `OP-4`:`GPT-5.5 Low` ~514,092 chars in 27 seconds with 8% context	Char/token constraint LLM-version-gated access, not architecturally defined
3	Three independent truncation layers disambiguated	`BL-1` `EC-6` `OP-4`	`BL-1`: `GPT-5.4 Low` first isolated all three: `web` window, terminal display cap, underlying HTTP body; `EC-6` confirmed ~12,970-token display cap independent of file size; `OP-4`:`GPT-5.4 Extra High` named all three layers	Self-reported truncation tool-dependent; agents frequently report “no truncation” for `curl` while `web` truncation noted in passing or implied
4	`curl` escalation capability LLM-version-gated, not intelligence-level-gated for newer versions	`BL-1` `BL-3` `OP-4` `SC-3`	`GPT-5.2` requires `High`+ for `curl`; `GPT-5.4` escalates at `Low`; `GPT-5.5` skips `web` entirely at all levels; within `GPT-5.4-Mini`, DNS sandbox failures suppressed escalation	`curl`-first behavior LLM-version property; capability threshold collapsed from `High` to `Low` between `GPT-5.2` and `GPT-5.4`
5	Higher intelligence levels don’t produce better retrieval outcomes, `Extra High` shows cost/yield regression	`BL-1` `EC-1` `OP-4` `SC-2`	`GPT-5.4-Mini Extra High` spent 85 seconds on a 3-part fetch strategy matching `Medium`’s single-fetch result; `EC-1`:`GPT-5.2 Extra High` looped ~48 minutes on 113 `web` calls without escalating; `OP-4`:`GPT-5.5 Low` retrieved 514 KB in 27 seconds vs `GPT-5.2 High` looping ~14 minutes at 45% context	Intelligence level governs tool sophistication, not task success; `Extra High` consistently produces diminishing returns against `web`-focused prompt
6	Session contamination persistent confound	`BL-1` `BL-2` `BL-3` `EC-1` `EC-6` `SC-2` `SC-4`	`Documents/Codex` persists across sessions; artifact filenames reused across runs confirmed in 20+ cases; `BL-2`:`GPT-5.5 High` likely read prior session artifact rather than fetching; `BL-1`:`GPT-5.4 Extra High` completed task in 42 seconds vs `Low`’s ~2 minutes due to reuse	Intelligence level not independent variable within shared sessions
7	JS-rendered pages produce a structural retrieval failure, not a truncation event	`BL-3` `SC-2`	`SC-2`: Next.js / Netlify - `web` returns a consistent 142-line pre-hydration shell; nonce-based CSP, `no-store` cache policy prevent JS execution on any path; `BL-3` tutorial body absent from static extraction at a reproducible structural position `L385-L389`	Neither `web` nor `curl` returns content for CSP-gated JS-rendered pages - fundamental retrieval barrier not addressable by escalation
8	`Cache Miss` is systematic for large, mutable payloads	`EC-6`	17 of 20 `web` runs on raw GitHub URL received `Cache Miss (no content retrieved)`; a smaller doc on `raw.githubusercontent.com` confirmed host isn’t fully blocked; no run investigated or diagnosed failure before pivoting to `curl`	Failure is URL-size-class-specific to raw GitHub payloads; agents report what succeeded, not what failed
9	`web` window LLM-version-correlated on same URL	`OP-2` `OP-4` `SC-3`	`OP-2`:`L317` dominant cutpoint for `GPT-5.2-5.4`; `L590` for `GPT-5.5`; `OP-4`:`L237` for `GPT-5.2-5.4`; `L616` for `GPT-5.5 Extra High`; `SC-3`:`L266` dominant for `GPT-5.2`/`5.4-Mini`; `L353` for `GPT-5.3-Codex`/`5.5`	Viewport window scales across LLM generations; same URL returns a larger first-fetch window in newer LLM versions
10	`wordlim: 200` soft default, not hard cap	`BL-1` `OP-1` `OP-4` `SC-3` `SC-4`	`SC-1`:`GPT-5.3-Codex Extra High` named `response_length` short vs long parameter distinction - short mode stopping ~`L362`, long mode ~`L478`; `BL-3`:`GPT-5.4 Extra High` re-issued `web` in “long response mode,” localized truncation boundary to `L385-L389`; `SC-3`: `GPT-5.4 Extra High` observed both `L266`, `L353` in a session by varying response length settings; `SC-4` shows two-stage `L316`→`L657` pattern consistent with narrow-then-wider window sequence	`wordlim: 200` pattern agent-dependent, not fixed infrastructure ceiling, not consistently named
11	`multi_tool_use.parallel` exclusive to `GPT-5.4 Extra High`-`GPT-5.5`	Most tests	Not observed in `GPT-5.2` or `GPT-5.3-Codex` at any intelligence level; first appeared in `GPT-5.4 Extra High`; consistent across all `GPT-5.5` levels	Parallel tool invocation is LLM-version capability, not an intelligence-level default
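
The three truncation layers from finding 3 can be kept apart by checking each one independently. The sketch below is illustrative only: `RunObservation` and its fields are hypothetical and not part of the logged schema, and it assumes `Total lines: N` and the last window line use the same indexing.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class RunObservation:
    """Hypothetical per-run fields; not the framework's logged schema."""

    web_last_line: Optional[int] = None    # last line index the web window showed
    web_total_lines: Optional[int] = None  # "Total lines: N" when reported
    display_truncated: bool = False        # terminal rendering cut the tool output
    curl_body: Optional[str] = None        # raw HTTP body when curl was used


def truncation_layers(obs: RunObservation) -> List[str]:
    """Name each layer that independently shortened what the agent saw."""
    layers = []
    if (
        obs.web_last_line is not None
        and obs.web_total_lines is not None
        and obs.web_last_line < obs.web_total_lines
    ):
        layers.append("web_window")
    if obs.display_truncated:
        layers.append("terminal_display")
    if obs.curl_body is not None and "Loading..." in obs.curl_body:
        layers.append("curl_placeholder")
    return layers
```

Recording the layers separately avoids the pattern noted above, where a complete `curl` body leads to a "no truncation" report even though the `web` window stopped early.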

Retrieval Outcomes

Output chars on the web surface aren’t a retrieval ceiling metric, but reflect how far the agent traversed through a line-indexed renderer. Agents wrote-saved a variety of artifacts unprompted in which curl body size was partially observable. Raw tracks are intended to document precise artifact measurements. Rows below are organized by page architecture:
raw files → static HTML → reference/wiki → JS-rendered/SPA

Test	Expected	Received	Content Accessibility	Agent Characterization
`EC-3` Redirect JSON	~2 KB	`web`: 660 chars `curl`: 254 bytes	100%	Complete: `web` pipeline likely pads response with wrapper text; `curl` returns raw body; neither represents truncation
`BL-2` Mixed HTML Markdown	~20 KB	`200`: 6,024 char count `400`: 95 char count	`200`: 100%	Complete, but misidentified: mixed format caused persistent false truncation reports across all LLMs; actual size consistently confirmed - 6,024 chars
`EC-6` Raw GitHub Markdown	~60 KB	`web`: `Cache Miss` `curl`: 91,869 char count	~100% body; display cap	No retrieval truncation: `web` `Cache Miss` error systematic while `curl`’s complete; display truncation at ~12,970 tokens is a terminal rendering cap, not a fetch limit
`SC-4` Markdown Guide	~30 KB	`web`: `L316/L657` `curl`: 64,527 char count	`curl`: 100% `web`: 50%	Complete via `curl`: `web` delivers a pageable line-indexed window; `L316/L657` cutpoints land mid-document at non-structural boundaries
`SC-1` Gemini API Docs	~40 KB	`web`: 18–33K char range `curl`: ~121.4K char count	`curl`: 100% `web`: 15-27%	Complete via `curl`: `web` `L362` short-mode ceiling confirmed; second fetch recovered through `L478`; truncation lands on page-content notice, not a structural boundary
`OP-2` MDN Docs	~120 KB	`web`: `L317/590` `curl`: 240,370 char count	`curl`: 100% `web`: 13-25%	Complete via `curl`: `web` line window LLM-version-correlated; both cutpoints land mid-sentence at non-structural positions
`BL-1` MongoDB Docs	~85 KB	`web`: 1.6–61K~ 19K–85K `curl`: 505,339 char count	`curl`: 100% `web`: ~0.3–17%	LLM-intelligence-tool-dependent: `GPT-5.2`-`5.3-Codex` lower range, `5.4-Mini` upper range; `web` truncated at extraction’s line boundary, suggesting tool ceiling as content beyond `L140/L477` not retrieved, cutpoint `L477` consistent across `5.2 Medium`, `5.3-Codex High`-`Extra High`; `5.4-5` use of `curl` returned full response body, disambiguated truncation layers
`OP-4` CommonMark Spec	~500 KB	`web`: `L237-616` `curl`: 514,092 char count	`curl`: 100% `web`: 2-3%	Complete via `curl`: `GPT-5.2-5.4` stopped ~`L237` while `5.5 Extra High` stopped ~`L616`; `GPT-5.2 High` looped 14m24s at 45% context; three truncation layers identified in a single run
`OP-1` Wikipedia with URL Fragment	~40 KB	`web`: `L305/552` `curl`: 693,475 char count	`curl`: 100% `web`: ~0.5-4%	Complete via `curl`: `#History` silently dropped by both tools; full article retrieved without targeted section; `web` consistent cutpoint `L552`, content accessibility calculated by token estimates
`SC-3` Wikipedia Table-Heavy	~100 KB	`web`: `L266/309/353` `curl`: 785,605 char count	`curl`: 100% `web`: 1-3%	Complete via `curl`: `web` window varies across LLM versions; `wordlim: 200` confirmed as soft default; three distinct cutoff points across 21 runs rules out an architecturally fixed ceiling
`EC-1` Gemini API Docs	~100 KB	`web`: 13K–13.4K char range `curl`: 132,894 char count	`curl`: 100% `web`: 10%	Extraction ratio gap: `web` consistently delivers 10% of HTML; `GPT-5.2 Extra High` called `web` 113 times for 48m10s, never pivoting to `curl`
`SC-2` Anthropic API Docs	~80 KB	`web`: `L142` `curl`: ~511K–519K char range	Not accessible	Incomplete HTML shell, prose absent: reference prose is JS-hydrated, CSP nonce-gating prevents JS execution on any fetch path; `curl` delivers navigation scaffolding and/or data bundles, not documentation; artifacts include `Loading...` placeholders
`BL-3` MongoDB Docs	~250 KB	`web`: `L453` `curl`: ~3.1 MB char estimate	Not accessible	Complete HTML shell, prose absent: tutorial walkthrough is client-side rendered and not represented in static payload regardless of fetch strategy; documentation body not examined in `web` `L385–L389` extraction
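
For rows that report character counts on both paths, the Content Accessibility percentages appear consistent with a simple `web`-to-`curl` ratio. The sketch below assumes that derivation; `accessibility_range` is a hypothetical helper, and it does not apply to rows where `web` is given as line cutpoints or where accessibility was calculated from token estimates, such as `OP-1`.

```python
def accessibility_range(web_low, web_high, curl_chars):
    """Return (low, high) web/curl character ratios as rounded percentages."""
    return (
        round(100 * web_low / curl_chars),
        round(100 * web_high / curl_chars),
    )


# SC-1: web 18–33K chars against a ~121.4K-char curl body.
print(accessibility_range(18_000, 33_000, 121_400))  # (15, 27)

# EC-1: web 13K–13.4K chars against a 132,894-char curl body.
print(accessibility_range(13_000, 13_400, 132_894))  # (10, 10)
```

Both results match the table: 15-27% for `SC-1` and 10% for `EC-1`.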
