Friction Note: Roadblocks While Refining Methodology
LLM × Intelligence Matrix
Codex exposes a two-dimensional agent configuration space unique among the platforms tested: five LLM variants
(GPT-5.2, GPT-5.3-Codex, GPT-5.4-mini, GPT-5.4, and GPT-5.5), each available at four intelligence levels (Low,
Medium, High, and Extra High). Full coverage of the matrix produces 20 runs per test ID, compared to this
collection’s standard five.
The combinatorial cost is real, but collapsing the matrix introduces a different problem. Intelligence
level isn’t a passive configuration knob; it materially changes retrieval strategy, tool selection, runtime, and in some
cases output quality. GPT-5.2 required High intelligence to escalate to curl, while GPT-5.4 did so at Low.
GPT-5.4-mini Extra High spent 85 seconds on a three-part fetch strategy that produced the same yield as a 24-second
single fetch at Medium. Sampling one or two levels per LLM would have missed these divergences entirely.
Codex’s documentation offers a relevant caution about
intelligence levels, stated for GPT-5.5 but applicable generally:
“Higher reasoning effort isn’t automatically better. If the task has conflicting instructions, weak stopping criteria, or open-ended tool access, higher effort can lead to overthinking, unnecessary searching, or output quality regressions. Increase effort only when evals show a measurable quality gain.”
BL-1 data confirms this empirically. Extra High produced cost/yield regressions in both GPT-5.4-mini and
GPT-5.3-Codex: more tool calls, longer runtimes, and identical or lower output quality compared to Medium or High.
The retrieval task has weak stopping criteria by design: the prompt asks for measurements, not a specific content target.
The web tool provides open-ended access with no built-in completion signal, inviting exactly the overthinking the documentation warns about.
Methodology Decision
Log all LLM × intelligence level combinations as distinct rows; the matrix is the unit of observation for Codex testing. Where session contamination is confirmed or suspected, flag affected rows rather than dropping them. The contaminated behavior is itself a finding about how Codex manages context across runs.
Where full matrix coverage is impractical for a given test ID, prioritize Low and High per LLM as the most informative
contrast pair: Low reflects default or minimal reasoning behavior, while High captures the escalation threshold without the
Extra High overthinking regression. Medium and Extra High add resolution but rarely change the verdict.
Mixed-Format Source Misidentification as a Tool Selection Driver
BL-2’s URL leads to a mixed-format file with Markdown text and HTML tags. This pattern was previously observed in
Cascade-interpreted track testing,
where it produced reporting errors as agents flagged format anomalies in their completeness assessments.
Codex’s response to BL-2 uncovered that misidentification didn’t just corrupt the report: it actively drove
tool selection, with measurable cost consequences.
The clearest instance was GPT-5.4-mini Extra High, which attempted Browser use after determining the
content was “buried inside a large HTML document.” The agent read the embedded HTML table tags as evidence
that it needed a browser rendering pass to extract the real content, which led to net::ERR_BLOCKED_BY_CLIENT.
The run then fell back to curl, which retrieved the same 6,024-char plain-text Markdown body that most runs
returned, in under a minute, at a fraction of the cost. The Browser use attempt consumed 63K context tokens.
The misidentification added no retrieval value and introduced a tool failure that didn’t need to happen.
A subtler version appeared in GPT-5.4 Low, which reported truncation while simultaneously confirming clean
code fence closure and the correct character count. The sole evidence for truncation was the ~20 KB size expectation
vs the 6,024-char actual. That expectation was itself inflated by the mixed format: an agent encountering HTML
table markup inside a .md file may model the source as a rendered page with nav chrome rather than a compact raw
document, producing a larger prior on document size and a lower threshold for declaring the retrieval incomplete.
Across runs, agents flagged the ce-create## Summary heading artifact and the embedded HTML table as toolchain
corruption, parsing failure, or CMS injection. No agent identified these as stable source properties. Without access
to the raw source for cross-reference, the misidentification isn’t recoverable from agent output alone.
Methodology Decision
Cross-reference agent truncation and formatting assessments against the known source structure before logging. A false positive truncation report driven by format mismatch is a distinct finding from a true retrieval ceiling. Where misidentification produces tool escalation, not just a bad report, log the escalation path and its context cost as a direct consequence of the source format property.
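A sketch of what that cross-reference could look like in a verifier, assuming a hypothetical per-run report dict; the 6,024-char count and the source quirks are taken from the BL-2 observations above:

```python
# Known, verified properties of the BL-2 source (from direct inspection).
KNOWN_CHARS = 6024  # plain-text Markdown body
STABLE_QUIRKS = ("ce-create## Summary", "<table>")  # real source properties

def classify_report(report: dict) -> list[str]:
    """Label an agent's self-assessment before logging it.

    `report` is a hypothetical per-run dict with `reported_chars`,
    `claims_truncation`, and `flagged_anomalies` fields.
    """
    labels = []
    if report["claims_truncation"] and report["reported_chars"] == KNOWN_CHARS:
        # Full body retrieved; the claim is format-driven, not a ceiling.
        labels.append("false-positive truncation")
    for anomaly in report["flagged_anomalies"]:
        if any(quirk in anomaly for quirk in STABLE_QUIRKS):
            labels.append(f"stable source property misread: {anomaly}")
    return labels

# GPT-5.4 Low pattern: correct count, truncation claimed anyway.
print(classify_report({"reported_chars": 6024, "claims_truncation": True,
                       "flagged_anomalies": ["ce-create## Summary injection"]}))
```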
Session Contamination
Running each intelligence level for a given LLM sequentially in the same Codex session in BL-1 introduced a contamination vector.
Later runs could read artifacts written by earlier runs, observe prior tool outputs in context, and carry forward retrieval
strategies without re-deriving them. Across GPT-5.4 and GPT-5.5 runs, three signals co-occurred:
- Explicit language referring to prior runs: “I’m running the direct fetch again” and “I’ll run a fresh direct fetch for this BL-1 pass”, phrasing that only makes sense if the agent knows it has run before.
- Anomalous runtimes: GPT-5.5 High completed in 20 seconds, including a curl fetch of a 505 KB file; GPT-5.4 Extra High completed in 42 seconds on the same task that took GPT-5.4 Low 1 minute and 46 seconds.
- Increasing context window usage across levels within the same session: GPT-5.5 consumed 35K → 36K → 38K → 40K tokens across Low through Extra High, consistent with accumulated session state rather than independent runs (a mechanical check is sketched after this list).
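A minimal version of that third-signal check, over the per-level context figures reported above:

```python
def monotonic_growth(tokens_by_level: list[int]) -> bool:
    """True if context usage only grows across levels within one session,
    consistent with accumulated state rather than independent runs."""
    return all(a <= b for a, b in zip(tokens_by_level, tokens_by_level[1:]))

# GPT-5.5, Low through Extra High in one session, per the figures above.
assert monotonic_growth([35_000, 36_000, 38_000, 40_000])
```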
This rules out treating intelligence level as an independent variable within shared sessions: efficiency
gains at higher levels may reflect strategy reuse rather than superior reasoning. The convergence observed across all GPT-5.4
levels - identical character counts, token counts, tools, and last-50 characters - is consistent with both genuine LLM stability
and session memory flattening real variance. The data alone can’t distinguish these within the session.
BL-2 results suggested wider contamination: session folders were created on the same date, with artifact files present in
non-sequential sessions (web-2, web-3, web-4, web-7, web-10, web-12, web-13) and empty folders for
web-5, web-6, web-8, web-9, and web-11. The gap pattern doesn’t correspond to intelligence level order, ruling
out sequential contamination as the sole mechanism. Run 14 also reported a workspace path from session i-m-testing-codex-s-web-11
during what should have been a fresh -web-14 session.
Methodology Decision
Run each intelligence level in a fresh Codex session. Where session isolation is impractical, run levels in ascending order so
that at least the Low run is uncontaminated, and flag all subsequent runs in the same session with a contamination qualifier.
Don’t interpret runtime compression or strategy convergence at higher levels as evidence of capability without ruling out context
inheritance. Filenames written to the sandbox by earlier runs are a particularly reliable contamination signal: if a later run
references a file it didn’t create in its own tool call log, the session is likely contaminated.
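That filename signal is mechanical enough to script. A sketch, assuming a hypothetical per-run tool-call log with `op` and `path` fields:

```python
def foreign_artifacts(tool_log: list[dict]) -> set[str]:
    """Files a run read or referenced but never created itself.

    `tool_log` is a hypothetical ordered list of tool calls, each a dict
    like {"op": "write" | "read", "path": "..."}. Any path read before the
    same run writes it is a likely inheritance from an earlier session run.
    """
    created: set[str] = set()
    foreign: set[str] = set()
    for call in tool_log:
        if call["op"] == "write":
            created.add(call["path"])
        elif call["op"] == "read" and call["path"] not in created:
            foreign.add(call["path"])
    return foreign

# A later run reading an artifact it never wrote flags contamination.
log = [{"op": "read", "path": "/private/tmp/bl2_create.md"}]
assert foreign_artifacts(log) == {"/private/tmp/bl2_create.md"}
```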
Truncation Taxonomy
Some platforms presented truncation as a single phenomenon: the tool returned less than the page contained.
BL-1 runs revealed three distinct truncation layers that operate independently and require disambiguation
before any truncation assessment is logged:
| Layer | Mechanism | Agent-detectable? | Verification-detectable? |
|---|---|---|---|
| web.open Viewer Window | Line-indexed extraction returns a windowed excerpt, not the full page; the window may start at L39 or L216, not L0 | Yes, if the agent checks line count vs lines received | Indirectly, via output size vs expected |
| Terminal Display Truncation | Tool output printed inline is truncated by the Codex transcript interface; a “…116,434 tokens truncated…” notice appeared in some runs | Yes, notice visible in tool output | No, hidden tokens aren’t in any saved artifact |
| HTTP Response Body | Actual bytes received from the server via curl | Yes, via wc -c on the saved file | Yes, via verifier script against known size |
Early web.open-only runs conflated all three layers into a single truncation field, producing unreliable
self-reports. GPT-5.4 Low was the first run to cleanly separate all three - the web.open viewer window, the
terminal display truncation, and the actual HTTP body - and correctly identified the body as complete while
reporting truncation in the other layers.
The three-layer taxonomy has a practical implication for hypothesis assessment. The H1 and H2 character and token ceilings
are only testable against the HTTP response body layer. Assessments made against web.open output measure the viewer window,
not the retrieval ceiling. Runs that didn’t escalate to curl can’t contribute to H1 or H2 verdicts
with the same confidence as runs that did.
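The body-layer check the table’s last row refers to is straightforward to script. A sketch, assuming the ~505 KB body size reported in these runs as the known reference (a production verifier would pin the exact byte count from direct inspection):

```python
from pathlib import Path

# Assumed reference size: the ~505 KB body reported for BL-1's URL above.
KNOWN_BODY_BYTES = 505 * 1024

def verify_http_body(artifact: Path, tolerance: float = 0.01) -> bool:
    """Check a saved curl artifact against the known body size.

    This measures the HTTP response body layer only; it says nothing
    about what the web.open viewer window showed the agent.
    """
    actual = artifact.stat().st_size  # byte count, equivalent to `wc -c`
    return abs(actual - KNOWN_BODY_BYTES) / KNOWN_BODY_BYTES <= tolerance
```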
Methodology Decision
Treat web.open output and curl output as measurements of different artifacts within the truncation taxonomy, not as better or
worse versions of the same measurement. A web.open-only run documents default retrieval behavior for that LLM and intelligence
level. A curl-escalated run documents what the agent does when it reasons past the default. Both are valid observations. The
distinction is already recoverable from the “tools named” column without additional logging.
web.open: Line-Indexed Viewer, Not Raw Fetch
web.open doesn’t return a raw HTTP response body. It returns a line-indexed, rendered text extraction: a processed view of the page
with line numbers injected, HTML stripped, and a viewer window applied that doesn’t necessarily start at line 0. The distinction matters
for every measurement in the interpreted track:
- Character counts from web.open include injected line-number prefixes, inflating the count relative to the actual content (a correction is sketched after this list).
- The viewer window starts at an arbitrary line offset, observed at BL-1’s L39 and L216 in different runs, meaning web.open-only runs may return a mid-document slice with no signal that earlier content was skipped.
- Line count: agents consistently reported Total lines: 542, but that figure is a property of the extracted text representation, not the raw HTML.
- Truncation at L477 appeared across GPT-5.2 Medium, GPT-5.3-Codex High, and GPT-5.3-Codex Extra High. Whether this is a hardcoded viewer window limit, a pagination boundary, or a property of the document’s line structure at that point isn’t resolvable from interpreted track data alone; the raw track write task is the appropriate place to test this.
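A sketch of the prefix correction from the first bullet, assuming a line-number prefix of the form `L<number>: ` (the exact delimiter web.open renders is an assumption here):

```python
import re

# Assumed viewer prefix format, e.g. "L39: some content".
LINE_PREFIX = re.compile(r"^L\d+:\s?", re.MULTILINE)

def content_chars(viewer_output: str) -> int:
    """Character count with injected line-number prefixes removed.

    Counting raw web.open output overstates the content size by the
    total width of the injected prefixes.
    """
    return len(LINE_PREFIX.sub("", viewer_output))

sample = "L39: Heading\nL40: Body text"
assert content_chars(sample) == len("Heading\nBody text")
```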
GPT-5.4 Low offered the clearest documentation of this finding:
“web.open did not return the raw 505 KB page body. It returned a line-extracted, partially normalized page view (Total lines: 542) centered on readable content, while a direct terminal fetch returned the full HTML.”
This suggests that web.open-only runs may not be retrieving a truncated version of the page so much as a different artifact
entirely: a rendered text view optimized for readability rather than byte-faithful retrieval. The ~85 KB ceiling observed in
GPT-5.4-mini Medium/High/Extra High may reflect the approximate size of that readable content layer rather than an infrastructure retrieval
limit. Subsequent test cycles may determine whether this holds across other URLs and page types.
Workspace Artifact Nondeterminism
BL-2 agents produced artifacts unprompted and inconsistently. About half wrote files to the local workspace or /private/tmp.
Agentic naming was also unstable across sessions and LLM versions:
- GPT-5.2 Medium: BL-2_create.md.html
- GPT-5.2, GPT-5.3-Codex High: BL-2_create.md
- GPT-5.2 Extra High: bl-2-create.md
- GPT-5.4, GPT-5.5 High: bl2_create.md
- GPT-5.4 Extra High: bl2_mongodb_create.md
- GPT-5.4-mini Medium: mongo_create.md
- GPT-5.4 Extra High: bl2_headers.txt alongside the .md artifact
The format also shifted across runs: GPT-5.2 Medium saved an HTML extraction while subsequent runs saved Markdown.
GPT-5.4 Extra High uniquely saved a separate headers file alongside the content artifact. No run produced the same
filename as another run in the same LLM family without evidence of workspace contamination. Report formatting shifted as
well: agents produced reports in partial Markdown, some with syntax highlighting, most without.
Artifact presence in the chat output was equally inconsistent. Some runs identified the saved file as a clickable attachment
in the Codex response panel, but most didn’t, even when the shell log confirmed a successful write. The path disclosed in
surface awareness reports didn’t always match the session number - run 14 reported path i-m-testing-codex-s-web-11, which
should have been -web-14.
This nondeterminism makes artifact presence an unreliable signal for distinguishing fresh retrieval from workspace reads.
A run that skips web.open and goes directly to file operations may reflect a trained tool preference, session contamination,
or silent reuse of a prior artifact. Whatever the cause, these runs produced nearly identical report metrics and observations.
Methodology Decision
Check /private/tmp and the session workspace path at run start, before any fetch operations. A non-empty workspace at the start
of a purportedly fresh run is a contamination indicator and should be logged as such. Record artifact filename and format as a
contamination signal for subsequent runs in the same session. Don’t treat artifact absence as evidence that no retrieval occurred;
the unprompted write behavior is too inconsistent to use as a retrieval proxy. This is exactly the type of complexity the raw track
is intended to test.
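A pre-run version of that check, sketched with the paths named in this note; the session workspace path varies per session, so it’s a parameter:

```python
from pathlib import Path

def prerun_contamination_check(session_workspace: Path) -> list[str]:
    """List files already present in sandbox locations before any fetch.

    A non-empty result at the start of a purportedly fresh run is logged
    as a contamination indicator rather than silently ignored.
    """
    leftovers: list[str] = []
    for root in (Path("/private/tmp"), session_workspace):
        if root.exists():
            leftovers.extend(str(p) for p in root.rglob("*") if p.is_file())
    return leftovers

# Example: flag before the first tool call of a new session.
flagged = prerun_contamination_check(
    Path("/Users/rhyannonjoy/Documents/Codex/2026-05-09/i-m-testing-codex-s-web-2"))
```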
Workspace Sandbox Bleed
All BL-1 runs acknowledged access to a local workspace or filesystem. The disclosed path was consistent across sessions:
/Users/rhyannonjoy/Documents/Codex/2026-05-09/i-m-testing-codex-s-web-2.
Direct inspection confirmed the sandbox exists but is empty. Agents weren’t lying: the sandbox is a session-scoped environment with read/write capability. The prompt condition “no workspace” describes the absence of this test collection’s project files, not the absence of the sandbox itself. The gap is between the framework’s intent and the Codex environment’s actual configuration.

The bleed takes two forms with different implications:
Passive Bleed: agents report sandbox access but don’t use it. All web.open-only runs fall here. The disclosure is accurate
and doesn’t affect retrieval behavior.
Active Bleed: agents write artifacts to /private/tmp or the sandbox path during retrieval, then read those artifacts to
compute measurements. GPT-5.2 High and Extra High, GPT-5.4 across all levels, and GPT-5.5 across all levels used this
pattern. The artifact-then-measure approach produces more accurate character counts than reading from tool display output, but it
also means later runs in the same session may find artifacts from earlier runs in the sandbox, compounding
the session contamination problem.
A subtler form appeared in the GPT-5.5 runs: the agents executed no web or web.open calls at all, going directly to curl
and local file operations. The data alone can’t determine whether this reflects GPT-5.5’s trained preferences, session-inherited
strategy, or awareness of prior artifacts in the sandbox; the outcome - correct retrieval with terminal tools - looks the same
in all three cases.
Methodology Decision
Log workspace disclosure as a surface characteristic, not a test anomaly. Distinguish passive disclosure from active artifact creation
in the tool visibility field. For runs where measurements derive from sandbox artifacts rather than direct tool output, document
this in the notes column, as the measurement methodology differs from that of web.open-only runs and the two aren’t directly
comparable. For fresh-session
verification, check whether /private/tmp is empty at run start; a non-empty /private/tmp at the beginning of a purportedly fresh run is
a contamination indicator.