Codex Framework Reference
This framework generates standardized test prompts and logs results to CSV, enabling consistent testing across cases, measurement tracking over time, truncation pattern identification, and retrieval behavior comparisons across four tracks:
Codex interpreted, VS Code-Codex interpreted, Codex raw, and VS Code-Codex raw.
Requirements: Python 3.8+, OpenAI Codex Desktop, and VS Code Codex Extension
Installation
# Clone and/or navigate to agent-ecosystem-testing directory
cd agent-ecosystem-testing
# Create virtual environment
python3 -m venv venv
# Activate virtual environment
# Windows: venv\Scripts\activate
source venv/bin/activate
# Install dependencies
pip install -r requirements.txt
# Navigate to Codex testing directory
cd open-ai-codex-web-search
If the Python version is incompatible or the environment is corrupted, run
`rm -rf venv` to remove the `venv` directory and start over.
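Before recreating the environment, it can help to confirm that the interpreter meets the stated Python 3.8+ requirement. The sketch below is one way to check; the helper name `meets_minimum` is illustrative and not part of the framework.

```python
import sys

MINIMUM = (3, 8)  # from the stated requirement: Python 3.8+


def meets_minimum(version=None, minimum=MINIMUM):
    """Return True when the (major, minor) version is at least `minimum`."""
    if version is None:
        version = sys.version_info
    return tuple(version[:2]) >= tuple(minimum)


if __name__ == "__main__":
    status = "ok" if meets_minimum() else "too old, recreate venv with a newer python3"
    print(f"Python {sys.version_info.major}.{sys.version_info.minor}: {status}")
```

Run it with the interpreter inside the activated `venv` so the check reflects the environment the scripts will use.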
Baseline Testing Path
- Run `T1` to establish a surface-isolated behavioral baseline
- Run `T2` to isolate the workspace effect against `T1`
- Run `T3` for ground truth retrieval measurements and to verify `T1`
- Run `T4` to isolate the surface effect on raw retrieval and to verify `T2`
- Run each test a minimum of 5x per track to capture variance
| Test IDs | Purpose | Key Question |
|---|---|---|
| `BL-1`, `BL-2` | Baseline truncation threshold on small pages | What’s the T1 vs T2 surface delta? |
| `SC-2` | API docs - code blocks | How does the web toolchain handle code structure? |
| `OP-1` | Fragment identifier navigation | Does Codex jump to a specific section via URL fragment? |
| `OP-2` | Midrange reference - headings, code blocks | Does behavior change at midrange size with structured content? |
| `OP-4` | Auto-chunking above `BL-3` ceiling | Does the agent fetch with multi-step tool chains? |
| `BL-3` | Hard ceiling | What’s the absolute output limit across retrieval runs? |
| `SC-1`, `SC-3`, `SC-4` | Structured content - API docs, table-heavy, nested headings | Does truncation respect Markdown boundaries? |
| `EC-1`, `EC-3`, `EC-6` | Edge cases - line-wrapping, JSON redirect, SPA | What are the failure modes and workspace substitution edge behaviors? |
Rename output files to capture variance; if results are consistent, remove files to prevent contamination
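The baseline path above amounts to a run matrix: every listed test, on every track, at least five times. A minimal sketch of building that plan follows. It only assembles the `framework.py` command strings shown in this reference (`--test`, `--track`) and does not execute them; `build_run_plan` is a hypothetical helper, not a framework script.

```python
from itertools import product

# Track names as passed to --track in this reference.
TRACKS = [
    "codex-interpreted",         # T1
    "vscode-codex-interpreted",  # T2
    "codex-raw",                 # T3
    "vscode-codex-raw",          # T4
]

# Test IDs in the order the baseline table lists them.
BASELINE_TESTS = [
    "BL-1", "BL-2", "SC-2", "OP-1", "OP-2", "OP-4", "BL-3",
    "SC-1", "SC-3", "SC-4", "EC-1", "EC-3", "EC-6",
]

MIN_RUNS = 5  # "a minimum of 5x per track"


def build_run_plan(tests=BASELINE_TESTS, tracks=TRACKS, runs=MIN_RUNS):
    """Return one (track, test_id, run_number, command) tuple per planned run."""
    plan = []
    for track, test_id in product(tracks, tests):
        command = f"python scripts/framework.py --test {test_id} --track {track}"
        for run_number in range(1, runs + 1):
            plan.append((track, test_id, run_number, command))
    return plan


if __name__ == "__main__":
    plan = build_run_plan()
    print(f"{len(plan)} runs planned")
    for track, test_id, run_number, command in plan[:5]:
        print(f"[{track}] {test_id} run {run_number}: {command}")
```

Iterating tracks in the outer loop keeps all T1 runs ahead of T2, T3, and T4, matching the order of the baseline path.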
Workflow
- List Available Tests and Tracks
    python scripts/framework.py --list-tests
    python scripts/framework.py --list-tracks
- Generate Test Prompt for a Single Test
    # T1: GPT-interpreted, Codex Desktop
    python scripts/framework.py --test BL-1 --track codex-interpreted
    # T2: GPT-interpreted, Codex Extension
    python scripts/framework.py --test BL-1 --track vscode-codex-interpreted
    # T3: Raw verbatim output, Codex Desktop
    python scripts/framework.py --test BL-1 --track codex-raw
    # T4: Raw verbatim output, Codex Extension
    python scripts/framework.py --test BL-1 --track vscode-codex-raw
- Copy Prompt → Run in Codex
  - Review the terminal output → copy the prompt
  - Open the Codex IDE or VS Code-Codex chat window → paste the prompt
  - Inspect retrieval behavior → examine agent output
- Assess Hypotheses Against Agent Output
  | ID | Description | Question |
  |---|---|---|
  | `H1` | Character-based truncation at fixed limit | Is there a ceiling at ~10–100 KB? |
  | `H2` | Token-based truncation | Is there a ceiling at ~2,000 tokens? |
  | `H3` | Structure-aware truncation | Does truncation fall on Markdown boundaries rather than arbitrary byte positions? |
  | `H4` | Surface impact on retrieval behavior | Does the Codex IDE versus VS Code-Codex surface produce different retrieval behavior? |
  | `H5` | Auto-chunking and/or pagination | Does the agent fetch with multi-step tool chains, or only when reasoned into it? |
- Examine, Log, Analyze
  - Examine Codex rollout logs, details in Rollout Observability
  - Log results with `log.py`, read Logging & Verification
  - Analyze results with `analyze.py`, visit Analysis
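One way to separate H1 from H2 when assessing truncated runs is to ask which measurement stays nearly constant across pages of different content: a fixed character ceiling keeps character counts clustered while token counts drift, and a token ceiling does the opposite. The sketch below illustrates that reasoning. The function names and the 2% tolerance are assumptions for illustration and are not taken from `analyze.py`.

```python
from statistics import mean, pstdev


def relative_spread(values):
    """Population standard deviation divided by the mean (0.0 means identical values)."""
    avg = mean(values)
    return pstdev(values) / avg if avg else 0.0


def suggest_ceiling_type(char_counts, token_counts, tolerance=0.02):
    """Suggest which measurement stays fixed across truncated runs.

    `tolerance` is an assumed cutoff (2% relative spread), not a framework value.
    """
    chars_fixed = relative_spread(char_counts) <= tolerance
    tokens_fixed = relative_spread(token_counts) <= tolerance
    if chars_fixed and not tokens_fixed:
        return "H1"  # character counts cluster, token counts vary
    if tokens_fixed and not chars_fixed:
        return "H2"  # token counts cluster, character counts vary
    if chars_fixed and tokens_fixed:
        return "ambiguous"  # both cluster; the pages may be too similar to separate
    return "neither"
```

An "ambiguous" result usually means the test pages have a similar characters-per-token ratio, so a page with denser code or tables is needed to tell the two hypotheses apart.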
Rollout Observability
Examine
`~/.codex/sessions/rollouts` logs for session structure and anomalies. Point scripts at `results/{track}/artifacts/rollouts` for parsing.
results/vscode-codex-interpreted/artifacts/rollouts/SC-2/rollout-2026-06-11T14-08-50-....jsonl
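The artifact layout encodes the track and test ID in the path, so rollouts can be grouped without opening them. The following sketch assumes the `results/{track}/artifacts/rollouts/{test_id}/{file}` layout shown above and POSIX-style separators; `parse_rollout_path` and `group_by_test` are hypothetical helpers rather than framework functions.

```python
from collections import defaultdict
from pathlib import PurePosixPath


def parse_rollout_path(path):
    """Split results/{track}/artifacts/rollouts/{test_id}/{file} into its parts.

    Returns (track, test_id, filename), or None when the layout does not match.
    """
    parts = PurePosixPath(path).parts
    for i, part in enumerate(parts):
        if (
            part == "results"
            and len(parts) >= i + 6
            and parts[i + 2] == "artifacts"
            and parts[i + 3] == "rollouts"
        ):
            return parts[i + 1], parts[i + 4], parts[i + 5]
    return None


def group_by_test(paths):
    """Group rollout paths by (track, test_id), skipping paths that do not match."""
    grouped = defaultdict(list)
    for path in paths:
        parsed = parse_rollout_path(path)
        if parsed is not None:
            track, test_id, _ = parsed
            grouped[(track, test_id)].append(path)
    return dict(grouped)
```

Feeding it the output of `glob.glob("results/*/artifacts/rollouts/*/*.jsonl")` gives one list of rollout files per track and test.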
Session Overview
session_reader.py produces a structured report from one or more rollouts including
session metadata, model, sandbox policy, skills, token usage, tool calls, reasoning
presence, and the conversation.
# Text report to stdout
python scripts/session_reader.py results/vscode-codex-interpreted/artifacts/rollouts/SC-2/rollout-*.jsonl
# HTML report
python scripts/session_reader.py results/vscode-codex-interpreted/artifacts/rollouts/SC-2/rollout-*.jsonl -o report.html
# List sessions and filter by ID
python scripts/session_reader.py results/vscode-codex-interpreted/artifacts/rollouts/SC-2/rollout-*.jsonl --list-sessions
python scripts/session_reader.py results/vscode-codex-interpreted/artifacts/rollouts/SC-2/rollout-*.jsonl --session-id <id>
Rollout Audit
rollout_audit.py checks logs for duplicate emissions, timing drift, mismatches between the live event
stream and the transcript, post-completion records, and tool-call counts. The script exits with a nonzero
status if it finds any anomaly. For each session, the audit reports:
| Category | Reported |
|---|---|
| Identity | Session id, LLM, reasoning effort, CLI version, test prompt ID if present |
| Emission Counts | User messages, commentary updates, final answers, reasoning blocks |
| API Call Counts | web_search calls, function/tool calls, by tool name |
| Duplicate Detection | Any final answer generated more than once; whether the event_msg, response_item, and task_complete.last_agent_message copies match |
| Post-completion Records | Anything appended after the last task_complete |
| Timing | Duration, time to first token, wall-clock time between first and last record |
| Token Usage | From final token_count event |
# Audit a test's rollouts
python scripts/rollout_audit.py results/vscode-codex-interpreted/artifacts/rollouts/SC-2/rollout-*.jsonl
# Audit all rollouts for a track, write a CSV
python scripts/rollout_audit.py results/vscode-codex-interpreted/artifacts/rollouts/*/*.jsonl --csv audit.csv
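Two of the audit categories reduce to simple sequence checks: anything after the last `task_complete` is a post-completion record, and any final answer text seen more than once is a duplicate. The sketch below shows those checks on already-extracted values. It is not `rollout_audit.py`'s implementation; how record kinds and final answers are pulled out of a rollout is left out, and the function names are hypothetical.

```python
from collections import Counter


def records_after_last_completion(kinds, completion_kind="task_complete"):
    """Return the record kinds that appear after the last completion marker."""
    last = None
    for index, kind in enumerate(kinds):
        if kind == completion_kind:
            last = index
    if last is None:
        return []  # no completion marker: nothing counts as post-completion
    return list(kinds[last + 1:])


def duplicated_answers(final_answers):
    """Return each final answer text emitted more than once, with its count."""
    counts = Counter(answer.strip() for answer in final_answers)
    return {text: n for text, n in counts.items() if n > 1}


def audit_exit_code(kinds, final_answers):
    """Follow the 'any anomaly exits nonzero' rule: 0 when clean, 1 otherwise."""
    anomalies = records_after_last_completion(kinds) or duplicated_answers(final_answers)
    return 1 if anomalies else 0
```

Whitespace is stripped before comparing answers so that a trailing newline alone does not hide a duplicate; a stricter audit could compare the raw strings instead.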
Rollout Decode
rollout_decode.py converts logs into three readable views:
- `--timeline`, default: chronological summary of events, tool calls, and messages
- `--census`: record and payload type inventory with field frequencies
- `--pretty`: full indented JSON of every record, with encrypted reasoning blobs elided
`timeline` output distinguishes UI-facing events (`AGENT`, `WEB`, `SHELL`) from the LLM-facing transcript copies (`AGENT*`, `WEB*`, `FINAL*`). `THINK` blocks are encrypted and unreadable; `TOKENS` rows are cumulative session usage checkpoints.
# Timeline for a test
python scripts/rollout_decode.py results/vscode-codex-interpreted/artifacts/rollouts/SC-2/rollout-*.jsonl --timeline
# Census: what record and payload types exist in logs
python scripts/rollout_decode.py results/vscode-codex-interpreted/artifacts/rollouts/SC-2/rollout-*.jsonl --census
# Pretty-print only web_search_call records
python scripts/rollout_decode.py results/vscode-codex-interpreted/artifacts/rollouts/SC-2/rollout-*.jsonl --pretty --grep web_search_call
# Write timeline to a Markdown file
python scripts/rollout_decode.py results/vscode-codex-interpreted/artifacts/rollouts/SC-2/rollout-*.jsonl --timeline --md decoded.md
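For a sense of what a census involves, the sketch below counts record and payload types across JSONL lines. It assumes each record is a JSON object with a top-level `type` key and an optional `payload` object carrying its own `type`; that schema is an assumption about the rollout format, and `census` here is a hypothetical function, not the `--census` implementation.

```python
import json
from collections import Counter


def census(lines):
    """Count (record type, payload type) pairs across JSONL lines.

    Assumes a top-level "type" key and an optional "payload" object
    that may carry its own "type"; both are assumptions about the schema.
    """
    counts = Counter()
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        try:
            record = json.loads(line)
        except json.JSONDecodeError:
            counts[("<unparseable>", None)] += 1
            continue
        if not isinstance(record, dict):
            counts[("<non-object>", None)] += 1
            continue
        payload = record.get("payload")
        payload_type = payload.get("type") if isinstance(payload, dict) else None
        counts[(record.get("type"), payload_type)] += 1
    return counts
```

Because it accepts any iterable of lines, an open file handle can be passed directly, for example `census(open(path, encoding="utf-8"))`.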
Logging
Run the interactive logger and follow the prompts. Fields are grouped by track:
session fields first, then track-specific output fields, then hypothesis and notes.
Quotation marks are not necessary; press Enter to skip optional fields:
# Call logger
python scripts/log.py
# Logger prompts for and validates fields before writing
✓ Result logged to results/codex-{track}/results.csv
Verify key metrics before logging raw track runs with
`python scripts/verify.py {test_id}`. When logging Track 2 results, pull the matching Track 1 record with `python scripts/query.py --test {test_id} --models {model}`.
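Pulling a matching Track 1 record can be pictured as a filter over the results CSV. The sketch below uses the `test_id`, `model_observed`, and `track` columns from the Framework Fields table; mapping the `--models` argument to `model_observed` is an assumption, and `matching_records` is a hypothetical helper, not what `query.py` does internally.

```python
import csv
import io


def matching_records(csv_text, test_id, model, track="t1_codex_interpreted"):
    """Return rows whose test_id, model_observed, and track all match."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return [
        row
        for row in reader
        if row.get("test_id") == test_id
        and row.get("model_observed") == model
        and row.get("track") == track
    ]
```

Taking the CSV as text keeps the function easy to try out; in practice the text would come from reading a track's `results.csv` file.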
Framework Fields
| Column | Description | Example |
|---|---|---|
| `test_id` | Test identifier | BL-1, SC-2, EC-1 |
| `timestamp` | ISO 8601 format | 2026-03-16T17:05:02.998376 |
| `date` | Date tested | 2026-03-16 |
| `url` | Full URL tested | https://www.mongodb.com/docs... |
| `track` | Test track | t1_codex_interpreted, t3_codex_raw |
| `surface` | Deployment surface | codex, vscode_codex |
| `method` | Retrieval method | gpt-interpreted, raw |
| `workspace_present` | Workspace available to agent? | true/false |
| `permission_level` | Agent permission setting | default, auto-review, full-access |
| `model_observed` | LLM reported in output | GPT-5.5 |
| `model_intelligence_level` | LLM intelligence setting | low, medium, high, extra high |
| `input_est_chars` | Expected input size in characters | 87040 |
| `hypothesis_match` | Hypothesis success/failure | H1-no, H2-yes, H4-untested |
| `codex_version` | Codex version string | 1.0.0 |
| `notes` | Observations | web tool invoked |
| `tools_named` | Tool names reported in agent output | web, web.open, curl |
| `workspace_substitution` | Local execution instead of web fetch? | yes/no/unknown |
| `output_chars` | T1/T2: agent-measured output length | 27890 |
| `truncated` | T1/T2: truncation status | yes/no/mixed/implicit |
| `truncation_note` | T1/T2: location, layer, or characterization | web.open partial, curl complete |
| `tokens_est` | T1/T2: estimated token count | 16890 |
| `tools_used`* | T3/T4: observed tool chain | web -> web.open |
| `tools_blocked`* | T3/T4: tools requested, but skipped | curl |
| `execution_attempts`* | T3/T4: total tool calls, fallbacks | 3 |
| `escalation_trigger`* | T3/T4: what drove tool escalation | automatic, contaminated, none, reasoned |
| `artifact_path`* | T3/T4: path of agent-written file | /private/tmp/bl1_response.html |
| `artifact_size_bytes`* | T3/T4: agent-written file size | 505339 |
| `last_50_chars`* | T3/T4: retrieved content verbatim; cross-reference via verify.py | `])</script></body></html>` |
| `agent_reported_output_chars`* | T3/T4: agent-measured char count | 9876 |
| `agent_reported_truncated`* | T3/T4: agent-measured truncation status | yes/no/mixed/implicit |
| `agent_reported_truncation_note`* | T3/T4: agent-reported location, layer, or characterization | curl complete, web.open partial at L477 |
| `agent_reported_tokens_est`* | T3/T4: agent-estimated token count | 2469 |
| `agent_reported_file_size_bytes`* | T3/T4: agent-measured file size | 4817 |
| `agent_reported_md5_checksum`* | T3/T4: agent-measured MD5 | abc123... |
| `agent_reported_lines`* | T3/T4: agent-measured line count | 143 |
| `agent_reported_words`* | T3/T4: agent-measured word count | 564 |
| `agent_reported_code_blocks`* | T3/T4: agent-measured code block count | 2 |
| `agent_reported_table_rows`* | T3/T4: agent-measured table row count | 57 |
| `agent_reported_headers`* | T3/T4: agent-measured header count | 4 |
| `verified_file_size_bytes`* | T3/T4: verifier-measured file size | 4817 |
| `verified_md5_checksum`* | T3/T4: verifier-measured MD5 | d6ad8451d3778bf3544574... |
| `verified_total_lines`* | T3/T4: verifier-measured line count | 143 |
| `verified_total_words`* | T3/T4: verifier-measured word count | 564 |
| `verified_tokens`* | T3/T4: verifier-measured token count | 197 |
| `verified_chars_per_token`* | T3/T4: verifier-measured chars/token ratio | 4.43 |
| `verified_code_blocks`* | T3/T4: verifier-measured code block count | 2 |
| `verified_table_rows`* | T3/T4: verifier-measured table row count | 57 |
| `verified_headers`* | T3/T4: verifier-measured header count | 4 |
*Optional field, raw tracks only.
`agent_reported*` fields reflect tool output or payload estimates. `verify.py` calculates `verified*` values against `raw_output_{test_id}.txt` files.
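To make the verifier-measured fields concrete, the sketch below computes several of them from raw bytes using only the standard library. The counting rules (fence lines open and close code blocks, lines starting with `|` are table rows, lines starting with `#` outside fences are headers) are assumptions for illustration; `verify.py` may count differently, and token counts are omitted because they depend on a tokenizer.

```python
import hashlib

FENCE = "`" * 3  # Markdown code fence marker


def measure(raw_bytes):
    """Compute size, checksum, and simple structure counts for retrieved content."""
    text = raw_bytes.decode("utf-8", errors="replace")
    lines = text.splitlines()
    in_fence = False
    code_blocks = table_rows = headers = 0
    for line in lines:
        stripped = line.lstrip()
        if stripped.startswith(FENCE):
            if not in_fence:
                code_blocks += 1  # count each opening fence once
            in_fence = not in_fence
            continue
        if in_fence:
            continue  # ignore '#' and '|' inside code blocks
        if stripped.startswith("|"):
            table_rows += 1  # includes the |---| separator row
        elif stripped.startswith("#"):
            headers += 1
    return {
        "verified_file_size_bytes": len(raw_bytes),
        "verified_md5_checksum": hashlib.md5(raw_bytes).hexdigest(),
        "verified_total_lines": len(lines),
        "verified_total_words": len(text.split()),
        "verified_code_blocks": code_blocks,
        "verified_table_rows": table_rows,
        "verified_headers": headers,
        "last_50_chars": text[-50:],
    }
```

Hashing the raw bytes rather than the decoded text keeps the checksum comparable with what a command-line MD5 tool reports for the same file.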
Analysis
Examine hypothesis matching, surface-workspace effects, perception gap, and truncation analysis:
# Single track full analysis or summary
python scripts/analyze.py --csv results/codex-interpreted/results.csv --summary
python scripts/analyze.py --csv results/codex-raw/results.csv --full
# Filter by track
python scripts/analyze.py --csv results/codex-interpreted/results.csv --track t1_codex_interpreted
# Compare interpreted tracks T1 vs T2
python scripts/analyze.py \
--csv results/codex-interpreted/results.csv \
results/vscode-codex-interpreted/results.csv --full
# Compare raw tracks T3 vs T4
python scripts/analyze.py \
--csv results/codex-raw/results.csv \
results/vscode-codex-raw/results.csv --full
# Compare all tracks
python scripts/analyze.py \
--csv results/codex-interpreted/results.csv \
results/vscode-codex-interpreted/results.csv \
results/codex-raw/results.csv \
results/vscode-codex-raw/results.csv --full
Provide the full relative path, including the subdirectory, for example
`results/codex-interpreted/results.csv`.
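As a rough picture of a T1 vs T2 comparison, the sketch below averages `output_chars` per `test_id` in two result files and reports the difference. It relies only on the `test_id` and `output_chars` columns from the Framework Fields table; the function names are hypothetical and this is not how `analyze.py` is known to compute surface effects.

```python
import csv
import io
from collections import defaultdict
from statistics import mean


def mean_output_chars(csv_text):
    """Map test_id -> mean output_chars, skipping blank or non-numeric cells."""
    values_by_test = defaultdict(list)
    for row in csv.DictReader(io.StringIO(csv_text)):
        value = (row.get("output_chars") or "").strip()
        test_id = row.get("test_id")
        if test_id and value.isdigit():
            values_by_test[test_id].append(int(value))
    return {test_id: mean(values) for test_id, values in values_by_test.items()}


def surface_delta(t1_csv_text, t2_csv_text):
    """Return test_id -> (T2 mean - T1 mean) for tests present in both files."""
    t1 = mean_output_chars(t1_csv_text)
    t2 = mean_output_chars(t2_csv_text)
    shared = sorted(t1.keys() & t2.keys())
    return {test_id: t2[test_id] - t1[test_id] for test_id in shared}
```

Tests that appear in only one file are dropped rather than reported as zero, so a missing run is not mistaken for an absent surface effect.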