Codex Framework Reference

This framework generates standardized test prompts and logs results to CSV. It supports consistent testing across cases, measurement tracking over time, identification of truncation patterns, and comparison of retrieval behavior across four tracks: Codex IDE interpreted, VS Code-Codex interpreted, Codex IDE raw, and VS Code-Codex raw.
Requirements: Python 3.8+, OpenAI Codex, and the VS Code Codex extension.

Installation

# Clone and/or navigate to `agent-ecosystem-testing` directory
cd agent-ecosystem-testing

# Create virtual environment
python3 -m venv venv

# Activate virtual environment
# Windows: venv\Scripts\activate
source venv/bin/activate

# Install dependencies
pip install -r requirements.txt

# Navigate to the Codex testing directory
cd open-ai-codex-web-search

If the environment breaks for any reason, such as an incompatible Python version or accidental corruption, run rm -rf venv to remove the venv and start over.

Workflow

List Available Tests and Tracks

python framework.py --list-tests
python framework.py --list-tracks

Generate Test Prompt for a Single Test

Print a formatted test harness with a structured prompt to copy into the Codex chat window, fields requiring values, and expected size reference:

# T1: GPT-interpreted, Codex IDE
python framework.py --test BL-1 --track codex_interpreted

# T2: GPT-interpreted, VS Code-Codex
python framework.py --test BL-1 --track vscode_interpreted

# T3: Raw verbatim output, Codex IDE
python framework.py --test BL-1 --track codex_raw

# T4: Raw verbatim output, VS Code-Codex
python framework.py --test BL-1 --track vscode_raw

Copy Prompt → Run in Codex
- Review the terminal output → copy the prompt
- Open the Codex IDE or VS Code-Codex chat window → paste the prompt
- Inspect retrieval behavior → examine agent output

Assess Hypotheses

Before logging test results, assess the run against each hypothesis based on the agent’s self-reported metrics and tool visibility output:

ID	Description	Question
`H1`	Character-based truncation at fixed limit	Is there a ceiling at ~10–100 KB?
`H2`	Token-based truncation	Is there a ceiling at ~2,000 tokens?
`H3`	Structure-aware truncation	Does truncation fall on Markdown boundaries rather than arbitrary byte positions?
`H4`	Surface impact on retrieval behavior	Does the Codex IDE versus VS Code-Codex surface produce different retrieval behavior?
`H5`	Auto-chunking and/or pagination	Does the agent fetch with multi-step tool chains, or only when reasoned into it?
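
As a rough aid for the H1 and H2 questions above, the sketch below compares a run's measured output length with the two ceilings from the table. It is not part of framework.py: the function name, the 4-characters-per-token ratio, the 10% tolerance, the treatment of one character as one byte, and the choice to label non-truncated runs as untested are all assumptions made for illustration.

```python
# Hypothetical helper, not part of the framework: compare a run's measured
# output length with the H1 and H2 ceilings listed in the table above.

H1_RANGE_BYTES = (10 * 1024, 100 * 1024)  # H1: ceiling at ~10-100 KB
H2_TOKEN_CEILING = 2000                   # H2: ceiling at ~2,000 tokens
CHARS_PER_TOKEN = 4.0                     # assumed average; varies by content


def assess_ceilings(output_chars, truncated, tolerance=0.10):
    """Return hypothesis_match-style labels for H1 and H2."""
    if not truncated:
        # Assumption: a run without truncation cannot confirm a ceiling.
        return ["H1-untested", "H2-untested"]

    est_tokens = output_chars / CHARS_PER_TOKEN
    # Assumption: one character is treated as one byte (mostly-ASCII pages).
    h1 = H1_RANGE_BYTES[0] <= output_chars <= H1_RANGE_BYTES[1]
    # H2 matches when the estimate lands within the tolerance of the ceiling.
    h2 = abs(est_tokens - H2_TOKEN_CEILING) <= tolerance * H2_TOKEN_CEILING
    return ["H1-yes" if h1 else "H1-no", "H2-yes" if h2 else "H2-no"]


print(assess_ceilings(27890, truncated=True))  # ['H1-yes', 'H2-no']
```

The labels are only a starting point; H3 to H5 still need a manual read of the agent output.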

Log Results

Run the interactive logger and follow the prompts. Fields are grouped by track: session fields first, then track-specific output fields, then hypothesis and notes. Quotation marks are not necessary; press Enter to skip optional fields:

# Call the logger
python log.py

# Logger prompts and validates fields before writing
✓ Result logged to results/codex-{track}/results.csv

Verify key metrics before logging raw track runs:

# Codex IDE surface
python verify.py BL-1 --surface codex

# VS Code-Codex surface
python verify.py BL-1 --surface vscode

Framework fields logged per track:

Column	Description	Example
`test_id`	Test identifier	`BL-1`, `SC-2`, `EC-1`
`timestamp`	`ISO 8601` format	`2026-03-16T17:05:02.998376`
`date`	Date tested	`2026-03-16`
`url`	Full URL tested	`https://www.mongodb.com/docs...`
`track`	Test track	`t1_codex_interpreted`, `t3_codex_raw`
`surface`	Deployment surface	`codex`, `vscode_codex`
`method`	Retrieval method	`gpt-interpreted`, `raw`
`workspace_present`	Workspace available to agent?	`true`/`false`
`permission_level`	Agent permission setting	`default`, `auto-review`, `full-access`
`model_observed`	LLM reported in output	`GPT-5.5`
`model_intelligence_level`	LLM intelligence setting	`low`, `medium`, `high`, `extra high`
`input_est_chars`	Expected input size in characters	`87040`
`hypothesis_match`	Hypothesis success/failure	`H1-no`, `H2-yes`, `H4-untested`
`codex_version`	Codex version string	`1.0.0`
`notes`	Observations	`web tool invoked`
`tools_named`	Tool names reported in agent output	`web`, `web.open`, `curl`
`workspace_substitution`	Local execution instead of web fetch?	`yes`/`no`/`unknown`
`output_chars`	`T1`/`T2`: agent-measured output length	`27890`
`truncated`	`T1`/`T2`: truncation detected	`yes`/`no`
`truncation_point`	`T1`/`T2`: section/line truncation spot	`L477`
`tokens_est`	`T1`/`T2`: estimated token count	`16890`
`tools_used`*	`T3`/`T4`: observed tool chain	`web -> web.open`
`tools_blocked`*	`T3`/`T4`: tools requested, but skipped	`curl`
`execution_attempts`*	`T3`/`T4`: total tool calls, fallbacks	`3`
`agent_reported_output_chars`*	`T3`/`T4`: agent-measured char count	`9876`
`agent_reported_truncated`*	`T3`/`T4`: agent-measured truncation	`yes`/`no`
`agent_reported_tokens_est`*	`T3`/`T4`: agent-estimated token count	`2469`
`agent_reported_file_size_bytes`*	`T3`/`T4`: agent-measured file size: bytes	`4817`
`agent_reported_md5_checksum`*	`T3`/`T4`: agent-measured MD5	`abc123...`
`agent_reported_lines`*	`T3`/`T4`: agent-measured line count	`143`
`agent_reported_words`*	`T3`/`T4`: agent-measured word count	`564`
`agent_reported_code_blocks`*	`T3`/`T4`: agent-measured code block count	`2`
`agent_reported_table_rows`*	`T3`/`T4`: agent-measured table row count	`57`
`agent_reported_headers`*	`T3`/`T4`: agent-measured header count	`4`
`verified_file_size_bytes`*	`T3`/`T4`: verifier-measured file size: bytes	`4817`
`verified_md5_checksum`*	`T3`/`T4`: verifier-measured MD5	`d6ad8451d3778bf3544574...`
`verified_total_lines`*	`T3`/`T4`: verifier-measured line count	`143`
`verified_total_words`*	`T3`/`T4`: verifier-measured word count	`564`
`verified_tokens`*	`T3`/`T4`: verifier-measured token count	`197`
`verified_chars_per_token`*	`T3`/`T4`: verifier-measured chars/token ratio	`4.43`
`verified_code_blocks`*	`T3`/`T4`: verifier-measured code block count	`2`
`verified_table_rows`*	`T3`/`T4`: verifier-measured table row count	`57`
`verified_headers`*	`T3`/`T4`: verifier-measured header count	`4`

*Optional field, raw tracks only. agent_reported_* fields may reflect tool output or payload estimates; verify.py calculates verified_* values against raw_output_{test_id}.txt files.
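
To make the verified_* columns more concrete, here is a minimal sketch of how such metrics could be computed from the bytes of a raw output file. It is not the actual verify.py, which may count differently. The function name and the counting rules (fenced blocks only, every pipe-prefixed line counted as a table row, `#`-style headers only) are assumptions.

```python
# Hypothetical sketch of verifier-style metrics; the real verify.py may
# use different counting rules.
import hashlib
import re

FENCE = "`" * 3  # Markdown code fence marker


def measure_raw_output(data: bytes) -> dict:
    """Compute size, checksum, and simple Markdown structure counts."""
    text = data.decode("utf-8", errors="replace")
    lines = text.splitlines()
    code_blocks = table_rows = headers = 0
    in_fence = False
    for line in lines:
        stripped = line.lstrip()
        if stripped.startswith(FENCE):
            if not in_fence:
                code_blocks += 1  # count each block at its opening fence
            in_fence = not in_fence
        elif in_fence:
            continue  # ignore '#' and '|' lines inside code blocks
        elif stripped.startswith("|"):
            table_rows += 1  # assumption: separator rows count as rows
        elif re.match(r"#{1,6}\s", stripped):
            headers += 1
    return {
        "verified_file_size_bytes": len(data),
        "verified_md5_checksum": hashlib.md5(data).hexdigest(),
        "verified_total_lines": len(lines),
        "verified_total_words": len(text.split()),
        "verified_code_blocks": code_blocks,
        "verified_table_rows": table_rows,
        "verified_headers": headers,
    }


# Small in-memory sample standing in for a raw_output_{test_id}.txt file.
sample = "\n".join([
    "# Title", "", "| a | b |", "| - | - |", "| 1 | 2 |", "",
    FENCE + "python", "# not a header", FENCE, "",
]).encode("utf-8")
metrics = measure_raw_output(sample)
print(metrics["verified_total_lines"], metrics["verified_headers"])  # 9 1
```

For a real run, the bytes would come from the saved raw output file, for example via pathlib's Path.read_bytes(). Comparing the result with the agent_reported_* values shows where the agent's self-report diverges from the file on disk.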

Baseline Testing Path

Run T1 to establish surface-isolated behavioral baseline
Run T2 to isolate workspace effect against T1
Run T3 for ground truth retrieval measurements, verify T1
Run T4 to isolate surface effect on raw retrieval, verify T2
Run each test a minimum of 5 times per track to capture variance

Test IDs	Purpose	Key Question
`BL-1` `BL-2`	Baseline truncation threshold on small pages	What is the T1 vs T2 surface delta?
`SC-2`	Code blocks, API documentation	How does the web toolchain handle code structure?
`OP-1`	Fragment identifier navigation	Does Codex jump to a specific section via URL fragment?
`OP-4`	Auto-chunking above the `BL-3` ceiling	Does the agent fetch with multi-step tool chains?
`BL-3`	Hard ceiling	What is the absolute output limit across retrieval runs?
`SC-1` `SC-3` `SC-4`	Structured content	Does truncation respect Markdown boundaries?
`EC-1` `EC-3` `EC-6`	Edge cases	What are the failure modes and workspace substitution edge behaviors?

Rename raw output files after each run to preserve them for variance comparison; if results are consistent, remove the files instead so that stale output cannot contaminate later runs.
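
The renaming step can be scripted. The sketch below moves raw_output_{test_id}.txt to the next free numbered name so each run is kept. The _runN suffix and the function name are assumptions, not a convention the framework defines, and the directory holding the raw files depends on your setup.

```python
# Hypothetical helper: keep one raw output file per run by renaming it.
import tempfile
from pathlib import Path


def archive_raw_output(directory, test_id):
    """Rename raw_output_{test_id}.txt to the next free _runN name."""
    directory = Path(directory)
    source = directory / f"raw_output_{test_id}.txt"
    if not source.exists():
        return None  # nothing to archive for this test
    run = 1
    while (directory / f"raw_output_{test_id}_run{run}.txt").exists():
        run += 1  # skip names already used by earlier runs
    target = directory / f"raw_output_{test_id}_run{run}.txt"
    source.rename(target)
    return target


# Demonstration in a throwaway directory with placeholder file contents.
with tempfile.TemporaryDirectory() as tmp:
    (Path(tmp) / "raw_output_BL-1.txt").write_text("first run")
    first = archive_raw_output(tmp, "BL-1")
    (Path(tmp) / "raw_output_BL-1.txt").write_text("second run")
    second = archive_raw_output(tmp, "BL-1")
    print(first.name, second.name)
```

Because the original file name is freed after each call, the next run starts without a leftover raw output file.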

Analyzing Results

Examine hypothesis matching, surface-workspace effects, perception gap, and truncation analysis:

# Single track: full analysis or summary
python analyze.py --csv results/codex_interpreted/results.csv --summary
python analyze.py --csv results/codex_raw/results.csv --full

# Filter by track
python analyze.py --csv results/codex_interpreted/results.csv --track t1_codex_interpreted

# Compare interpreted tracks, T1 vs T2, isolates workspace effect
python analyze.py \
   --csv results/codex_interpreted/results.csv \
         results/vscode_codex_interpreted/results.csv --full

# Compare raw tracks, T3 vs T4, isolates surface effect on retrieval ceiling
python analyze.py \
   --csv results/codex_raw/results.csv \
         results/vscode_codex_raw/results.csv --full

# Compare all four tracks
python analyze.py \
   --csv results/codex_interpreted/results.csv \
         results/vscode_codex_interpreted/results.csv \
         results/codex_raw/results.csv \
         results/vscode_codex_raw/results.csv --full

Provide the full relative path, including the subdirectory, for example results/codex_interpreted/results.csv.
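
For a quick check outside analyze.py, the logged columns can be read with Python's csv module. The sketch below computes a truncation rate per track from the `track`, `truncated`, and `agent_reported_truncated` columns described earlier. It is an illustration under those column names, not a reimplementation of analyze.py, and the sample rows are made up.

```python
# Hypothetical sketch: per-track truncation rate from results CSV data.
import csv
import io
from collections import defaultdict


def truncation_rate_by_track(csv_files):
    """Return {track: fraction of rows flagged as truncated}."""
    totals = defaultdict(int)
    flagged = defaultdict(int)
    for handle in csv_files:
        for row in csv.DictReader(handle):
            track = row["track"]
            # Interpreted tracks log `truncated`; raw tracks log
            # `agent_reported_truncated`.
            value = row.get("truncated") or row.get("agent_reported_truncated") or ""
            totals[track] += 1
            if value.strip().lower() == "yes":
                flagged[track] += 1
    return {track: flagged[track] / totals[track] for track in totals}


# In-memory stand-ins for two results.csv files (illustrative values only).
interpreted = io.StringIO(
    "test_id,track,truncated\n"
    "BL-1,t1_codex_interpreted,yes\n"
    "BL-1,t1_codex_interpreted,no\n"
)
raw = io.StringIO(
    "test_id,track,agent_reported_truncated\n"
    "BL-1,t3_codex_raw,yes\n"
)
rates = truncation_rate_by_track([interpreted, raw])
print(rates)  # {'t1_codex_interpreted': 0.5, 't3_codex_raw': 1.0}
```

With real data, pass open file handles for the results.csv paths instead of the StringIO objects.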

Generating Summary Templates

Generate pre-structured Markdown summary templates to fill in after each test series:

# Single test, single track
python template.py --test BL-1 --track codex_raw

# All four tracks for a single test
python template.py --test BL-1 --all-tracks

# All tests for a single track
python template.py --track codex_raw --all-tests

# All 48 combinations
python template.py --all-tests --all-tracks

# Regenerate a template after changes to `TEST_URLS` or `TRACKS`
python template.py --test BL-1 --track codex_raw --overwrite

# Preview without writing a file
python template.py --test BL-1 --track codex_raw --preview

Templates are written to summaries/{track}/{test_id}_summary.md. Each template pre-populates the test conditions table, the run results table with track-appropriate columns, H1–H5 hypothesis sections with verdict placeholders, an emergent findings scaffold, and a log label summary table. Verdict reasoning, emergent findings prose, and log labels are left as placeholders for human completion.
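
The count of 48 follows from 12 test IDs across 4 tracks. The sketch below enumerates the expected template paths under that pattern. The two lists are assumptions copied from the test table and the --track examples in this document, not read from the framework's `TEST_URLS` or `TRACKS`, and the {track} path segment is assumed to match the --track value.

```python
# Hypothetical sketch: enumerate expected summary template paths.
from itertools import product

# Assumed from the test table in this document.
TEST_IDS = [
    "BL-1", "BL-2", "BL-3", "SC-1", "SC-2", "SC-3",
    "SC-4", "OP-1", "OP-4", "EC-1", "EC-3", "EC-6",
]
# Assumed from the --track examples in this document.
TRACK_NAMES = ["codex_interpreted", "vscode_interpreted", "codex_raw", "vscode_raw"]


def summary_paths(test_ids=TEST_IDS, tracks=TRACK_NAMES):
    """Build one summaries/{track}/{test_id}_summary.md path per pair."""
    return [
        f"summaries/{track}/{test_id}_summary.md"
        for track, test_id in product(tracks, test_ids)
    ]


paths = summary_paths()
print(len(paths))  # 48
print(paths[0])    # summaries/codex_interpreted/BL-1_summary.md
```

A list like this can be checked against the summaries directory to spot templates that have not been generated yet.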