# Codex Framework Reference
This framework generates standardized test prompts and logs CSV results, enabling consistent testing across cases, measurement tracking over time, truncation pattern identification, and retrieval behavior comparisons across four tracks: Codex IDE interpreted (T1), VS Code-Codex interpreted (T2), Codex IDE raw (T3), and VS Code-Codex raw (T4).
Requirements: Python 3.8+, OpenAI Codex, and the VS Code Codex extension.
## Installation
```bash
# Clone and/or navigate to the agent-ecosystem-testing directory
cd agent-ecosystem-testing

# Create a virtual environment
python3 -m venv venv

# Activate the virtual environment (Windows: venv\Scripts\activate)
source venv/bin/activate

# Install dependencies
pip install -r requirements.txt

# Navigate to the Codex testing directory
cd open-ai-codex-web-search
```
If the environment breaks for any reason, such as an incompatible Python version or accidental corruption, remove it with `rm -rf venv` and start over.
## Workflow
- **List Available Tests and Tracks**

  ```bash
  python framework.py --list-tests
  python framework.py --list-tracks
  ```
- **Generate Test Prompt for a Single Test**

  Prints a formatted test harness: the structured prompt to copy into the Codex chat window, the fields that require values, and an expected-size reference:

  ```bash
  # T1: GPT-interpreted, Codex IDE
  python framework.py --test BL-1 --track codex_interpreted
  # T2: GPT-interpreted, VS Code-Codex
  python framework.py --test BL-1 --track vscode_interpreted
  # T3: Raw verbatim output, Codex IDE
  python framework.py --test BL-1 --track codex_raw
  # T4: Raw verbatim output, VS Code-Codex
  python framework.py --test BL-1 --track vscode_raw
  ```
- **Copy Prompt → Run in Codex**

  - Review the terminal output → copy the prompt
  - Open the Codex IDE or VS Code-Codex chat window → paste the prompt
  - Inspect retrieval behavior → examine agent output
- **Assess Hypotheses**

  Before logging test results, assess the run against each hypothesis based on the agent's self-reported metrics and tool visibility output (a measurement and hypothesis-check sketch follows the field table below):

  | ID | Description | Question |
  |---|---|---|
  | H1 | Character-based truncation at fixed limit | Is there a ceiling at ~10–100 KB? |
  | H2 | Token-based truncation | Is there a ceiling at ~2,000 tokens? |
  | H3 | Structure-aware truncation | Does truncation fall on Markdown boundaries rather than arbitrary byte positions? |
  | H4 | Surface impact on retrieval behavior | Does the Codex IDE versus VS Code-Codex surface produce different retrieval behavior? |
  | H5 | Auto-chunking and/or pagination | Does the agent fetch with multi-step tool chains, or only when reasoned into it? |
- **Log Results**

  Run the interactive logger and follow the prompts. Fields are grouped by track: session fields first, then track-specific output fields, then hypothesis and notes. Quotation marks are not necessary; skip optional fields with Enter:

  ```bash
  # Call the logger
  python log.py
  # Logger prompts and validates fields before writing:
  # ✓ Result logged to results/codex-{track}/results.csv
  ```

  Verify key metrics before logging raw track runs:

  ```bash
  python verify.py BL-1 --surface codex
  # or
  python verify.py BL-1 --surface vscode
  ```

Framework fields logged per track:
| Column | Description | Example |
|---|---|---|
| `test_id` | Test identifier | `BL-1`, `SC-2`, `EC-1` |
| `timestamp` | ISO 8601 format | `2026-03-16T17:05:02.998376` |
| `date` | Date tested | `2026-03-16` |
| `url` | Full URL tested | `https://www.mongodb.com/docs...` |
| `track` | Test track | `t1_codex_interpreted`, `t3_codex_raw` |
| `surface` | Deployment surface | `codex`, `vscode_codex` |
| `method` | Retrieval method | `gpt-interpreted`, `raw` |
| `workspace_present` | Workspace available to agent? | `true`/`false` |
| `permission_level` | Agent permission setting | `default`, `auto-review`, `full-access` |
| `model_observed` | LLM reported in output | `GPT-5.5` |
| `model_intelligence_level` | LLM intelligence setting | `low`, `medium`, `high`, `extra high` |
| `input_est_chars` | Expected input size in characters | `87040` |
| `hypothesis_match` | Hypothesis success/failure | `H1-no`, `H2-yes`, `H4-untested` |
| `codex_version` | Codex version string | `1.0.0` |
| `notes` | Observations | `web tool invoked` |
| `tools_named` | Tool names reported in agent output | `web`, `web.open`, `curl` |
| `workspace_substitution` | Local execution instead of web fetch? | `yes`/`no`/`unknown` |
| `output_chars` | T1/T2: agent-measured output length | `27890` |
| `truncated` | T1/T2: truncation detected | `yes`/`no` |
| `truncation_point` | T1/T2: section/line truncation spot | `L477` |
| `tokens_est` | T1/T2: estimated token count | `16890` |
| `tools_used`* | T3/T4: observed tool chain | `web -> web.open` |
| `tools_blocked`* | T3/T4: tools requested, but skipped | `curl` |
| `execution_attempts`* | T3/T4: total tool calls, fallbacks | `3` |
| `agent_reported_output_chars`* | T3/T4: agent-measured char count | `9876` |
| `agent_reported_truncated`* | T3/T4: agent-measured truncation | `yes`/`no` |
| `agent_reported_tokens_est`* | T3/T4: agent-estimated token count | `2469` |
| `agent_reported_file_size_bytes`* | T3/T4: agent-measured file size in bytes | `4817` |
| `agent_reported_md5_checksum`* | T3/T4: agent-measured MD5 | `abc123...` |
| `agent_reported_lines`* | T3/T4: agent-measured line count | `143` |
| `agent_reported_words`* | T3/T4: agent-measured word count | `564` |
| `agent_reported_code_blocks`* | T3/T4: agent-measured code block count | `2` |
| `agent_reported_table_rows`* | T3/T4: agent-measured table row count | `57` |
| `agent_reported_headers`* | T3/T4: agent-measured header count | `4` |
| `verified_file_size_bytes`* | T3/T4: verifier-measured file size in bytes | `4817` |
| `verified_md5_checksum`* | T3/T4: verifier-measured MD5 | `d6ad8451d3778bf3544574...` |
| `verified_total_lines`* | T3/T4: verifier-measured line count | `143` |
| `verified_total_words`* | T3/T4: verifier-measured word count | `564` |
| `verified_tokens`* | T3/T4: verifier-measured token count | `197` |
| `verified_chars_per_token`* | T3/T4: verifier-measured chars/token ratio | `4.43` |
| `verified_code_blocks`* | T3/T4: verifier-measured code block count | `2` |
| `verified_table_rows`* | T3/T4: verifier-measured table row count | `57` |
| `verified_headers`* | T3/T4: verifier-measured header count | `4` |

\*Optional field, raw tracks only.

`agent_reported_*` fields may reflect tool output or payload estimates; `verify.py` calculates `verified_*` values against `raw_output_{test_id}.txt` files.
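To make the assess-and-verify steps concrete, here is a minimal sketch of the kind of measurement and hypothesis check involved. It is illustrative, not the actual `verify.py` or `log.py` code: the `raw_output_{test_id}.txt` naming comes from the note above, while the ~4 chars/token estimate and the H1/H2 thresholds are assumptions standing in for the ceilings the framework is trying to find.

```python
import hashlib
import re
from pathlib import Path

# Illustrative ceilings only: the real limits are what H1 (~10-100 KB)
# and H2 (~2,000 tokens) are trying to pin down.
H1_CHAR_CEILING = 100_000
H2_TOKEN_CEILING = 2_000

def measure_raw_output(test_id, directory="."):
    """Approximate the verified_* metrics for raw_output_{test_id}.txt."""
    path = Path(directory) / f"raw_output_{test_id}.txt"
    data = path.read_bytes()
    text = data.decode("utf-8", errors="replace")
    fence = "`" * 3  # a Markdown code fence
    tokens_est = max(1, len(text) // 4)  # rough ~4 chars/token heuristic
    return {
        "verified_file_size_bytes": len(data),
        "verified_md5_checksum": hashlib.md5(data).hexdigest(),
        "verified_total_lines": len(text.splitlines()),
        "verified_total_words": len(text.split()),
        "verified_tokens": tokens_est,
        "verified_chars_per_token": round(len(text) / tokens_est, 2),
        "verified_code_blocks": text.count(fence) // 2,
        "verified_table_rows": len(re.findall(r"^\|.*\|$", text, re.M)),
        "verified_headers": len(re.findall(r"^#{1,6} ", text, re.M)),
    }

metrics = measure_raw_output("BL-1")
print(metrics)
print("H1 char ceiling hit:", metrics["verified_file_size_bytes"] >= H1_CHAR_CEILING)
print("H2 token ceiling hit:", metrics["verified_tokens"] >= H2_TOKEN_CEILING)
```

Comparing a dictionary like this against the agent's self-reported numbers is the perception gap the analysis step examines.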
## Baseline Testing Path
- Run T1 to establish the surface-isolated behavioral baseline
- Run T2 to isolate the workspace effect against T1
- Run T3 for ground-truth retrieval measurements; verify T1
- Run T4 to isolate the surface effect on raw retrieval; verify T2
- Run each test a minimum of 5 times per track to capture variance
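Because each test runs at least five times per track, it is worth checking how stable the measurements actually are. The sketch below groups interpreted-track runs by test ID and reports the spread of `output_chars`; it is an illustrative check using the column names from the field table above, not part of `analyze.py`.

```python
import csv
from collections import defaultdict
from statistics import mean, pstdev

def run_variance(csv_path):
    """Report per-test spread of output_chars across repeated runs."""
    runs = defaultdict(list)
    with open(csv_path, newline="") as f:
        for row in csv.DictReader(f):
            if row.get("output_chars"):  # T1/T2 rows only
                runs[row["test_id"]].append(int(row["output_chars"]))
    for test_id, chars in sorted(runs.items()):
        print(f"{test_id}: n={len(chars)} mean={mean(chars):.0f} "
              f"stdev={pstdev(chars):.0f}")

run_variance("results/codex_interpreted/results.csv")
```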
| Test IDs | Purpose | Key Question |
|---|---|---|
| `BL-1`, `BL-2` | Baseline truncation threshold on small pages | What is the T1 vs T2 surface delta? |
| `SC-2` | Code blocks, API documentation | How does the web toolchain handle code structure? |
| `OP-1` | Fragment identifier navigation | Does Codex jump to a specific section via URL fragment? |
| `OP-4` | Auto-chunking above the BL-3 ceiling | Does the agent fetch with multi-step tool chains? |
| `BL-3` | Hard ceiling | What is the absolute output limit across retrieval runs? |
| `SC-1`, `SC-3`, `SC-4` | Structured content | Does truncation respect Markdown boundaries? |
| `EC-1`, `EC-3`, `EC-6` | Edge cases | What are the failure modes and workspace substitution edge behaviors? |
Rename raw output files between runs to capture variance, as in the sketch below; once results are consistent, remove the files to prevent test contamination between runs.
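For example, a small helper can archive each capture under a run-numbered name before the next run; the `_run{n}` suffix here is an illustrative convention, not one the framework defines.

```python
from pathlib import Path

def archive_raw_output(test_id, run, directory="."):
    """Move raw_output_{test_id}.txt aside so a stale capture cannot
    contaminate the next run, while keeping it for variance checks."""
    src = Path(directory) / f"raw_output_{test_id}.txt"
    dst = src.with_name(f"raw_output_{test_id}_run{run}.txt")
    src.rename(dst)
    return dst

archive_raw_output("BL-1", run=1)
```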
## Analyzing Results

Examine hypothesis matching, surface-workspace effects, the perception gap, and truncation analysis:
```bash
# Single track: full analysis or summary
python analyze.py --csv results/codex_interpreted/results.csv --summary
python analyze.py --csv results/codex_raw/results.csv --full

# Filter by track
python analyze.py --csv results/codex_interpreted/results.csv --track t1_codex_interpreted

# Compare interpreted tracks (T1 vs T2); isolates the workspace effect
python analyze.py \
    --csv results/codex_interpreted/results.csv \
    results/vscode_codex_interpreted/results.csv --full

# Compare raw tracks (T3 vs T4); isolates the surface effect on the retrieval ceiling
python analyze.py \
    --csv results/codex_raw/results.csv \
    results/vscode_codex_raw/results.csv --full

# Compare all four tracks
python analyze.py \
    --csv results/codex_interpreted/results.csv \
    results/vscode_codex_interpreted/results.csv \
    results/codex_raw/results.csv \
    results/vscode_codex_raw/results.csv --full
```
Provide the full relative path, including the subdirectory: `results/codex_interpreted/results.csv`.
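As one example of the perception-gap analysis, the sketch below compares the agent-reported file size against the verifier-measured size for each raw-track run. It uses the CSV columns from the field table earlier; `analyze.py`'s actual computation may differ.

```python
import csv

def perception_gap(csv_path):
    """Per-run difference between agent-reported and verified file size.
    Rows missing either column (e.g. interpreted-track runs) are skipped."""
    with open(csv_path, newline="") as f:
        for row in csv.DictReader(f):
            reported = row.get("agent_reported_file_size_bytes")
            verified = row.get("verified_file_size_bytes")
            if reported and verified:
                gap = int(reported) - int(verified)
                print(f"{row['test_id']}: agent={reported} "
                      f"verified={verified} gap={gap:+d}")

perception_gap("results/codex_raw/results.csv")
```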
## Generating Summary Templates
Generate pre-structured Markdown summary templates to fill in after each test series:
```bash
# Single test, single track
python template.py --test BL-1 --track codex_raw

# All four tracks for a single test
python template.py --test BL-1 --all-tracks

# All tests for a single track
python template.py --track codex_raw --all-tests

# All 48 combinations
python template.py --all-tests --all-tracks

# Regenerate a template after changes to TEST_URLS or TRACKS
python template.py --test BL-1 --track codex_raw --overwrite

# Preview without writing a file
python template.py --test BL-1 --track codex_raw --preview
```
Templates are written to `summaries/{track}/{test_id}_summary.md`. Each template pre-populates the test conditions table, a run results table with track-appropriate columns, H1–H5 hypothesis sections with verdict placeholders, an emergent findings scaffold, and a log label summary table. Verdict reasoning, emergent findings prose, and log labels are left as `<!-- TODO -->` placeholders for human completion.