Agent Ecosystem Testing

Codex Framework Reference

This framework generates standardized test prompts and logs results to CSV, enabling consistent testing across cases, measurement tracking over time, truncation pattern identification, and retrieval behavior comparisons across four tracks: Codex interpreted, VS Code-Codex interpreted, Codex raw, and VS Code-Codex raw.
Requirements: Python 3.8+, OpenAI Codex Desktop, and VS Code Codex Extension


Installation

# Clone and/or navigate to agent-ecosystem-testing directory
cd agent-ecosystem-testing

# Create virtual environment
python3 -m venv venv

# Activate virtual environment
# Windows: venv\Scripts\activate
source venv/bin/activate

# Install dependencies
pip install -r requirements.txt

# Navigate to Codex testing directory
cd open-ai-codex-web-search

If the Python version is incompatible or the virtual environment becomes corrupted, remove it with rm -rf venv and start over.


Baseline Testing Path

  1. Run T1 to establish surface-isolated behavioral baseline
  2. Run T2 to isolate workspace effect against T1
  3. Run T3 for ground truth retrieval measurements, verify T1
  4. Run T4 to isolate surface effect on raw retrieval, verify T2
  5. Run each test a minimum of 5x/track to capture variance

Test IDs          Purpose                                                      Key Question
BL-1, BL-2        Baseline truncation threshold on small pages                 What’s the T1 vs T2 surface delta?
SC-2              API docs - code blocks                                       How does the web toolchain handle code structure?
OP-1              Fragment identifier navigation                               Does Codex jump to a specific section via URL fragment?
OP-2              Midrange reference - headings, code blocks                   Does behavior change at midrange size with structured content?
OP-4              Auto-chunking above BL-3 ceiling                             Does the agent fetch with multi-step tool chains?
BL-3              Hard ceiling                                                 What’s the absolute output limit across retrieval runs?
SC-1, SC-3, SC-4  Structured content - API docs, table-heavy, nested headings  Does truncation respect Markdown boundaries?
EC-1, EC-3, EC-6  Edge cases - line-wrapping, JSON redirect, SPA               What are the failure modes and workspace substitution edge behaviors?

Rename output files between runs to capture variance; if results are consistent, remove the extra files to prevent contamination.
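The baseline path above amounts to a tests x tracks x repetitions matrix. A minimal sketch that lists the prompt-generation commands for that matrix, assuming the four track names shown under Workflow; `build_run_matrix` is a hypothetical helper, and it only builds command strings without invoking the framework:

```python
from itertools import product

# Track names as passed to framework.py --track (T1-T4).
TRACKS = [
    "codex-interpreted",         # T1
    "vscode-codex-interpreted",  # T2
    "codex-raw",                 # T3
    "vscode-codex-raw",          # T4
]


def build_run_matrix(test_ids, tracks=TRACKS, runs=5):
    """Return (run_number, command) pairs covering every test/track combination."""
    matrix = []
    for test_id, track in product(test_ids, tracks):
        command = f"python scripts/framework.py --test {test_id} --track {track}"
        for run in range(1, runs + 1):
            matrix.append((run, command))
    return matrix


if __name__ == "__main__":
    for run, command in build_run_matrix(["BL-1", "BL-2"]):
        print(f"run {run}: {command}")
```

Keeping the run number next to each command makes it easier to name output files per run when capturing variance.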


Workflow

  1. List Available Tests and Tracks

    python scripts/framework.py --list-tests
    python scripts/framework.py --list-tracks
    
  2. Generate Test Prompt for a Single Test

    # T1: GPT-interpreted, Codex Desktop
    python scripts/framework.py --test BL-1 --track codex-interpreted
    
    # T2: GPT-interpreted, Codex Extension
    python scripts/framework.py --test BL-1 --track vscode-codex-interpreted
    
    # T3: Raw verbatim output, Codex Desktop
    python scripts/framework.py --test BL-1 --track codex-raw
    
    # T4: Raw verbatim output, Codex Extension
    python scripts/framework.py --test BL-1 --track vscode-codex-raw
    
  3. Copy Prompt → Run in Codex

    • Review the terminal output → copy the prompt
    • Open the Codex IDE or VS Code-Codex chat window → paste the prompt
    • Inspect retrieval behavior → examine agent output
  4. Assess Hypotheses Against Agent Output

    ID  Description                                Question
    H1  Character-based truncation at fixed limit  Is there a ceiling at ~10–100 KB?
    H2  Token-based truncation                     Is there a ceiling at ~2,000 tokens?
    H3  Structure-aware truncation                 Does truncation fall on Markdown boundaries rather than arbitrary byte positions?
    H4  Surface impact on retrieval behavior       Does the Codex IDE versus VS Code-Codex surface produce different retrieval behavior?
    H5  Auto-chunking and/or pagination            Does the agent fetch with multi-step tool chains, or only when reasoned into it?
  5. Examine, Log, Analyze
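For H3, one way to check where a cut falls is to look at how the retrieved text ends. A rough sketch, with a hypothetical `classify_cut` helper and a deliberately simple notion of "boundary" (end of line, or inside an unclosed code fence); it is a heuristic for eyeballing agent output, not how the framework itself scores H3:

```python
FENCE = "`" * 3  # Markdown code fence marker


def classify_cut(text):
    """Classify where retrieved text ends relative to simple Markdown structure."""
    if not text.strip():
        return "empty"
    # An odd number of fence lines means the text stops inside a code block.
    fence_lines = [ln for ln in text.splitlines() if ln.lstrip().startswith(FENCE)]
    if len(fence_lines) % 2 == 1:
        return "inside-code-block"
    # Text ending with a newline stopped at a line boundary.
    if text.endswith("\n"):
        return "line-boundary"
    return "mid-line"


if __name__ == "__main__":
    print(classify_cut("# Title\n\nBody text.\n"))  # ends cleanly on a line
    print(classify_cut("# Title\n\nBody te"))       # ends mid-line
```

A "mid-line" or "inside-code-block" result points toward byte- or token-based cuts (H1/H2) rather than structure-aware truncation.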


Rollout Observability

Examine the logs under ~/.codex/sessions/rollouts for session structure and anomalies. Point the scripts at results/{track}/artifacts/rollouts for parsing, for example:

results/vscode-codex-interpreted/artifacts/rollouts/SC-2/rollout-2026-06-11T14-08-50-....jsonl
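Rollout files use the .jsonl extension, so each line can be parsed on its own. A minimal sketch for a quick look at what a file contains, assuming every record is a JSON object with a top-level `type` field; that field name is an assumption about the log schema, and `count_record_types` is a hypothetical helper rather than part of the scripts described here:

```python
import json
from collections import Counter


def count_record_types(lines):
    """Count records per top-level "type" value in an iterable of JSONL lines."""
    counts = Counter()
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        try:
            record = json.loads(line)
        except json.JSONDecodeError:
            counts["<unparseable>"] += 1
            continue
        if isinstance(record, dict):
            # "type" is an assumed field name; records without it are grouped together.
            counts[str(record.get("type", "<no type>"))] += 1
        else:
            counts["<not an object>"] += 1
    return counts
```

Pass any iterable of lines, such as an open file handle: `count_record_types(open(path, encoding="utf-8"))`.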

Session Overview

session_reader.py produces a structured report from one or more rollouts including session metadata, model, sandbox policy, skills, token usage, tool calls, reasoning presence, and the conversation.

# Text report to stdout
python scripts/session_reader.py results/vscode-codex-interpreted/artifacts/rollouts/SC-2/rollout-*.jsonl

# HTML report
python scripts/session_reader.py results/vscode-codex-interpreted/artifacts/rollouts/SC-2/rollout-*.jsonl -o report.html

# List sessions and filter by ID
python scripts/session_reader.py results/vscode-codex-interpreted/artifacts/rollouts/SC-2/rollout-*.jsonl --list-sessions
python scripts/session_reader.py results/vscode-codex-interpreted/artifacts/rollouts/SC-2/rollout-*.jsonl --session-id <id>

Rollout Audit

rollout_audit.py checks logs for duplicate emissions, timing drift, mismatches between the live event stream and the transcript, post-completion records, and tool-call counts. It exits with a nonzero status if it finds any anomaly. For each session, the audit reports:

Category                 Reported
Identity                 Session id, LLM, reasoning effort, CLI version, test prompt ID if present
Emission Counts          User messages, commentary updates, final answers, reasoning blocks
API Call Counts          web_search calls, function/tool calls, by tool name
Duplicate Detection      Any final answer generated more than once, whether event_msg, response_item, task_complete.last_agent_message copies match
Post-completion Records  Anything appended after the last task_complete
Timing                   Duration, time to first token, wall clock between first-last record
Token Usage              From final token_count event

# Audit a test's rollouts
python scripts/rollout_audit.py results/vscode-codex-interpreted/artifacts/rollouts/SC-2/rollout-*.jsonl

# Audit all rollouts for a track, write a CSV
python scripts/rollout_audit.py results/vscode-codex-interpreted/artifacts/rollouts/*/*.jsonl --csv audit.csv
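The wall-clock figure in the Timing row is the gap between the first and last record. A sketch of that calculation, assuming each record carries an ISO 8601 `timestamp` string (an assumed field name) and that all timestamps are either timezone-aware or all naive; `wall_clock_seconds` is a hypothetical helper, and the audit script's own implementation may differ:

```python
from datetime import datetime


def parse_ts(value):
    """Parse an ISO 8601 timestamp, accepting a trailing 'Z' for UTC."""
    # datetime.fromisoformat() before Python 3.11 rejects 'Z', so normalize it.
    if value.endswith("Z"):
        value = value[:-1] + "+00:00"
    return datetime.fromisoformat(value)


def wall_clock_seconds(records):
    """Seconds between the earliest and latest timestamped records, or None."""
    stamps = [parse_ts(r["timestamp"]) for r in records if "timestamp" in r]
    if len(stamps) < 2:
        return None  # not enough timestamped records to measure a span
    return (max(stamps) - min(stamps)).total_seconds()
```

Using min and max rather than the first and last list positions keeps the result stable if records were appended out of order.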

Rollout Decode

rollout_decode.py converts logs into three readable views:

  • --timeline, default: chronological summary of events, tool calls, and messages
  • --census: record and payload type inventory with field frequencies
  • --pretty: full indented JSON of every record, with encrypted reasoning blobs elided

The timeline output distinguishes UI-facing events (AGENT, WEB, SHELL) from the LLM-facing transcript copies (AGENT*, WEB*, FINAL*). THINK blocks are encrypted and unreadable; TOKENS rows are cumulative session usage checkpoints.

# Timeline for a test
python scripts/rollout_decode.py results/vscode-codex-interpreted/artifacts/rollouts/SC-2/rollout-*.jsonl --timeline

# Census: what record and payload types exist in logs
python scripts/rollout_decode.py results/vscode-codex-interpreted/artifacts/rollouts/SC-2/rollout-*.jsonl --census

# Pretty-print only web_search_call records
python scripts/rollout_decode.py results/vscode-codex-interpreted/artifacts/rollouts/SC-2/rollout-*.jsonl --pretty --grep web_search_call

# Write timeline to a Markdown file
python scripts/rollout_decode.py results/vscode-codex-interpreted/artifacts/rollouts/SC-2/rollout-*.jsonl --timeline --md decoded.md

Logging

Run the interactive logger and follow the prompts. Fields are grouped by track: session fields first, then track-specific output fields, then hypothesis and notes. Quotation marks are not necessary; press Enter to skip optional fields:

# Call logger
python scripts/log.py

# Logger prompts for and validates fields before writing
✓ Result logged to results/codex-{track}/results.csv

Verify key metrics before logging raw track runs with python scripts/verify.py {test_id}. When logging Track 2 results, pull the matching Track 1 record with python scripts/query.py --test {test_id} --models {model}.
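The `hypothesis_match` field listed under Framework Fields packs several verdicts into one comma-separated string (for example `H1-no, H2-yes, H4-untested`). A sketch of how such a value could be split for later analysis, assuming verdicts are limited to `yes`, `no`, and `untested` as in that example; `parse_hypothesis_match` is a hypothetical helper, and the logger's own validation may accept other values:

```python
import re

# One entry looks like "H1-no"; the allowed verdicts are an assumption
# based on the example value in the field table.
ENTRY = re.compile(r"^(H\d+)-(yes|no|untested)$")


def parse_hypothesis_match(value):
    """Turn "H1-no, H2-yes" into {"H1": "no", "H2": "yes"}; raise on malformed entries."""
    verdicts = {}
    for part in value.split(","):
        part = part.strip()
        if not part:
            continue  # tolerate empty input and trailing commas
        match = ENTRY.match(part)
        if match is None:
            raise ValueError(f"unrecognized hypothesis entry: {part!r}")
        verdicts[match.group(1)] = match.group(2)
    return verdicts
```

Raising on malformed entries, rather than skipping them, surfaces typos before they reach results.csv.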

Framework Fields

Column                           Description                                                       Example
test_id                          Test identifier                                                   BL-1, SC-2, EC-1
timestamp                        ISO 8601 format                                                   2026-03-16T17:05:02.998376
date                             Date tested                                                       2026-03-16
url                              Full URL tested                                                   https://www.mongodb.com/docs...
track                            Test track                                                        t1_codex_interpreted, t3_codex_raw
surface                          Deployment surface                                                codex, vscode_codex
method                           Retrieval method                                                  gpt-interpreted, raw
workspace_present                Workspace available to agent?                                     true/false
permission_level                 Agent permission setting                                          default, auto-review, full-access
model_observed                   LLM reported in output                                            GPT-5.5
model_intelligence_level         LLM intelligence setting                                          low, medium, high, extra high
input_est_chars                  Expected input size in characters                                 87040
hypothesis_match                 Hypothesis success/failure                                        H1-no, H2-yes, H4-untested
codex_version                    Codex version string                                              1.0.0
notes                            Observations                                                      web tool invoked
tools_named                      Tool names reported in agent output                               web, web.open, curl
workspace_substitution           Local execution instead of web fetch?                             yes/no/unknown
output_chars                     T1/T2: agent-measured output length                               27890
truncated                        T1/T2: truncation status                                          yes/no/mixed/implicit
truncation_note                  T1/T2: location, layer, or characterization                       web.open partial, curl complete
tokens_est                       T1/T2: estimated token count                                      16890
tools_used*                      T3/T4: observed tool chain                                        web -> web.open
tools_blocked*                   T3/T4: tools requested, but skipped                               curl
execution_attempts*              T3/T4: total tool calls, fallbacks                                3
escalation_trigger*              T3/T4: what drove tool escalation                                 automatic, contaminated, none, reasoned
artifact_path*                   T3/T4: path of agent-written file                                 /private/tmp/bl1_response.html
artifact_size_bytes*             T3/T4: agent-written file size                                    505339
last_50_chars*                   T3/T4: retrieved content verbatim; cross-reference via verify.py  ])</script></body></html>
agent_reported_output_chars*     T3/T4: agent-measured char count                                  9876
agent_reported_truncated*        T3/T4: agent-measured truncation status                           yes/no/mixed/implicit
agent_reported_truncation_note*  T3/T4: agent-reported location, layer or characterization         curl complete, web.open partial at L477
agent_reported_tokens_est*       T3/T4: agent-estimated token count                                2469
agent_reported_file_size_bytes*  T3/T4: agent-measured file size                                   4817
agent_reported_md5_checksum*     T3/T4: agent-measured MD5                                         abc123...
agent_reported_lines*            T3/T4: agent-measured line count                                  143
agent_reported_words*            T3/T4: agent-measured word count                                  564
agent_reported_code_blocks*      T3/T4: agent-measured code block count                            2
agent_reported_table_rows*       T3/T4: agent-measured table row count                             57
agent_reported_headers*          T3/T4: agent-measured header count                                4
verified_file_size_bytes*        T3/T4: verifier-measured file size                                4817
verified_md5_checksum*           T3/T4: verifier-measured MD5                                      d6ad8451d3778bf3544574...
verified_total_lines*            T3/T4: verifier-measured line count                               143
verified_total_words*            T3/T4: verifier-measured word count                               564
verified_tokens*                 T3/T4: verifier-measured token count                              197
verified_chars_per_token*        T3/T4: verifier-measured chars/token ratio                        4.43
verified_code_blocks*            T3/T4: verifier-measured code block count                         2
verified_table_rows*             T3/T4: verifier-measured table row count                          57
verified_headers*                T3/T4: verifier-measured header count                             4

*Optional field, raw tracks only. agent_reported* fields reflect tool output or payload estimates.
verify.py calculates verified* values against raw_output_{test_id}.txt files.
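As a rough picture of what the verified* columns measure, the following sketch derives size, checksum, and structure counts from a raw output string using only the standard library. The counting rules (fence lines for code blocks, a leading `|` for table rows, a leading `#` for headers) are assumptions, `measure_raw_output` is a hypothetical name, and token counts are left out because the tokenizer verify.py uses is not stated here:

```python
import hashlib

FENCE = "`" * 3  # Markdown code fence marker


def measure_raw_output(text):
    """Compute file-level and structural counts for a raw output string."""
    data = text.encode("utf-8")
    lines = text.splitlines()
    fences = headers = table_rows = 0
    in_code = False
    for line in lines:
        stripped = line.lstrip()
        if stripped.startswith(FENCE):
            fences += 1
            in_code = not in_code
        elif in_code:
            continue  # ignore '#' and '|' inside code blocks
        elif stripped.startswith("#"):
            headers += 1
        elif stripped.startswith("|"):
            table_rows += 1  # includes header and separator rows
    return {
        "file_size_bytes": len(data),
        "md5_checksum": hashlib.md5(data).hexdigest(),
        "total_lines": len(lines),
        "total_words": len(text.split()),
        "code_blocks": fences // 2,  # a complete block has an opening and a closing fence
        "table_rows": table_rows,
        "headers": headers,
    }
```

Comparing this kind of independent measurement against the agent_reported* values is what makes discrepancies in the agent's self-reporting visible.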


Analysis

Examine hypothesis matching, surface-workspace effects, perception gap, and truncation analysis:

# Single track full analysis or summary
python scripts/analyze.py --csv results/codex-interpreted/results.csv --summary
python scripts/analyze.py --csv results/codex-raw/results.csv --full

# Filter by track
python scripts/analyze.py --csv results/codex-interpreted/results.csv --track t1_codex_interpreted

# Compare interpreted tracks T1 vs T2
python scripts/analyze.py \
   --csv results/codex-interpreted/results.csv \
         results/vscode-codex-interpreted/results.csv --full

# Compare raw tracks T3 vs T4
python scripts/analyze.py \
   --csv results/codex-raw/results.csv \
         results/vscode-codex-raw/results.csv --full

# Compare all tracks
python scripts/analyze.py \
   --csv results/codex-interpreted/results.csv \
         results/vscode-codex-interpreted/results.csv \
         results/codex-raw/results.csv \
         results/vscode-codex-raw/results.csv --full

Provide the full relative path including the subdirectory, for example results/codex-interpreted/results.csv.
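Reading the perception gap as the difference between agent-reported and verifier-measured values (an interpretation, not a definition taken from analyze.py), the comparison can be sketched for raw-track rows with the `agent_reported_file_size_bytes` and `verified_file_size_bytes` columns from Framework Fields. `perception_gaps` is a hypothetical helper, and the sample row below is made up for illustration:

```python
import csv
import io


def perception_gaps(csv_file,
                    reported="agent_reported_file_size_bytes",
                    verified="verified_file_size_bytes"):
    """Return (test_id, reported - verified) for rows where both values are present."""
    gaps = []
    for row in csv.DictReader(csv_file):
        r = (row.get(reported) or "").strip()
        v = (row.get(verified) or "").strip()
        if not r or not v:
            continue  # optional raw-track fields may be empty
        try:
            gaps.append((row.get("test_id", ""), int(r) - int(v)))
        except ValueError:
            continue  # skip non-numeric values
    return gaps


if __name__ == "__main__":
    # Illustrative in-memory CSV; real data would come from a results.csv file.
    sample = io.StringIO(
        "test_id,agent_reported_file_size_bytes,verified_file_size_bytes\n"
        "BL-1,4817,4817\n"
    )
    print(perception_gaps(sample))
```

The same function works for other reported/verified pairs, such as line or word counts, by passing different column names.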