Agent Ecosystem Testing

Codex Framework Reference

This framework generates standardized test prompts and logs results to CSV, enabling consistent testing across cases, measurement tracking over time, truncation pattern identification, and retrieval behavior comparisons across four tracks: Codex IDE interpreted, VS Code-Codex interpreted, Codex IDE raw, and VS Code-Codex raw.
Requirements: Python 3.8+, OpenAI Codex, and the VS Code Codex extension


Installation

# Clone and/or navigate to `agent-ecosystem-testing` directory
cd agent-ecosystem-testing

# Create virtual environment
python3 -m venv venv

# Activate virtual environment
# Windows: venv\Scripts\activate
source venv/bin/activate

# Install dependencies
pip install -r requirements.txt

# Navigate to the Codex testing directory
cd open-ai-codex-web-search

If the environment breaks for any reason, such as an incompatible Python version or accidental corruption, run `rm -rf venv` to remove the virtual environment and start over


Workflow

  1. List Available Tests and Tracks

    python framework.py --list-tests
    python framework.py --list-tracks
    
  2. Generate Test Prompt for a Single Test

    Print a formatted test harness containing a structured prompt to copy into the Codex chat window, the fields that require values, and an expected-size reference:

    # T1: GPT-interpreted, Codex IDE
    python framework.py --test BL-1 --track codex_interpreted
    
    # T2: GPT-interpreted, VS Code-Codex
    python framework.py --test BL-1 --track vscode_interpreted
    
    # T3: Raw verbatim output, Codex IDE
    python framework.py --test BL-1 --track codex_raw
    
    # T4: Raw verbatim output, VS Code-Codex
    python framework.py --test BL-1 --track vscode_raw
    
  3. Copy Prompt → Run in Codex

    • Review the terminal output → copy the prompt
    • Open the Codex IDE or VS Code-Codex chat window → paste the prompt
    • Inspect retrieval behavior → examine agent output
  4. Assess Hypotheses

    Before logging test results, assess the run against each hypothesis based on the agent’s self-reported metrics and tool visibility output:

    | ID | Description | Question |
    |----|-------------|----------|
    | H1 | Character-based truncation at fixed limit | Is there a ceiling at ~10–100 KB? |
    | H2 | Token-based truncation | Is there a ceiling at ~2,000 tokens? |
    | H3 | Structure-aware truncation | Does truncation fall on Markdown boundaries rather than arbitrary byte positions? |
    | H4 | Surface impact on retrieval behavior | Does the Codex IDE versus VS Code-Codex surface produce different retrieval behavior? |
    | H5 | Auto-chunking and/or pagination | Does the agent fetch with multi-step tool chains, or only when reasoned into it? |
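
    If you want a quick ad-hoc check before logging, the self-reported numbers can be compared against the H1/H2 ceilings in a few lines of Python. This is a minimal sketch, not part of the framework; the threshold constants mirror the table above, and the example values come from the framework fields table in step 5:

    # quick_hypothesis_check.py -- illustrative only, not part of the framework
    # Ceilings mirror the H1/H2 questions above; refine them as observations accumulate.
    CHAR_CEILING_LOW, CHAR_CEILING_HIGH = 10_000, 100_000   # H1: ~10-100 KB
    TOKEN_CEILING = 2_000                                    # H2: ~2,000 tokens

    def assess(output_chars, tokens_est, input_est_chars):
        """Flag which truncation hypotheses a single run is consistent with."""
        truncated = output_chars < input_est_chars
        return {
            "H1": truncated and CHAR_CEILING_LOW <= output_chars <= CHAR_CEILING_HIGH,
            "H2": truncated and abs(tokens_est - TOKEN_CEILING) / TOKEN_CEILING < 0.10,
        }

    # Example values taken from the framework fields table (step 5)
    print(assess(output_chars=27890, tokens_est=16890, input_est_chars=87040))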
  5. Log Results

    Run the interactive logger and follow the prompts. Fields are grouped by track: session fields first, then track-specific output fields, then hypothesis and notes. Quotation marks are not necessary; skip optional fields by pressing Enter:

    # Call the logger
    python log.py
    
    # Logger prompts and validates fields before writing
    ✓ Result logged to results/codex-{track}/results.csv
    

    Verify key metrics before logging raw track runs:

    python verify.py BL-1 --surface codex
    python verify.py BL-1 --surface vscode

    Framework fields logged per track:

    | Column | Description | Example |
    |--------|-------------|---------|
    | test_id | Test identifier | BL-1, SC-2, EC-1 |
    | timestamp | ISO 8601 format | 2026-03-16T17:05:02.998376 |
    | date | Date tested | 2026-03-16 |
    | url | Full URL tested | https://www.mongodb.com/docs... |
    | track | Test track | t1_codex_interpreted, t3_codex_raw |
    | surface | Deployment surface | codex, vscode_codex |
    | method | Retrieval method | gpt-interpreted, raw |
    | workspace_present | Workspace available to agent? | true/false |
    | permission_level | Agent permission setting | default, auto-review, full-access |
    | model_observed | LLM reported in output | GPT-5.5 |
    | model_intelligence_level | LLM intelligence setting | low, medium, high, extra high |
    | input_est_chars | Expected input size in characters | 87040 |
    | hypothesis_match | Hypothesis success/failure | H1-no, H2-yes, H4-untested |
    | codex_version | Codex version string | 1.0.0 |
    | notes | Observations | web tool invoked |
    | tools_named | Tool names reported in agent output | web, web.open, curl |
    | workspace_substitution | Local execution instead of web fetch? | yes/no/unknown |
    | output_chars | T1/T2: agent-measured output length | 27890 |
    | truncated | T1/T2: truncation detected | yes/no |
    | truncation_point | T1/T2: section/line truncation spot | L477 |
    | tokens_est | T1/T2: estimated token count | 16890 |
    | tools_used* | T3/T4: observed tool chain | web -> web.open |
    | tools_blocked* | T3/T4: tools requested, but skipped | curl |
    | execution_attempts* | T3/T4: total tool calls, fallbacks | 3 |
    | agent_reported_output_chars* | T3/T4: agent-measured char count | 9876 |
    | agent_reported_truncated* | T3/T4: agent-measured truncation | yes/no |
    | agent_reported_tokens_est* | T3/T4: agent-estimated token count | 2469 |
    | agent_reported_file_size_bytes* | T3/T4: agent-measured file size in bytes | 4817 |
    | agent_reported_md5_checksum* | T3/T4: agent-measured MD5 | abc123... |
    | agent_reported_lines* | T3/T4: agent-measured line count | 143 |
    | agent_reported_words* | T3/T4: agent-measured word count | 564 |
    | agent_reported_code_blocks* | T3/T4: agent-measured code block count | 2 |
    | agent_reported_table_rows* | T3/T4: agent-measured table row count | 57 |
    | agent_reported_headers* | T3/T4: agent-measured header count | 4 |
    | verified_file_size_bytes* | T3/T4: verifier-measured file size in bytes | 4817 |
    | verified_md5_checksum* | T3/T4: verifier-measured MD5 | d6ad8451d3778bf3544574... |
    | verified_total_lines* | T3/T4: verifier-measured line count | 143 |
    | verified_total_words* | T3/T4: verifier-measured word count | 564 |
    | verified_tokens* | T3/T4: verifier-measured token count | 197 |
    | verified_chars_per_token* | T3/T4: verifier-measured chars/token ratio | 4.43 |
    | verified_code_blocks* | T3/T4: verifier-measured code block count | 2 |
    | verified_table_rows* | T3/T4: verifier-measured table row count | 57 |
    | verified_headers* | T3/T4: verifier-measured header count | 4 |

    *Optional fields, raw tracks only. agent_reported_* fields may reflect tool output or payload estimates; verify.py calculates verified_* values against raw_output_{test_id}.txt files.
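
    For reference, verifier-style metrics can also be reproduced by hand from a saved raw output file. The following is a minimal illustrative sketch, not the actual verify.py implementation; the ~4 chars/token estimate and the Markdown-based counting rules are assumptions:

    # verify_sketch.py -- illustrative approximation, not the actual verify.py
    import hashlib
    import re
    import sys
    from pathlib import Path

    path = Path(sys.argv[1])                # e.g. raw_output_BL-1.txt
    data = path.read_bytes()
    text = data.decode("utf-8", errors="replace")
    lines = text.splitlines()

    metrics = {
        "file_size_bytes": len(data),
        "md5_checksum": hashlib.md5(data).hexdigest(),
        "total_lines": len(lines),
        "total_words": len(text.split()),
        "tokens": len(text) // 4,                    # rough ~4 chars/token assumption
        "code_blocks": text.count("```") // 2,       # pairs of fenced-block markers
        "table_rows": sum(1 for l in lines if l.lstrip().startswith("|")),
        "headers": sum(1 for l in lines if re.match(r"#{1,6}\s", l)),
    }
    print(metrics)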


Baseline Testing Path

  1. Run T1 to establish surface-isolated behavioral baseline
  2. Run T2 to isolate workspace effect against T1
  3. Run T3 for ground truth retrieval measurements, verify T1
  4. Run T4 to isolate surface effect on raw retrieval, verify T2
  5. Run each test a minimum of 5 times per track to capture variance

| Test IDs | Purpose | Key Question |
|----------|---------|--------------|
| BL-1, BL-2 | Baseline truncation threshold on small pages | What is the T1 vs T2 surface delta? |
| SC-2 | Code blocks, API documentation | How does the web toolchain handle code structure? |
| OP-1 | Fragment identifier navigation | Does Codex jump to a specific section via URL fragment? |
| OP-4 | Auto-chunking above the BL-3 ceiling | Does the agent fetch with multi-step tool chains? |
| BL-3 | Hard ceiling | What is the absolute output limit across retrieval runs? |
| SC-1, SC-3, SC-4 | Structured content | Does truncation respect Markdown boundaries? |
| EC-1, EC-3, EC-6 | Edge cases | What are the failure modes and workspace substitution edge behaviors? |

Rename raw output files between runs to capture variance across runs; if results are consistent, remove the files to prevent test contamination between runs
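
A run-index suffix is one workable naming convention. The sketch below assumes that convention and the raw_output_{test_id}.txt naming mentioned in the logging step; neither is enforced by the framework:

# rename_raw_output.py -- illustrative; the _runN suffix is an assumed convention
import sys
from pathlib import Path

test_id, run = sys.argv[1], sys.argv[2]            # e.g. BL-1 3
src = Path(f"raw_output_{test_id}.txt")
dst = src.with_name(f"raw_output_{test_id}_run{run}.txt")
src.rename(dst)
print(f"{src} -> {dst}")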


Analyzing Results

Examine hypothesis matching, surface and workspace effects, the perception gap (agent-reported versus verified metrics), and truncation patterns:

# Single track: full analysis or summary
python analyze.py --csv results/codex_interpreted/results.csv --summary
python analyze.py --csv results/codex_raw/results.csv --full

# Filter by track
python analyze.py --csv results/codex_interpreted/results.csv --track t1_codex_interpreted

# Compare interpreted tracks, T1 vs T2, isolates workspace effect
python analyze.py \
   --csv results/codex_interpreted/results.csv \
         results/vscode_codex_interpreted/results.csv --full

# Compare raw tracks, T3 vs T4, isolates surface effect on retrieval ceiling
python analyze.py \
   --csv results/codex_raw/results.csv \
         results/vscode_codex_raw/results.csv --full

# Compare all four tracks
python analyze.py \
   --csv results/codex_interpreted/results.csv \
         results/vscode_codex_interpreted/results.csv \
         results/codex_raw/results.csv \
         results/vscode_codex_raw/results.csv --full

Provide the full relative path, including the subdirectory, e.g. `results/codex_interpreted/results.csv`
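
For an ad-hoc look outside analyze.py, the CSVs can also be inspected directly with the standard library. A minimal sketch, assuming the column names from the framework fields table:

# quick_look.py -- ad-hoc CSV inspection outside analyze.py
import csv
from collections import Counter
from pathlib import Path

with Path("results/codex_interpreted/results.csv").open(newline="") as f:
    rows = list(csv.DictReader(f))

runs_per_track = Counter(r["track"] for r in rows)
truncation = Counter((r["track"], r.get("truncated", "")) for r in rows)

print("runs per track:", dict(runs_per_track))
print("truncation by (track, truncated):", dict(truncation))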


Generating Summary Templates

Generate pre-structured Markdown summary templates to fill in after each test series:

# Single test, single track
python template.py --test BL-1 --track codex_raw

# All four tracks for a single test
python template.py --test BL-1 --all-tracks

# All tests for a single track
python template.py --track codex_raw --all-tests

# All 48 combinations
python template.py --all-tests --all-tracks

# Regenerate a template after changes to `TEST_URLS` or `TRACKS`
python template.py --test BL-1 --track codex_raw --overwrite

# Preview without writing a file
python template.py --test BL-1 --track codex_raw --preview

Templates are written to summaries/{track}/{test_id}_summary.md. Each template pre-populates the test conditions table, a run results table with track-appropriate columns, H1–H5 hypothesis sections with verdict placeholders, an emergent findings scaffold, and a log label summary table. Verdict reasoning, emergent findings prose, and log labels are left as <!-- TODO --> placeholders for human completion.