Agent Ecosystem Testing

Copilot Framework Reference

This framework generates standardized test prompts and logs results to CSV, enabling consistent testing across cases, measurement tracking over time, truncation pattern identification, and comparison of web content retrieval methods.
Requirements: Python 3.8+, GitHub Copilot extension for VS Code


Installation

# Clone and/or navigate to `agent-ecosystem-testing` directory
cd agent-ecosystem-testing

# Create virtual environment
python3 -m venv venv

# Activate virtual environment
# Windows: venv\Scripts\activate
source venv/bin/activate

# Install dependencies
pip install -r requirements.txt

# Navigate to the Copilot testing directory
cd copilot-web-content-retrieval

If the virtual environment breaks for any reason, such as an incompatible Python version or accidental corruption,
run rm -rf venv to remove it, then start over
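
Before recreating the environment, it can help to confirm which interpreter is active. The following is a minimal sketch, not part of the framework: `python_is_supported` and `in_virtualenv` are hypothetical helper names, and the 3.8 minimum is taken from the Requirements line.

```python
import sys

MIN_VERSION = (3, 8)  # minimum taken from the Requirements line


def python_is_supported(version_info=sys.version_info, minimum=MIN_VERSION):
    """Return True when the (major, minor) version meets the minimum."""
    return tuple(version_info[:2]) >= minimum


def in_virtualenv():
    """Return True when running inside a venv-style virtual environment."""
    return sys.prefix != sys.base_prefix


if __name__ == "__main__":
    version = f"{sys.version_info.major}.{sys.version_info.minor}"
    print(f"Python {version} supported: {python_is_supported()}")
    print(f"Virtual environment active: {in_virtualenv()}")
```

Running this with the venv activated should report both values as True; if either is False, recreating the environment is a reasonable next step.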

Workflow

  1. List Available Tests

    python web_content_retrieval_testing_framework.py --list-tests
    
  2. Generate Test Prompt for a Single Test

    Print a formatted test harness with a structured prompt to copy into the Copilot chat window, fields requiring values, and expected size reference:

    # Copilot-interpreted track - ask model to report measurements
    python web_content_retrieval_testing_framework.py --test BL-1 --track interpreted
    
    # Raw track - request verbatim output
    python web_content_retrieval_testing_framework.py --test BL-1 --track raw
    
  3. Copy Prompt → Run in Copilot

    • Review the terminal output → copy the prompt
    • Open Copilot chat window → paste the prompt
    • Inspect Copilot’s web content retrieval behavior → examine the agent’s output

  4. Assess Hypotheses

    Before logging test results, assess the run against each hypothesis based on the model’s self-reported metrics and tool visibility output:

    | ID  | Description                              | Question                                                                          |
    |-----|------------------------------------------|-----------------------------------------------------------------------------------|
    | H1  | Character-based truncation at fixed limit | Is there a ceiling at ~10–100 KB?                                                 |
    | H2  | Token-based truncation                   | Is there a ceiling at ~2,000 tokens?                                              |
    | H3  | Structure-aware truncation               | Does truncation fall on Markdown boundaries rather than arbitrary byte positions? |
    | H4* | MCP servers impact                       | Do MCP servers override native vscode-chat limits?                                |
    | H5  | Agentic auto-chunking                    | Does the agent fetch chunks automatically, or only when reasoned into it?         |

    *H4 is not testable through vscode-chat alone; see Friction: Interpreted for analysis.
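
The per-hypothesis assessment ends up in a single `hypothesis_match` value such as `H1-no, H2-yes, H3-partial`. As a rough sketch of how such a value could be built and checked before logging, assuming the only outcomes are `yes`, `no`, and `partial` (the ones that appear in the example values), the hypothetical helpers below convert between that string and a dictionary. They are illustrative and not part of the framework.

```python
# Outcomes seen in the example values; the framework may accept others.
VALID_OUTCOMES = {"yes", "no", "partial"}


def parse_hypothesis_match(value):
    """Parse 'H1-no, H2-yes' into {'H1': 'no', 'H2': 'yes'}."""
    result = {}
    for item in value.split(","):
        item = item.strip()
        if not item:
            continue
        hypothesis_id, separator, outcome = item.partition("-")
        if not separator or outcome not in VALID_OUTCOMES:
            raise ValueError(f"unrecognized hypothesis entry: {item!r}")
        result[hypothesis_id] = outcome
    return result


def format_hypothesis_match(assessments):
    """Join {'H1': 'no', 'H2': 'yes'} back into 'H1-no, H2-yes'."""
    return ", ".join(f"{hid}-{outcome}" for hid, outcome in assessments.items())


if __name__ == "__main__":
    print(parse_hypothesis_match("H1-no, H2-yes, H3-partial"))
```

Validating the string this way catches typos such as `H1-maybe` before they reach the CSV.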

  5. Log Results

    Depending on the track, store results in copilot-web-content-retrieval/results/{track}/results.csv with the following fields:

    | Column | Description | Example |
    |--------|-------------|---------|
    | test_id | Test identifier | BL-1, SC-2, EC-1 |
    | timestamp | ISO 8601 format | 2026-03-16T17:05:02.998376 |
    | date | Date tested | 2026-03-16 |
    | url | Full URL tested | https://www.mongodb.com/docs... |
    | method | Retrieval method | vscode-chat* |
    | model_selector | Model selector setting | Auto |
    | model_observed | Model invoked by Auto | Claude Haiku 4.5, GPT-5.3-Codex |
    | input_est_chars | Expected input size in characters | 87040 |
    | hypothesis_match | Hypothesis success/failure | H1-no, H2-yes, H3-partial |
    | copilot_version | Copilot extension version | 0.40.1, 0.41.1-pro |
    | notes | Observations | Pro-plan retry: successfully... |
    | output_chars | Interpreted: Copilot-measured output length | 27890 |
    | truncated | Interpreted: truncation detected | yes/no |
    | truncation_char_num | Interpreted: character position if truncated | 5857 |
    | tokens_est | Interpreted: estimated token count | 16890 |
    | tools_used** | Raw: requested tool chain | fetch_webpage -> pylanceRunCodeSnippet |
    | tools_blocked** | Raw: tools requested but blocked/skipped | curl, terminal execution |
    | execution_attempts** | Raw: total tool calls including fallbacks | 3 |
    | copilot_reported_output_chars** | Raw: Copilot-measured output character count | 9876 |
    | copilot_reported_truncated** | Raw: Copilot-measured truncation status | yes/no |
    | copilot_reported_truncation_point** | Raw: Copilot-measured truncation position | 9876 |
    | copilot_reported_tokens_est** | Raw: Copilot-estimated token count | 2469 |
    | copilot_reported_file_size_bytes** | Raw: Copilot-measured file size in bytes | 4817 |
    | copilot_reported_md5_checksum** | Raw: Copilot-measured MD5 checksum | abc123... |
    | copilot_reported_lines** | Raw: Copilot-measured line count | 143 |
    | copilot_reported_words** | Raw: Copilot-measured word count | 564 |
    | copilot_reported_code_blocks** | Raw: Copilot-measured code block count | 2 |
    | copilot_reported_table_rows** | Raw: Copilot-measured table row count | 57 |
    | copilot_reported_headers** | Raw: Copilot-measured header count | 4 |
    | verified_file_size_bytes** | Raw: Verifier-measured file size in bytes | 4817 |
    | verified_md5_checksum** | Raw: Verifier-measured MD5 checksum | d6ad8451d3778bf3544574... |
    | verified_total_lines** | Raw: Verifier-measured line count | 143 |
    | verified_total_words** | Raw: Verifier-measured word count | 564 |
    | verified_tokens** | Raw: Verifier-measured token count | 197 |
    | verified_chars_per_token** | Raw: Verifier-measured chars/token ratio | 4.43 |
    | verified_code_blocks** | Raw: Verifier-measured code block count | 2 |
    | verified_table_rows** | Raw: Verifier-measured table row count | 57 |
    | verified_headers** | Raw: Verifier-measured header count | 4 |

    *vscode-chat describes an intentionally manual process: the user copies and pastes prompts into the Copilot chat window. Copilot has no documented backend web content retrieval mechanism; see the Friction Note for analysis.

    **Optional field, raw track only. The copilot_reported fields may reflect execution tool output or payload estimates; the web_content_retrieval_verify_raw_results.py script calculates the verified values from the saved raw_output_{test_id}.txt files.
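
To illustrate what the verified fields measure, here is a hedged sketch of how such metrics could be computed from saved raw output. It is not the logic of web_content_retrieval_verify_raw_results.py, which may count things differently. `measure_raw_output` is a hypothetical name, and the counting rules are assumptions: fenced code blocks, table rows as lines starting with `|` (separator rows included), and ATX-style `#` headers. Token counts are left out because the tokenizer is not specified here.

```python
import hashlib
import re

FENCE = "`" * 3  # Markdown code fence marker


def measure_raw_output(text):
    """Compute size, checksum, and simple Markdown structure counts."""
    data = text.encode("utf-8")
    lines = text.splitlines()
    in_fence = False
    code_blocks = table_rows = headers = 0
    for line in lines:
        stripped = line.lstrip()
        if stripped.startswith(FENCE):
            if not in_fence:
                code_blocks += 1  # count each opening fence once
            in_fence = not in_fence
        elif in_fence:
            continue  # ignore '#' and '|' inside code blocks
        elif stripped.startswith("|"):
            table_rows += 1
        elif re.match(r"#{1,6}\s", stripped):
            headers += 1
    return {
        "file_size_bytes": len(data),
        "md5_checksum": hashlib.md5(data).hexdigest(),
        "total_lines": len(lines),
        "total_words": len(text.split()),
        "code_blocks": code_blocks,
        "table_rows": table_rows,
        "headers": headers,
    }
```

Comparing values like these with the copilot_reported fields shows where the model's self-reported numbers drift from the saved file.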

    # Log interpreted track result
    python web_content_retrieval_testing_framework.py --log BL-1 \
    --track interpreted \
    --method vscode-chat \
    --model_selector Auto \
    --model_observed "Raptor mini (Preview)" \
    --copilot_version "0.40.1-pro" \
    --output_chars 48500 \
    --truncated no \
    --tokens 12000 \
    --hypothesis "H1-no" \
    --notes "Full content returned, no truncation observed..."
    

    Quotation marks are required only when a value, such as the --model_observed value above, contains spaces or special characters that the shell would otherwise split or misinterpret
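
The effect of quoting can be previewed with Python's standard `shlex` module, which applies POSIX-style word splitting. This is only an approximation of shell behavior: a real shell also treats unquoted parentheses as syntax, which `shlex.split` does not model.

```python
import shlex

# Quoted: the model name stays together as one argument.
quoted = shlex.split('--model_observed "Raptor mini (Preview)" --truncated no')

# Unquoted: the model name is split into three separate arguments.
unquoted = shlex.split("--model_observed Raptor mini (Preview) --truncated no")

if __name__ == "__main__":
    print(quoted)
    print(unquoted)
```

In the unquoted case the argument parser receives `Raptor`, `mini`, and `(Preview)` as separate words, so the flag would capture only the first one.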

    # Verify key metrics before logging raw track runs
    python web_content_retrieval_verify_raw_results.py BL-1
    
    # Log raw track result
    python web_content_retrieval_testing_framework.py --log BL-1 \
    --track raw \
    --method vscode-chat \
    --model_selector Auto \
    --model_observed "Raptor mini (Preview)" \
    --copilot_version "0.40.1-pro" \
    --copilot_reported_output_chars 9876 \
    --copilot_reported_truncated yes \
    --copilot_reported_truncation_point 9876 \
    --copilot_reported_tokens_est 2469 \
    --copilot_reported_file_size_bytes 4817 \
    --copilot_reported_md5_checksum abc123 \
    --copilot_reported_lines 143 \
    --copilot_reported_words 564 \
    --copilot_reported_code_blocks 2 \
    --copilot_reported_table_rows 57 \
    --copilot_reported_headers 4 \
    --tools_used "fetch_webpage -> pylanceRunCodeSnippet" \
    --tools_blocked "terminal execution" \
    --execution_attempts 2 \
    --verified_file_size_bytes 4817 \
    --verified_md5_checksum d6ad8451d3778bf3544574431203a3a7 \
    --verified_total_lines 143 \
    --verified_total_words 564 \
    --verified_tokens 197 \
    --verified_chars_per_token 4.43 \
    --verified_code_blocks 2 \
    --verified_table_rows 57 \
    --verified_headers 4 \
    --hypothesis "H1-yes" \
    --notes "vscode-chat returns converted..."
    

    Be sure to provide all required flags: --method, --model_selector, --model_observed, --copilot_version,
    --output_chars, --truncated, --tokens, --hypothesis
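
For readers curious about what a logged row looks like on disk, the sketch below appends one result to a CSV file and fills in the `timestamp` and `date` columns in the formats shown in the field table. It is not the framework's own logging code. `append_result` is a hypothetical helper, and it assumes every call passes the same set of keys so the header stays consistent.

```python
import csv
from datetime import datetime
from pathlib import Path


def append_result(csv_path, row):
    """Append one result row, adding ISO 8601 timestamp and date fields."""
    path = Path(csv_path)
    path.parent.mkdir(parents=True, exist_ok=True)
    now = datetime.now()
    record = {"timestamp": now.isoformat(), "date": now.date().isoformat(), **row}
    # Write the header only when the file is new or empty.
    write_header = not path.exists() or path.stat().st_size == 0
    with path.open("a", newline="", encoding="utf-8") as handle:
        writer = csv.DictWriter(handle, fieldnames=list(record))
        if write_header:
            writer.writeheader()
        writer.writerow(record)
    return record
```

A call such as `append_result("results/interpreted/results.csv", {"test_id": "BL-1", "method": "vscode-chat"})` would create the file on first use and append to it afterwards.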


    Rename raw output files after each run to preserve run-to-run variance; if results are consistent,
    delete the files so that stale output does not contaminate later runs
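
One way to rename the files is sketched below. The `_runN` suffix and the `archive_raw_output` helper are illustrative assumptions, not a convention defined by the framework. Only the `raw_output_{test_id}.txt` base name comes from the text above.

```python
from pathlib import Path


def archive_raw_output(test_id, directory="."):
    """Rename raw_output_{test_id}.txt to the next free _runN name."""
    directory = Path(directory)
    source = directory / f"raw_output_{test_id}.txt"
    if not source.exists():
        raise FileNotFoundError(source)
    run = 1
    # Find the first run number that has not been used yet.
    while (directory / f"raw_output_{test_id}_run{run}.txt").exists():
        run += 1
    target = directory / f"raw_output_{test_id}_run{run}.txt"
    source.rename(target)
    return target
```

Calling `archive_raw_output("BL-1")` after each raw track run keeps earlier outputs for comparison and frees the base file name for the next run.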


Baseline Testing Path

  1. Run the interpreted track to identify baseline behavioral patterns
  2. Run the raw track for ground-truth measurements and to verify the interpreted baseline
  3. Run each test ID at least 5 times per track to capture variance:

| Test IDs | Purpose | Key Question |
|----------|---------|--------------|
| BL-1, BL-2 | Baseline truncation threshold on small pages | What is the interpreted vs raw delta? |
| SC-2 | Code blocks, HTML-to-Markdown conversion | How does fetch_webpage handle code structure? |
| OP-4 | Auto-chunking hypothesis | Does Copilot chunk automatically, or is this a key ecosystem gap? |
| BL-3 | Hard ceiling | What is the absolute output limit across model families? |
| SC-1, SC-3, SC-4 | Structured content | Does truncation respect Markdown boundaries? |
| EC-1, EC-3, EC-6 | Edge cases | What are the failure modes and unusual inputs? |
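
To keep track of the five-run minimum, a small script can count the rows logged per test ID. The sketch below assumes a results CSV with the `test_id` column described in the field table. `runs_per_test` and `under_sampled` are hypothetical helpers, not framework commands.

```python
import csv
from collections import Counter

MIN_RUNS = 5  # minimum runs per test ID per track, from the baseline path


def runs_per_test(csv_path):
    """Count logged rows for each test_id in one track's results file."""
    with open(csv_path, newline="", encoding="utf-8") as handle:
        return Counter(row["test_id"] for row in csv.DictReader(handle))


def under_sampled(counts, planned_ids, minimum=MIN_RUNS):
    """Return {test_id: run_count} for planned IDs below the minimum."""
    return {
        test_id: counts.get(test_id, 0)
        for test_id in planned_ids
        if counts.get(test_id, 0) < minimum
    }
```

Running `under_sampled(runs_per_test(path), ["BL-1", "BL-2", "SC-2"])` once per track lists the test IDs that still need more runs.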

Analyzing Results

Examine hypothesis matching, track comparison, and truncation analysis:

# Generate full analysis report
python web_content_retrieval_results_analyzer.py --csv results.csv --full

# Generate summary
python web_content_retrieval_results_analyzer.py --csv results.csv --summary

# Analyze specific methods
python web_content_retrieval_results_analyzer.py --csv results.csv --method "vscode-chat"

# Compare interpreted and raw results
python web_content_retrieval_results_analyzer.py \
        --csv results/copilot-interpreted/results.csv results/raw/results.csv --full

Provide the full relative path, including the subdirectory: results/copilot-interpreted/results.csv
or results/raw/results.csv
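
The interpreted vs raw delta can also be approximated by hand. The sketch below is independent of web_content_retrieval_results_analyzer.py and assumes the column names from the field table: `output_chars` for the interpreted track and `copilot_reported_output_chars` for the raw track. `mean_chars_by_test` and `track_delta` are hypothetical helpers.

```python
import csv
from collections import defaultdict
from statistics import mean


def mean_chars_by_test(csv_path, column):
    """Average an integer column per test_id, skipping blank values."""
    values = defaultdict(list)
    with open(csv_path, newline="", encoding="utf-8") as handle:
        for row in csv.DictReader(handle):
            raw = (row.get(column) or "").strip()
            if raw.isdigit():
                values[row["test_id"]].append(int(raw))
    return {test_id: mean(numbers) for test_id, numbers in values.items()}


def track_delta(interpreted_csv, raw_csv):
    """Return mean interpreted minus mean raw characters per shared test_id."""
    interpreted = mean_chars_by_test(interpreted_csv, "output_chars")
    raw = mean_chars_by_test(raw_csv, "copilot_reported_output_chars")
    shared = interpreted.keys() & raw.keys()
    return {test_id: interpreted[test_id] - raw[test_id] for test_id in shared}
```

A large positive delta for a test ID suggests the interpreted track reports more content than the raw track actually captured, which is worth checking against the saved raw output.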