Copilot Framework Reference
This framework generates standardized test prompts and logs results to CSV, enabling consistent testing across cases, measurement tracking over time, identification of truncation patterns, and comparison of web content retrieval methods.
Requirements: Python 3.8+, VS Code GitHub Copilot Extension
Installation
# Clone and/or navigate to `agent-ecosystem-testing` directory
cd agent-ecosystem-testing
# Create virtual environment
python3 -m venv venv
# Activate virtual environment
# Windows: venv\Scripts\activate
source venv/bin/activate
# Install dependencies
pip install -r requirements.txt
# Navigate to the Copilot testing directory
cd copilot-web-content-retrieval
If the environment breaks for any reason, such as incompatible Python versions or accidental corruption,
use `rm -rf venv` to remove the `venv` and start over.
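Before running the framework, it can help to confirm that the interpreter meets the stated Python 3.8+ requirement and that the virtual environment is active. The following is a minimal sketch, not part of the framework; the helper names `python_is_supported` and `in_virtualenv` are hypothetical.

```python
import sys

MIN_PYTHON = (3, 8)  # taken from the stated requirement: Python 3.8+


def python_is_supported(version_info=None, minimum=MIN_PYTHON):
    """Return True when the interpreter meets the minimum major.minor version."""
    if version_info is None:
        version_info = sys.version_info
    return tuple(version_info[:2]) >= tuple(minimum)


def in_virtualenv():
    """Return True when running inside an environment created by `python3 -m venv`."""
    # Inside a venv, sys.prefix points at the venv while sys.base_prefix
    # points at the interpreter the venv was created from.
    return sys.prefix != getattr(sys, "base_prefix", sys.prefix)


print("Python supported:", python_is_supported())
print("Virtual environment active:", in_virtualenv())
```

If the second line prints `False`, the activation step was probably skipped in the current shell.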
Workflow
List Available Tests
python web_content_retrieval_testing_framework.py --list-tests
Generate Test Prompt for a Single Test
Print a formatted test harness with a structured prompt to copy into the Copilot chat window, fields requiring values, and expected size reference:
# Copilot-interpreted track - ask model to report measurements
python web_content_retrieval_testing_framework.py --test BL-1 --track interpreted
# Raw track - request verbatim output
python web_content_retrieval_testing_framework.py --test BL-1 --track raw
Copy Prompt → Run in Copilot
- Review the terminal output → copy the prompt
- Open Copilot chat window → paste the prompt
- Inspect Copilot’s web content retrieval behavior → examine the agent’s output
Assess Hypotheses
Before logging test results, assess the run against each hypothesis based on the model’s self-reported metrics and tool visibility output:
| ID | Description | Question |
|---|---|---|
| H1 | Character-based truncation at fixed limit | Is there a ceiling at ~10–100 KB? |
| H2 | Token-based truncation | Is there a ceiling at ~2,000 tokens? |
| H3 | Structure-aware truncation | Does truncation fall on Markdown boundaries rather than arbitrary byte positions? |
| H4\* | MCP servers impact | Do MCP servers override native `vscode-chat` limits? |
| H5 | Agentic auto-chunking | Does the agent fetch chunks automatically, or only when reasoned into it? |

\* H4 is not testable through `vscode-chat` alone; read Friction: Interpreted for analysis.
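For H3, one rough way to judge whether a saved output stops on a Markdown boundary is to check for an unclosed code fence or a cut in the middle of a line. The sketch below is a heuristic under those assumptions only; `ends_on_markdown_boundary` and `h3_label` are hypothetical helpers, not framework functions, and the label strings simply mirror the `H3-yes` / `H3-no` style used in the hypothesis examples.

```python
FENCE = "`" * 3  # Markdown code fence marker


def ends_on_markdown_boundary(text: str) -> bool:
    """Heuristic: True when the text appears to stop at a structural boundary."""
    if not text:
        return True
    # An odd number of fence lines means the cut landed inside a code block.
    fence_lines = [ln for ln in text.splitlines() if ln.lstrip().startswith(FENCE)]
    if len(fence_lines) % 2 == 1:
        return False
    # A cut in the middle of a line (no trailing newline) is treated as an
    # arbitrary byte position rather than a structural boundary.
    return text.endswith("\n")


def h3_label(text: str) -> str:
    """Map the heuristic onto a hypothesis label for the log."""
    return "H3-yes" if ends_on_markdown_boundary(text) else "H3-no"


print(h3_label("# Title\n\nA complete paragraph.\n"))
print(h3_label("# Title\n\nA paragraph cut mid-sen"))
```

A passing check only suggests structure-aware truncation; a page that happens to end on a newline at a byte limit would look the same, so repeated runs on different pages are still needed.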
Log Results
Depending on the track, store results in `copilot-web-content-retrieval/results/{track}/results.csv` with the following fields:

| Column | Description | Example |
|---|---|---|
| `test_id` | Test identifier | `BL-1`, `SC-2`, `EC-1` |
| `timestamp` | `ISO 8601` format | `2026-03-16T17:05:02.998376` |
| `date` | Date tested | `2026-03-16` |
| `url` | Full URL tested | `https://www.mongodb.com/docs...` |
| `method` | Retrieval method | `vscode-chat`\* |
| `model_selector` | Model selector setting | `Auto` |
| `model_observed` | Model invoked by `Auto` | `Claude Haiku 4.5`, `GPT-5.3-Codex` |
| `input_est_chars` | Expected input size in characters | `87040` |
| `hypothesis_match` | Hypothesis success/failure | `H1-no`, `H2-yes`, `H3-partial` |
| `copilot_version` | Copilot extension version | `0.40.1`, `0.41.1-pro` |
| `notes` | Observations | `Pro-plan retry: successfully...` |
| `output_chars` | Interpreted: Copilot-measured output length | `27890` |
| `truncated` | Interpreted: truncation detected | `yes` / `no` |
| `truncation_char_num` | Interpreted: character position if truncated | `5857` |
| `tokens_est` | Interpreted: estimated token count | `16890` |
| `tools_used`\*\* | Raw: requested tool chain | `fetch_webpage -> pylanceRunCodeSnippet` |
| `tools_blocked`\*\* | Raw: tools requested but blocked/skipped | `curl`, `terminal execution` |
| `execution_attempts`\*\* | Raw: total tool calls including fallbacks | `3` |
| `copilot_reported_output_chars`\*\* | Raw: Copilot-measured output character count | `9876` |
| `copilot_reported_truncated`\*\* | Raw: Copilot-measured truncation status | `yes` / `no` |
| `copilot_reported_truncation_point`\*\* | Raw: Copilot-measured truncation position | `9876` |
| `copilot_reported_tokens_est`\*\* | Raw: Copilot-estimated token count | `2469` |
| `copilot_reported_file_size_bytes`\*\* | Raw: Copilot-measured file size in bytes | `4817` |
| `copilot_reported_md5_checksum`\*\* | Raw: Copilot-measured MD5 checksum | `abc123...` |
| `copilot_reported_lines`\*\* | Raw: Copilot-measured line count | `143` |
| `copilot_reported_words`\*\* | Raw: Copilot-measured word count | `564` |
| `copilot_reported_code_blocks`\*\* | Raw: Copilot-measured code block count | `2` |
| `copilot_reported_table_rows`\*\* | Raw: Copilot-measured table row count | `57` |
| `copilot_reported_headers`\*\* | Raw: Copilot-measured header count | `4` |
| `verified_file_size_bytes`\*\* | Raw: Verifier-measured file size in bytes | `4817` |
| `verified_md5_checksum`\*\* | Raw: Verifier-measured MD5 checksum | `d6ad8451d3778bf3544574...` |
| `verified_total_lines`\*\* | Raw: Verifier-measured line count | `143` |
| `verified_total_words`\*\* | Raw: Verifier-measured word count | `564` |
| `verified_tokens`\*\* | Raw: Verifier-measured token count | `197` |
| `verified_chars_per_token`\*\* | Raw: Verifier-measured chars/token ratio | `4.43` |
| `verified_code_blocks`\*\* | Raw: Verifier-measured code block count | `2` |
| `verified_table_rows`\*\* | Raw: Verifier-measured table row count | `57` |
| `verified_headers`\*\* | Raw: Verifier-measured header count | `4` |

\* `vscode-chat` describes an intentionally manual process: the user copy-pastes prompts into the Copilot chat window; Copilot has no documented backend web content retrieval mechanism; see the Friction Note for analysis.

\*\* Optional field, raw track only.
`copilot_reported` fields may reflect execution tool output or payload estimates; the `web_content_retrieval_verify_raw_results.py` script calculates values against saved `raw_output_{test_id}.txt` files.

# Log interpreted track result
python web_content_retrieval_testing_framework.py --log BL-1 \
  --track interpreted \
  --method vscode-chat \
  --model_selector Auto \
  --model_observed "Raptor mini (Preview)" \
  --copilot_version "0.40.1-pro" \
  --output_chars 48500 \
  --truncated no \
  --tokens 12000 \
  --hypothesis "H1-no" \
  --notes "Full content returned, no truncation observed..."

Quotation marks are only required when a value, such as `"Raptor mini (Preview)"`, contains spaces or special characters that the shell would otherwise split or misinterpret.
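To make the interpreted-track columns concrete, the sketch below builds one row and writes it with Python's `csv` module. It is an illustration only, assuming a subset of the documented columns; `INTERPRETED_FIELDS`, `build_row`, and `append_row` are hypothetical names, and the framework's own `--log` command remains the supported way to record results.

```python
import csv
import io
from datetime import datetime

# Assumed subset of the documented columns; the real script may use more
# columns or a different order.
INTERPRETED_FIELDS = [
    "test_id", "timestamp", "date", "method", "model_selector",
    "model_observed", "copilot_version", "output_chars", "truncated",
    "tokens_est", "hypothesis_match", "notes",
]


def build_row(test_id, now=None, **values):
    """Return a dict with every known column, filling timestamp and date."""
    now = now or datetime.now()
    row = {name: "" for name in INTERPRETED_FIELDS}
    row.update(values)
    row["test_id"] = test_id
    row["timestamp"] = now.isoformat()      # ISO 8601, as in the column table
    row["date"] = now.date().isoformat()
    return row


def append_row(handle, row, write_header=False):
    """Write one row; DictWriter raises ValueError on unknown column names."""
    writer = csv.DictWriter(handle, fieldnames=INTERPRETED_FIELDS)
    if write_header:
        writer.writeheader()
    writer.writerow(row)


# Demonstration against an in-memory buffer instead of a real results.csv.
buffer = io.StringIO()
row = build_row(
    "BL-1",
    now=datetime(2026, 3, 16, 17, 5, 2, 998376),
    method="vscode-chat",
    model_selector="Auto",
    model_observed="Raptor mini (Preview)",
    copilot_version="0.40.1-pro",
    output_chars=48500,
    truncated="no",
    tokens_est=12000,
    hypothesis_match="H1-no",
    notes="Full content returned, no truncation observed...",
)
append_row(buffer, row, write_header=True)
print(buffer.getvalue())
```

Quoting is handled by the `csv` module here, so values containing commas or spaces stay in a single field.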
# Verify key metrics before logging raw track runs
python web_content_retrieval_verify_raw_results.py BL-1

# Log raw track result
python web_content_retrieval_testing_framework.py --log BL-1 \
  --track raw \
  --method vscode-chat \
  --model_selector Auto \
  --model_observed "Raptor mini (Preview)" \
  --copilot_version "0.40.1-pro" \
  --copilot_reported_output_chars 9876 \
  --copilot_reported_truncated yes \
  --copilot_reported_truncation_point 9876 \
  --copilot_reported_tokens_est 2469 \
  --copilot_reported_file_size_bytes 4817 \
  --copilot_reported_md5_checksum abc123 \
  --copilot_reported_lines 143 \
  --copilot_reported_words 564 \
  --copilot_reported_code_blocks 2 \
  --copilot_reported_table_rows 57 \
  --copilot_reported_headers 4 \
  --tools_used "fetch_webpage -> pylanceRunCodeSnippet" \
  --tools_blocked "terminal execution" \
  --execution_attempts 2 \
  --verified_file_size_bytes 4817 \
  --verified_md5_checksum d6ad8451d3778bf3544574431203a3a7 \
  --verified_total_lines 143 \
  --verified_total_words 564 \
  --verified_tokens 197 \
  --verified_chars_per_token 4.43 \
  --verified_code_blocks 2 \
  --verified_table_rows 57 \
  --verified_headers 4 \
  --hypothesis "H1-yes" \
  --notes "vscode-chat returns converted..."

Be sure to provide all required flags:
--method, --model, --copilot-version, --output-chars, --truncated, --tokens, --hypothesis
Rename raw output files between runs to capture variance; if results are consistent,
remove the files to prevent test contamination between runs.
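The `verified_*` columns are computed from saved raw output files. As a rough illustration of what such measurements involve, the sketch below derives several of them from raw bytes. It is not the `web_content_retrieval_verify_raw_results.py` script; `measure_raw_output` is a hypothetical helper, and the counting rules (for example, counting table separator rows as rows) are assumptions that may differ from the verifier's.

```python
import hashlib
import re

FENCE = "`" * 3  # Markdown code fence marker


def measure_raw_output(data: bytes) -> dict:
    """Compute size, checksum, and simple structural counts for raw output."""
    text = data.decode("utf-8", errors="replace")
    lines = text.splitlines()
    in_fence = False
    code_blocks = table_rows = headers = 0
    for line in lines:
        stripped = line.lstrip()
        if stripped.startswith(FENCE):
            if not in_fence:
                code_blocks += 1      # count each opened block, even if unclosed
            in_fence = not in_fence
            continue
        if in_fence:
            continue                  # ignore '#' and '|' inside code blocks
        if stripped.startswith("|"):
            table_rows += 1           # assumption: separator rows are counted too
        elif re.match(r"#{1,6}\s", line):
            headers += 1
    return {
        "file_size_bytes": len(data),
        "md5_checksum": hashlib.md5(data).hexdigest(),
        "total_lines": len(lines),
        "total_words": len(text.split()),
        "code_blocks": code_blocks,
        "table_rows": table_rows,
        "headers": headers,
    }


sample = (
    "# Title\n\n| a | b |\n|---|---|\n| 1 | 2 |\n\n"
    + FENCE + "python\n# not a header\nprint(1)\n" + FENCE + "\n"
).encode("utf-8")
print(measure_raw_output(sample))
```

To apply the same idea to a saved file, read it in binary mode, for example `measure_raw_output(open(path, "rb").read())`, so that the byte size and checksum reflect the file exactly as stored.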
Baseline Testing Path
- Run interpreted track to identify baseline behavioral patterns
- Run raw track for ground truth measurements, verify interpreted baseline
- Run each test ID a minimum of 5 times/track to capture variance:
| Test IDs | Purpose | Key Question |
|---|---|---|
| BL-1, BL-2 | Baseline truncation threshold on small pages | What is the interpreted vs raw delta? |
| SC-2 | Code blocks, HTML-to-Markdown conversion | How does `fetch_webpage` handle code structure? |
| OP-4 | Auto-chunking hypothesis | Does Copilot chunk automatically, or is this a key ecosystem gap? |
| BL-3 | Hard ceiling | What is the absolute output limit across model families? |
| SC-1, SC-3, SC-4 | Structured content | Does truncation respect Markdown boundaries? |
| EC-1, EC-3, EC-6 | Edge cases | What are the failure modes and unusual inputs? |
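Running each test ID at least 5 times per track is meant to expose run-to-run variance. The sketch below shows one way to summarize that spread from logged rows, assuming rows are dictionaries keyed by the documented column names; `summarize_runs` and `MIN_RUNS` are hypothetical and not part of the analyzer script.

```python
import statistics
from collections import defaultdict

MIN_RUNS = 5  # from the guidance: a minimum of 5 runs per test ID and track


def summarize_runs(rows, field="output_chars"):
    """Group numeric values by test_id and report count, mean, and spread."""
    by_test = defaultdict(list)
    for row in rows:
        value = str(row.get(field, "")).strip()
        if value:                       # skip rows where the field is empty
            by_test[row["test_id"]].append(float(value))
    summary = {}
    for test_id, values in by_test.items():
        summary[test_id] = {
            "runs": len(values),
            "mean": statistics.mean(values),
            # Sample standard deviation needs at least two data points.
            "stdev": statistics.stdev(values) if len(values) > 1 else 0.0,
            "enough_runs": len(values) >= MIN_RUNS,
        }
    return summary


# Illustrative rows only; these numbers are made up for the demonstration.
demo_rows = [{"test_id": "BL-1", "output_chars": "100"} for _ in range(5)]
demo_rows += [
    {"test_id": "SC-2", "output_chars": "10"},
    {"test_id": "SC-2", "output_chars": "20"},
]
for test_id, stats in summarize_runs(demo_rows).items():
    print(test_id, stats)
```

A standard deviation near zero across five or more runs suggests the raw output files for that test can be removed rather than kept as variants.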
Analyzing Results
Examine hypothesis matching, track comparison, and truncation analysis:
# Generate full analysis report
python web_content_retrieval_results_analyzer.py --csv results.csv --full
# Generate summary
python web_content_retrieval_results_analyzer.py --csv results.csv --summary
# Analyze specific methods
python web_content_retrieval_results_analyzer.py --csv results.csv --method "vscode-chat"
# Compare interpreted and raw results
python web_content_retrieval_results_analyzer.py \
--csv results/copilot-interpreted/results.csv results/raw/results.csv --full
Provide the full relative path, including the subdirectory:
`results/copilot-interpreted/results.csv` or `results/raw/results.csv`
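The interpreted-versus-raw comparison can also be approximated by hand. The sketch below averages a numeric column per `test_id` in each CSV and reports the difference; it assumes the `output_chars` and `copilot_reported_output_chars` columns described for the two tracks, uses in-memory CSV text with made-up values, and is not the analyzer's actual logic. `mean_by_test` and `track_delta` are hypothetical names.

```python
import csv
import io


def mean_by_test(handle, field):
    """Return {test_id: mean of `field`} for rows where the field is set."""
    totals = {}
    for row in csv.DictReader(handle):
        value = (row.get(field) or "").strip()
        if not value:
            continue
        total, count = totals.get(row["test_id"], (0.0, 0))
        totals[row["test_id"]] = (total + float(value), count + 1)
    return {test_id: total / count for test_id, (total, count) in totals.items()}


def track_delta(interpreted, raw):
    """Interpreted mean minus raw mean for test IDs present in both tracks."""
    shared = sorted(interpreted.keys() & raw.keys())
    return {test_id: interpreted[test_id] - raw[test_id] for test_id in shared}


# Made-up sample data standing in for the two results.csv files.
interpreted_csv = io.StringIO("test_id,output_chars\nBL-1,48500\nBL-1,47500\n")
raw_csv = io.StringIO("test_id,copilot_reported_output_chars\nBL-1,9876\n")

interpreted_means = mean_by_test(interpreted_csv, "output_chars")
raw_means = mean_by_test(raw_csv, "copilot_reported_output_chars")
print(track_delta(interpreted_means, raw_means))
```

With real files, the `io.StringIO` objects would be replaced by `open(path, newline="")` handles pointing at the two paths given above.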