Copilot Framework Reference
This framework generates standardized test prompts and logs CSV results, enabling consistent testing across cases, measurement tracking over time, truncation pattern identification, and comparison of web content retrieval methods.
Requirements: Python 3.8+, VS Code GitHub Copilot Extension
Topic Guide
Installation
# Clone and/or navigate to `agent-ecosystem-testing` directory
cd agent-ecosystem-testing
# Create virtual environment
python3 -m venv venv
# Activate virtual environment
# Windows: venv\Scripts\activate
source venv/bin/activate
# Install dependencies
pip install -r requirements.txt
# Navigate to the Copilot testing directory
cd copilot-web-content-retrieval
If the environment breaks for any reason, such as an incompatible Python version or accidental corruption, remove it with `rm -rf venv` and start over.
Workflow
1. List Available Tests

   ```
   python web_content_retrieval_testing_framework.py --list-tests
   ```
2. Generate Test Prompt for a Single Test

   Print a formatted test harness with a structured prompt to copy into the Copilot chat window, fields requiring values, and an expected size reference:

   ```
   # Copilot-interpreted track - ask the model to report measurements
   python web_content_retrieval_testing_framework.py --test BL-1 --track interpreted

   # Raw track - request verbatim output
   python web_content_retrieval_testing_framework.py --test BL-1 --track raw
   ```
3. Copy Prompt → Run in Copilot
- Review the Terminal output → copy the prompt
- Open Copilot chat window → paste the prompt
- Review Copilot’s web content retrieval behavior → examine the response
4. Log Results

Depending on the track, results are stored in `copilot-web-content-retrieval/results/{track}/results.csv` with the following fields:

| Column | Description | Example |
| --- | --- | --- |
| test_id | Test identifier | BL-1, SC-2, EC-1 |
| timestamp | ISO 8601 format | 2026-03-16T17:05:02.998376 |
| date | Date tested | 2026-03-16 |
| url | Full URL tested | https://www.mongodb.com/docs... |
| method | Retrieval method | vscode-chat* |
| model_selector | UI model selector setting | Auto |
| model_observed | Backend model invoked by Auto | Claude Haiku 4.5, GPT-5.3-Codex |
| input_est_chars | Expected input size in characters | 87040 |
| hypothesis_match | Hypothesis success/failure | H1-no, H2-yes, H3-partial |
| copilot_version | Copilot extension version | 0.40.1, 0.41.1-pro |
| notes | Observations, findings | Pro-plan retry: successfully... |
| output_chars | Interpreted track: Copilot-measured output length | 27890 |
| truncated | Interpreted track: truncation detected | yes/no |
| truncation_char_num | Interpreted track: character position if truncated | 5857 |
| tokens_est | Interpreted track: estimated token count | 16890 |
| tools_used** | Raw track: requested tool chain | fetch_webpage -> pylanceRunCodeSnippet |
| tools_blocked** | Raw track: tools requested but blocked or skipped | curl (default), terminal execution |
| execution_attempts** | Raw track: total tool calls including fallbacks | 3 |
| copilot_reported_output_chars** | Raw track: Copilot-measured output character count | 9876 |
| copilot_reported_truncated** | Raw track: Copilot-measured truncation status | yes/no |
| copilot_reported_truncation_point** | Raw track: Copilot-measured truncation position | 9876 |
| copilot_reported_tokens_est** | Raw track: Copilot-estimated token count | 2469 |
| copilot_reported_file_size_bytes** | Raw track: Copilot-measured file size in bytes | 4817 |
| copilot_reported_md5_checksum** | Raw track: Copilot-measured MD5 checksum | abc123... |
| copilot_reported_lines** | Raw track: Copilot-measured line count | 143 |
| copilot_reported_words** | Raw track: Copilot-measured word count | 564 |
| copilot_reported_code_blocks** | Raw track: Copilot-measured code block count | 2 |
| copilot_reported_table_rows** | Raw track: Copilot-measured table row count | 57 |
| copilot_reported_headers** | Raw track: Copilot-measured header count | 4 |
| verified_file_size_bytes** | Raw track: Verifier-measured file size in bytes | 4817 |
| verified_md5_checksum** | Raw track: Verifier-measured MD5 checksum | d6ad8451d3778bf3544574... |
| verified_total_lines** | Raw track: Verifier-measured line count | 143 |
| verified_total_words** | Raw track: Verifier-measured word count | 564 |
| verified_tokens** | Raw track: Verifier-measured token count | 197 |
| verified_chars_per_token** | Raw track: Verifier-measured chars/token ratio | 4.43 |
| verified_code_blocks** | Raw track: Verifier-measured code block count | 2 |
| verified_table_rows** | Raw track: Verifier-measured table row count | 57 |
| verified_headers** | Raw track: Verifier-measured header count | 4 |

\* `vscode-chat` describes an intentionally manual testing process in which the user copy-pastes prompts into the Copilot chat window; Copilot has no documented backend web content retrieval mechanism; analysis in the Friction Note.

\*\* Optional field, raw track only. `copilot_reported` fields capture values measured by Copilot and may reflect execution tool output or payload estimates; the `verify_raw_results` script calculates `verified` fields against saved output files.
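If you ever need to inspect or append rows outside the framework CLI, the log format can be handled with the standard `csv` module. A minimal sketch (only a subset of columns shown; the framework's actual writer may differ):

```python
import csv
from datetime import datetime, timezone
from pathlib import Path

def append_result(csv_path, row):
    """Append one result row, writing a header line if the file is new."""
    path = Path(csv_path)
    path.parent.mkdir(parents=True, exist_ok=True)
    is_new = not path.exists()
    with path.open("a", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=list(row))
        if is_new:
            writer.writeheader()
        writer.writerow(row)

append_result("results/copilot-interpreted/results.csv", {
    "test_id": "BL-1",
    "timestamp": datetime.now(timezone.utc).isoformat(),
    "method": "vscode-chat",
    "model_selector": "Auto",
    "output_chars": 48500,
    "truncated": "no",
})
```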
Key Hypotheses:

- H1: Character-based truncation at a fixed limit, ~10-100KB?
- H2: Token-based truncation, ~2000 tokens?
- H3: Structure-aware truncation; respects Markdown boundaries
- H4: MCP servers override native `vscode-chat` limits*
- H5: Agent auto-chunks after truncation, requests next chunk automatically

\* H4 is not testable through `vscode-chat` alone; analysis in the Friction Note.
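H1 and H2 can be distinguished from logged runs by checking whether truncation points cluster more tightly in characters or in estimated tokens. A rough sketch, assuming rows already parsed to numbers and field names matching the CSV schema above:

```python
from statistics import mean, stdev

def truncation_clustering(runs):
    """Compare the relative spread (coefficient of variation) of
    truncation points measured in characters vs estimated tokens.
    The dimension with the tighter clustering is the more likely
    fixed limit (H1 = characters, H2 = tokens)."""
    truncated = [r for r in runs if r.get("truncated") == "yes"]
    if len(truncated) < 2:
        return None  # not enough truncated runs to compare spread
    cv = lambda xs: stdev(xs) / mean(xs)
    return {
        "cv_chars": cv([r["truncation_char_num"] for r in truncated]),
        "cv_tokens": cv([r["tokens_est"] for r in truncated]),
    }
```

This is only a heuristic: if Auto routes different runs to different models, the two tracks should be analyzed per model family before drawing conclusions.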
# Log interpreted track result
python web_content_retrieval_testing_framework.py --log BL-1 \
--track interpreted \
--method vscode-chat \
--model_selector Auto \
  --model_observed "Raptor mini (Preview)" \
--copilot_version 0.40.1 \
--output_chars 48500 \
--truncated no \
--tokens 12000 \
--hypothesis "H1-no" \
--notes "Full content returned, no truncation observed..."
Note: quotes are only required when the value contains spaces or special characters that the shell would otherwise split or misinterpret.
# Verify key metrics before logging raw track runs
python web_content_retrieval_verify_raw_results.py BL-1
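The verified fields are straightforward to reproduce by hand as a sanity check on a saved output file. A sketch of the core measurements (the actual verifier script may compute them differently; the chars-per-token divisor is an assumption):

```python
import hashlib
import re
from pathlib import Path

def verify_output_file(path, chars_per_token=4.0):
    """Recompute verifier-style metrics for a saved raw output file."""
    data = Path(path).read_bytes()
    text = data.decode("utf-8", errors="replace")
    return {
        "verified_file_size_bytes": len(data),
        "verified_md5_checksum": hashlib.md5(data).hexdigest(),
        "verified_total_lines": len(text.splitlines()),
        "verified_total_words": len(text.split()),
        "verified_tokens": round(len(text) / chars_per_token),
        # A fenced block has an opening and a closing fence.
        "verified_code_blocks": text.count("```") // 2,
        "verified_headers": len(re.findall(r"^#{1,6} ", text, flags=re.M)),
    }
```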
# Log raw track result
python web_content_retrieval_testing_framework.py --log BL-1 \
--track raw \
--method vscode-chat \
--model_selector Auto \
--model_observed "Raptor mini (Preview)" \
--copilot_version 0.41.1 \
--copilot_reported_output_chars 9876 \
--copilot_reported_truncated yes \
--copilot_reported_truncation_point 9876 \
--copilot_reported_tokens_est 2469 \
--copilot_reported_file_size_bytes 4817 \
--copilot_reported_md5_checksum abc123 \
--copilot_reported_lines 143 \
--copilot_reported_words 564 \
--copilot_reported_code_blocks 2 \
--copilot_reported_table_rows 57 \
--copilot_reported_headers 4 \
--tools_used "fetch_webpage -> pylanceRunCodeSnippet" \
--tools_blocked "terminal execution" \
--execution_attempts 2 \
--verified_file_size_bytes 4817 \
--verified_md5_checksum d6ad8451d3778bf3544574431203a3a7 \
--verified_total_lines 143 \
--verified_total_words 564 \
--verified_tokens 197 \
--verified_chars_per_token 4.43 \
--verified_code_blocks 2 \
--verified_table_rows 57 \
--verified_headers 4 \
--hypothesis "H1-yes" \
--notes "vscode-chat returns converted..."
Be sure to provide all required flags:
`--method`, `--model_selector`, `--copilot_version`, `--output_chars`, `--truncated`, `--tokens`, `--hypothesis`
Raw track only: consider renaming raw output text files between runs to capture variance; once results are consistent, remove the files from the project to prevent cross-run test contamination.
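A timestamped rename is one way to keep successive raw outputs distinct; a sketch (the `BL-1.md` filename pattern is an assumption, not the framework's documented convention):

```python
from datetime import datetime, timezone
from pathlib import Path

def timestamp_rename(path):
    """Rename e.g. BL-1.md to BL-1_20260316T170502Z.md so repeated
    runs of the same test do not overwrite each other."""
    p = Path(path)
    stamp = datetime.now(timezone.utc).strftime("%Y%m%dT%H%M%SZ")
    return p.rename(p.with_name(f"{p.stem}_{stamp}{p.suffix}"))
```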
Baseline Testing Path
Complete the interpreted track first to establish behavioral observations, then run
the raw track for exact measurements. Run each test ID a minimum of 5 times to capture
variance. Auto routing selects different models across runs, and output size can vary
2–6x on identical prompts. Run both tracks for each test ID:
- BL-1, BL-2 - baseline truncation threshold on small pages
- SC-2 - code blocks; HTML-to-Markdown conversion behavior
- OP-4 - auto-chunking hypothesis; establishes key ecosystem testing gap
- BL-3 - hard ceiling; identify absolute output limit across model families
- SC-1, SC-3, SC-4 - structured content; structure-aware truncation hypothesis
- EC-1, EC-3, EC-6 - edge cases; failure modes and unusual inputs
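With at least 5 runs per test ID, the per-test variance is easy to summarize. A sketch of the min/max spread check, assuming rows loaded from the results CSV as dicts:

```python
from collections import defaultdict

def output_spread(rows):
    """Group runs by test_id and report min/max output size and the
    max/min ratio, to quantify the 2-6x variance on identical prompts."""
    by_test = defaultdict(list)
    for r in rows:
        by_test[r["test_id"]].append(int(r["output_chars"]))
    return {
        tid: {"runs": len(xs), "min": min(xs), "max": max(xs),
              "spread": max(xs) / min(xs) if min(xs) else None}
        for tid, xs in by_test.items()
    }
```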
While the interpreted track captures Copilot’s self-report and perceived completeness, the raw track provides ground truth measurements for validation. Cross-referencing reveals where Copilot’s self-assessment diverges from reality. Comprehensive truncation pattern analysis requires both datasets.
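One way to surface that divergence is a field-by-field diff between the `copilot_reported_*` and `verified_*` columns of a raw-track row. A sketch (column pairs follow the schema above; extend the list as needed):

```python
def report_divergence(row, pairs=(
    ("copilot_reported_file_size_bytes", "verified_file_size_bytes"),
    ("copilot_reported_md5_checksum", "verified_md5_checksum"),
    ("copilot_reported_lines", "verified_total_lines"),
    ("copilot_reported_words", "verified_total_words"),
)):
    """Return the fields where Copilot's self-report disagrees with
    the verifier's ground-truth measurement, as {field: (reported, verified)}."""
    diffs = {}
    for reported, verified in pairs:
        if row.get(reported) not in (None, "") and row.get(reported) != row.get(verified):
            diffs[reported] = (row[reported], row[verified])
    return diffs
```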
Analyzing Results
Examine truncation thresholds, method comparisons, interpreted vs. raw track comparisons, and hypothesis matching:
# Generate full analysis report
python web_content_retrieval_results_analyzer.py --csv results.csv --full
# Generate summary
python web_content_retrieval_results_analyzer.py --csv results.csv --summary
# Analyze specific methods
python web_content_retrieval_results_analyzer.py --csv results.csv --method "vscode-chat"
# Compare interpreted and raw results
python web_content_retrieval_results_analyzer.py \
--csv results/copilot-interpreted/results.csv results/raw/results.csv --full
Provide the full relative path to the CSV file when running the analyzer, including the subdirectory: `results/copilot-interpreted/results.csv` or `results/raw/results.csv`.