Cursor Framework Reference
This framework generates standardized test prompts and logs results to CSV, enabling consistent testing across cases, measurement tracking over time, truncation pattern identification, and fetch method comparisons.
Requirements: Python 3.8+, Cursor IDE
Installation
# Clone and/or navigate to `agent-ecosystem-testing` directory
cd agent-ecosystem-testing
# Create virtual environment
python3 -m venv venv
# Activate virtual environment
# Windows: venv\Scripts\activate
source venv/bin/activate
# Install dependencies
pip install -r requirements.txt
# Navigate to the Cursor testing directory
cd cursor-web-fetch
If the environment breaks for any reason, such as an incompatible Python version or accidental corruption,
use `rm -rf venv` to remove the `venv` and start over.
Workflow
List Available Tests
python web_fetch_testing_framework.py --list-tests
Generate Test Prompt for a Single Test
Print a formatted test harness with a structured prompt to copy into the Cursor chat window, fields requiring values, and expected size reference:
# Cursor-interpreted track - ask model to report measurements
python web_fetch_testing_framework.py --test BL-1 --track interpreted
# Raw track - request verbatim output
python web_fetch_testing_framework.py --test BL-2 --track raw
Copy Prompt → Run in Cursor
- Review the Terminal output → copy the prompt
- Open Cursor chat window → paste the prompt
- Inspect Cursor’s fetch behavior → examine the agent output
Assess Hypotheses
Before logging test results, assess the run against each hypothesis based on the model’s self-reported metrics and tool visibility output:
| ID | Description | Question |
|---|---|---|
| H1 | Character-based truncation at fixed limit | Is there a ceiling at ~10–100 KB? |
| H2 | Token-based truncation | Is there a ceiling at ~2,000 tokens? |
| H3 | Structure-aware truncation | Does truncation fall on Markdown boundaries rather than arbitrary byte positions? |
| H4* | `@Web` invocation | Does `@Web` impact web fetch behavior? |
| H5 | Agentic auto-chunking | Does the agent fetch chunks automatically, or only when reasoned into it? |

* `@Web` may route to `mcp_web_fetch`; the mechanism is the agent’s choice, not user-controllable; H4 is not testable through `@Web` alone. Visit the Friction Note for analysis.
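As a rough illustration of how H1 to H3 could be checked against a single run, the sketch below labels a run from its character count, token estimate, and (optionally) the saved output text. The function names, the 10% slack around the "~10–100 KB" and "~2,000 tokens" figures, the one-character-per-byte simplification, and the comma-joined label format are all assumptions made for this example, not part of the framework. A single run can only be consistent with a hypothesis; it cannot confirm a fixed ceiling.

```python
FENCE = "`" * 3  # Markdown code fence marker

# Assumed bands: the hypotheses only say "~10-100 KB" and "~2,000 tokens",
# so 10% slack is applied here purely for illustration (1 char ~ 1 byte).
H1_CHAR_BAND = (9_000, 110_000)
H2_TOKEN_BAND = (1_800, 2_200)


def ends_on_markdown_boundary(text: str) -> bool:
    """Heuristic for H3: the output ends at a line break and no code fence is left open."""
    fence_lines = sum(1 for ln in text.splitlines() if ln.lstrip().startswith(FENCE))
    has_open_fence = fence_lines % 2 == 1
    return text.endswith("\n") and not has_open_fence


def assess_run(output_chars: int, tokens_est: int, truncated: bool, text: str = "") -> str:
    """Return a label string such as 'H1-yes,H2-no,H3-no' for one run."""
    h1 = truncated and H1_CHAR_BAND[0] <= output_chars <= H1_CHAR_BAND[1]
    h2 = truncated and H2_TOKEN_BAND[0] <= tokens_est <= H2_TOKEN_BAND[1]
    # H3 needs the saved output text; without it the answer defaults to "no".
    h3 = truncated and bool(text) and ends_on_markdown_boundary(text)
    results = (("H1", h1), ("H2", h2), ("H3", h3))
    return ",".join(f"{name}-{'yes' if flag else 'no'}" for name, flag in results)


print(assess_run(9876, 2469, True))  # H1-yes,H2-no,H3-no
```

H4 and H5 are left out on purpose: they depend on tool visibility and agent behavior that cannot be derived from size metrics alone.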
Log Results
Store results in `cursor-web-fetch/results/{track}/results.csv` with the following fields:

| Column | Description | Example |
|---|---|---|
| `test_id` | Test identifier | `BL-1`, `SC-2`, `EC-1` |
| `timestamp` | ISO 8601 timestamp | `2026-03-16T17:05:02.998376` |
| `date` | Date tested | `2026-03-16` |
| `url` | Full URL tested | `https://www.mongodb.com/docs...` |
| `method` | Fetch method | `@Web`* |
| `model`*** | Model used | `Auto` - Cursor’s agent router |
| `input_est_chars` | Expected input size | `87040` |
| `output_chars` | Character count via `wc -m` | `27890` |
| `truncated` | Truncation detected | `yes`/`no` |
| `truncation_char_num` | Character position if truncated | `5857` |
| `tokens_est` | Token estimation or count via `tiktoken` | `16890` |
| `hypothesis_match` | Hypothesis matched | `H1-no`, `H2-yes`, `H3-yes` |
| `notes` | Observations and findings | `Pro-plan retry: successfully...` |
| `track` | Test track | `interpreted`/`raw` |
| `cursor_version` | Cursor IDE version | `2.6.19`, `2.6.19-pro` |
| `file_size_bytes`** | File size calculation `ls -l` | `28158` |
| `md5_checksum`** | MD5 of saved output file | `d542d945f2b5dc15c5254d...` |
| `total_lines`** | Line count | `979` |
| `total_words`** | Word count | `4871` |
| `code_blocks`** | Fenced code block count | `24` |
| `table_rows`** | Table row count | `87` |
| `headers`** | Header count | `63` |

* `@Web` is a Cursor UI composer feature; the underlying mechanisms are `WebFetch` and/or `mcp_web_fetch`. More in the Friction Note.
** Optional field, measurement for raw track results only
*** Cursor’s `Auto` setting doesn’t disclose the specific model used

# Log interpreted track result
python web_fetch_testing_framework.py --log BL-1 \
  --track interpreted \
  --method @Web \
  --model "Auto" \
  --cursor-version "2.6.19" \
  --output-chars 48500 \
  --truncated no \
  --tokens 12000 \
  --hypothesis "H1-no" \
  --notes "Full content returned, no truncation observed..."

# Verify key metrics before logging raw track runs
python web_fetch_verify_raw_results.py BL-1

# Log raw track result
python web_fetch_testing_framework.py --log BL-1 \
  --track raw \
  --method @Web \
  --model "Auto" \
  --cursor-version "2.6.19" \
  --output-chars 9876 \
  --truncated yes \
  --truncation-point 9876 \
  --tokens 2469 \
  --hypothesis "H1-yes" \
  --file-size-bytes 4817 \
  --md5-checksum "d6ad8451d3778bf3544574431203a3a7" \
  --total-lines 143 \
  --total-words 564 \
  --code-blocks 2 \
  --table-rows 57 \
  --headers 4 \
  --notes "@Web returns converted..."

Provide all required flags:
`--method`, `--model`, `--cursor-version`, `--output-chars`, `--truncated`, `--tokens`, `--hypothesis`
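To make the shape of a logged row concrete, here is a minimal sketch of appending one result to a CSV with the columns listed above. It is not the framework's own logging code: `append_result`, the `REQUIRED` list, and the mapping of `--tokens` and `--hypothesis` onto the `tokens_est` and `hypothesis_match` columns are assumptions made for illustration.

```python
import csv
from datetime import datetime
from pathlib import Path

# Column names taken from the field table; the order is assumed.
COLUMNS = [
    "test_id", "timestamp", "date", "url", "method", "model",
    "input_est_chars", "output_chars", "truncated", "truncation_char_num",
    "tokens_est", "hypothesis_match", "notes", "track", "cursor_version",
    "file_size_bytes", "md5_checksum", "total_lines", "total_words",
    "code_blocks", "table_rows", "headers",
]

# Assumed column equivalents of the required command-line flags.
REQUIRED = [
    "method", "model", "cursor_version", "output_chars",
    "truncated", "tokens_est", "hypothesis_match",
]


def append_result(csv_path, **fields):
    """Append one row, filling in timestamp and date; hypothetical helper."""
    missing = [name for name in REQUIRED if fields.get(name) in (None, "")]
    if missing:
        raise ValueError(f"missing required fields: {', '.join(missing)}")

    now = datetime.now()
    # isoformat() yields e.g. 2026-03-16T17:05:02.998376 (ISO 8601).
    row = {"timestamp": now.isoformat(), "date": now.date().isoformat(), **fields}

    path = Path(csv_path)
    path.parent.mkdir(parents=True, exist_ok=True)
    is_new = not path.exists() or path.stat().st_size == 0
    with path.open("a", newline="", encoding="utf-8") as f:
        # Columns not supplied (such as raw-only fields) are written as empty strings.
        writer = csv.DictWriter(f, fieldnames=COLUMNS, restval="")
        if is_new:
            writer.writeheader()
        writer.writerow(row)
    return row
```

A call such as `append_result("results/raw/results.csv", test_id="BL-1", track="raw", method="@Web", model="Auto", cursor_version="2.6.19", output_chars=9876, truncated="yes", tokens_est=2469, hypothesis_match="H1-yes")` would write the header on first use and one row per call after that.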
Rename raw output files to capture variance; if results are consistent, remove files to prevent test contamination between runs
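The raw-track measurements can be approximated from a saved output file with the standard library alone. The sketch below is an assumption about how such metrics might be computed, not the logic of `web_fetch_verify_raw_results.py`; `raw_metrics` is a hypothetical name, and counts may differ slightly from `wc` (for example, `wc -l` counts newline characters, while this counts lines). The token figure uses a rough four-characters-per-token heuristic, which is also an assumption; it happens to match the raw-track example above (9876 characters, 2469 tokens), but `tiktoken` would give an actual count.

```python
import hashlib
import re
from pathlib import Path

FENCE = "`" * 3  # Markdown code fence marker


def raw_metrics(path):
    """Compute size and structure metrics for one saved raw output file."""
    data = Path(path).read_bytes()
    text = data.decode("utf-8", errors="replace")
    lines = text.splitlines()

    code_blocks = table_rows = headers = 0
    in_fence = False
    for line in lines:
        stripped = line.lstrip()
        if stripped.startswith(FENCE):
            if not in_fence:
                code_blocks += 1  # count opening fences, even if never closed
            in_fence = not in_fence
        elif not in_fence:
            if stripped.startswith("|"):
                table_rows += 1  # includes header and separator rows
            elif re.match(r"#{1,6}\s", stripped):
                headers += 1  # ATX-style headers only

    return {
        "output_chars": len(text),
        "file_size_bytes": len(data),
        "md5_checksum": hashlib.md5(data).hexdigest(),
        "total_lines": len(lines),
        "total_words": len(text.split()),
        "code_blocks": code_blocks,
        "table_rows": table_rows,
        "headers": headers,
        "tokens_est": len(text) // 4,  # assumed heuristic, not a tokenizer count
    }
```

Comparing `md5_checksum` across renamed output files is a quick way to tell whether repeated runs returned identical content before deciding which files to remove.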
Baseline Testing Path
- Run interpreted track to identify baseline behavioral patterns
- Run raw track for ground truth measurements, verify interpreted baseline
- Run each test ID a minimum of 5 times/track to capture variance:
| Test IDs | Purpose | Key Question |
|---|---|---|
| BL-1, BL-2 | Baseline truncation threshold on small pages | What is the interpreted vs raw delta? |
| SC-2 | Code blocks, HTML-to-Markdown conversion | How does Cursor handle code structure? |
| OP-3 | `@Web` vs MCP | Do MCP servers have different limits?* |
| OP-4 | Auto-pagination hypothesis | Does Cursor auto-chunk content? |
| BL-3 | Hard ceiling | What is the absolute output limit across runs? |
| SC-1, SC-3, SC-4 | Structured content | Does truncation respect Markdown boundaries? |
| EC-1, EC-3, EC-6 | Edge cases | What are the failure modes and approval-gating edge behaviors? |

* `OP-3` is not executable as designed: `@Web` may route to `mcp_web_fetch`, so the two “sides” of the comparison aren’t separable through `@Web` alone; read more in the Friction Note.
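Since each test ID is meant to be run at least five times per track, a per-test spread of `output_chars` is a natural first check. The sketch below reads a results CSV and summarizes that spread; `summarize_variance` is a hypothetical helper written for this example and relies only on the `test_id` and `output_chars` columns described earlier.

```python
import csv
import statistics
from collections import defaultdict

MIN_RUNS = 5  # minimum runs per test ID and track, per the testing path


def summarize_variance(csv_path):
    """Group output_chars by test_id and report the spread across runs."""
    sizes = defaultdict(list)
    with open(csv_path, newline="", encoding="utf-8") as f:
        for row in csv.DictReader(f):
            if row.get("output_chars"):  # skip rows without a measurement
                sizes[row["test_id"]].append(int(row["output_chars"]))

    summary = {}
    for test_id, values in sizes.items():
        summary[test_id] = {
            "runs": len(values),
            "min": min(values),
            "max": max(values),
            "mean": statistics.mean(values),
            # stdev needs at least two data points
            "stdev": statistics.stdev(values) if len(values) > 1 else 0.0,
            "enough_runs": len(values) >= MIN_RUNS,
        }
    return summary
```

A standard deviation near zero on truncated runs would be consistent with a fixed ceiling, while a wide spread suggests the limit depends on something other than raw size.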
Analyzing Results
Review hypothesis matching, track comparisons, and truncation analysis:
# Generate full analysis report
python web_fetch_results_analyzer.py --csv results.csv --full
# Generate summary
python web_fetch_results_analyzer.py --csv results.csv --summary
# Analyze specific methods
python web_fetch_results_analyzer.py --csv results.csv --method "@Web"
Provide the full relative path, including the subdirectory:
`results/cursor-interpreted/results.csv` or `results/raw/results.csv`
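For a quick look at hypothesis matching without the analyzer, the labels in the `hypothesis_match` column can be tallied directly. This is a minimal sketch and not how `web_fetch_results_analyzer.py` works internally; `tally_hypotheses` is a hypothetical name, and the comma-separated label format (`H1-no,H2-yes,...`) is assumed from the example values in the field table.

```python
import csv
from collections import Counter


def tally_hypotheses(csv_path, method=None):
    """Count labels such as 'H1-yes' across rows, optionally for one fetch method."""
    counts = Counter()
    with open(csv_path, newline="", encoding="utf-8") as f:
        for row in csv.DictReader(f):
            if method is not None and row.get("method") != method:
                continue
            # A cell may hold several labels, e.g. "H1-no,H2-yes,H3-yes".
            for label in (row.get("hypothesis_match") or "").split(","):
                label = label.strip()
                if label:
                    counts[label] += 1
    return counts
```

Calling `tally_hypotheses("results/raw/results.csv", method="@Web")` returns a `Counter`, so `counts["H1-yes"]` and `counts["H1-no"]` can be compared directly.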