Key Findings for Gemini’s URL Context Tool, Gemini-interpreted

Test Workflow

Call the Gemini API with the URL context tool enabled
Give Gemini a detailed prompt asking it to describe what it retrieved:
content length, structure, completeness, any failures
Gemini fetches each URL via its pre-retrieval step, then generates a response
reflecting on what it received
Capture Gemini’s full text response as the interpreted finding
Capture url_context_metadata and usage_metadata for cross-referencing
against the raw track
The gap between Gemini’s self-report and the raw metadata is itself a finding,
and discrepancies belong in the spec
Results stored in gemini-url-context/results/gemini-interpreted/

Results Summary

Test	URLs Req	URLs OK	Tokens R1/R2/R3	Result
`test_1_single_html`	1	1	3,142 3,151 3,141	Consistent across runs
`test_2_single_pdf`	1	0	147 151 156	`URL_RETRIEVAL_STATUS_ERROR`
`test_3_multi_url_5`	5	5	27,564 27,579 27,572	Consistent across runs
`test_4_multi_url_20`	20	20	111,401 111,714 111,375	Consistent across runs
`test_5_multi_url_21`	21	0	null	`400 INVALID_ARGUMENT`
`test_6_unsupported_youtube`	1	1	1,288 1,291 1,570	Succeeded - docs say unsupported. Token variance in run 3
`test_7_unsupported_google_doc`	1	0	null 181 192	`429` and `URL_RETRIEVAL_STATUS_ERROR`, run 1 rate-limited, runs 2-3 confirm
`test_8_json_content`	1	0	null 133 null	`429` and r2: `URL_RETRIEVAL_STATUS_ERROR`, run 1, 3 daily quota exhausted

gemini-2.5-flash, free tier daily cap 20 RPD reached after test_7 run 3

Truncation Analysis

Finding	Tests	Observation	Spec Detail
20-URL limit is a hard limit	`test_5`	`400 INVALID_ARGUMENT`: `"Number of urls to lookup exceeds<br>the limit (21 > 20)"`. `url_context_metadata` empty, zero URL content tokens consumed.	API layer enforces limit before retrieval, not truncation or silent dropping
YouTube succeeds despite documented as unsupported	`test_6`	`URL_RETRIEVAL_STATUS_SUCCESS`; tool tokens: 1,288 / 1,291 / 1,570. Run 3 returned ~22% more tokens, suggesting live-fetch vs. cache variation.	Docs don’t reflect current behavior on `gemini-2.5-flash` as of March 2026; token variance suggests cache vs. live-fetch switching
PDF retrieval failed consistently on a valid public PDF	`test_2`	`URL_RETRIEVAL_STATUS_ERROR`; tool tokens: 147 / 151 / 156 - minimal, consistent error response. PDF is a documented supported type.	PDF retrieval fails reliably for this W3C URL; follow-up needed with a different source before drawing a firm conclusion
Google Docs fail at retrieval layer, not API layer	`test_7` runs 2 & 3	`URL_RETRIEVAL_STATUS_ERROR` with tool tokens 181 / 192. Request completes normally, no API-level error. Contrasts with `test_5`, API layer rejected.	Failure modes: API-layer rejection; hard error, zero retrieval vs. retrieval-layer failure, request completes, status in metadata
JSON API endpoint failed retrieval	`test_8` r2 only	`URL_RETRIEVAL_STATUS_ERROR`, 133 tool tokens. JSON is a documented supported type. GitHub API requires auth headers tool can’t supply; run 1, run 3 hit quota before test.	JSON support applies to public, unauthenticated endpoints. Endpoints requiring auth headers return `URL_RETRIEVAL_STATUS_ERROR`
Tool tokens dominate cost at scale	`test_1` `test_3` `test_4`	20 URLs, tool tokens ~111,400 = ~98.6% of total cost; tool token counts vary <1% between runs; see token scaling table below.	Use `tool_use_prompt_token_count` for cost est; stable, reproducible, accounts for ~98.6% of cost
`url_context_metadata` order non-deterministic	`test_3` `test_4`	Metadata order shuffled relative to input order on every run; pattern itself varied between runs	Match results by `retrieved_url` string, not array index
Gemini’s char count estimates vary within, across runs	`test_1` `test_3`	`test_1` same URL: 10,950 / 17,476 / 15,221 chars; `test_3` same URL in multi-URL context: ~11,360 / ~20,400 / ~11,500. Variance large, non-directional.	Interpreted character counts not citable; `tool_use_prompt_token_count` variance <1%, only reproducible proxy for content size
Free tier imposes per-minute, per-day limits	`test_7` `test_8`	Run 1 `429` errors: `GenerateRequestsPerMinutePerProjectPerModel-FreeTier` limit: 5 RPM; run 3 `GenerateRequestsPerDayPerProjectPerModel-FreeTier` limit: 20 RPD; 3 runs × ~7 tests exhausted daily quota.	Free tier: 5 RPM, 20 RPD on `gemini-2.5-flash`; running interpreted, raw tracks same day exhaust the daily limit; plan across days or use a paid tier
Duplicate response in test 3 was non-reproducible	`test_3`	Run 1 produced full results table twice in sequence; runs 2-3 didn’t; non-deterministic model output artifact.	Treat as known non-determinism; raw metadata authoritative source for retrieval counts, statuses

Token Scaling Details

Test	URLs	Tool Tokens	Prompt Tokens	% Tool
`test_1_single_html`	1	~3,145 avg	65	~86.7%
`test_3_multi_url_5`	5	~27,572 avg	137	~96.6%
`test_4_multi_url_20`	20	~111,497 avg	417	~98.6%