Key Findings for Gemini’s URL Context Tool, Gemini-interpreted
Test Workflow
- Call the Gemini API with the URL context tool enabled
- Give Gemini a detailed prompt asking it to describe what it retrieved: content length, structure, completeness, any failures
- Gemini fetches each URL via its pre-retrieval step, then generates a response reflecting on what it received
- Capture Gemini’s full text response as the interpreted finding
- Capture `url_context_metadata` and `usage_metadata` for cross-referencing against the raw track
- The gap between Gemini’s self-report and the raw metadata is itself a finding, and discrepancies belong in the spec
- Results stored in `gemini-url-context/results/gemini-interpreted/`
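
The workflow above can be sketched in Python roughly as follows. This is a minimal sketch, not the harness that produced these results: it assumes the `google-genai` SDK is installed and an API key is available in the environment, and the prompt wording, the function names `build_prompt` and `run_interpreted_test`, and the returned dictionary layout are all illustrative.

```python
def build_prompt(urls):
    """Build the self-report prompt (illustrative wording, not the original)."""
    if not urls:
        raise ValueError("at least one URL is required")
    url_lines = "\n".join(f"- {u}" for u in urls)
    return (
        "Fetch each URL below, then describe what you retrieved: "
        "content length, structure, completeness, and any failures.\n"
        f"{url_lines}"
    )


def run_interpreted_test(urls, model="gemini-2.5-flash"):
    """Call Gemini with the URL context tool and collect the three artifacts."""
    # Imported lazily so build_prompt works without the SDK installed.
    from google import genai
    from google.genai import types

    client = genai.Client()  # assumes the API key is set in the environment
    response = client.models.generate_content(
        model=model,
        contents=build_prompt(urls),
        config=types.GenerateContentConfig(tools=[{"url_context": {}}]),
    )
    return {
        # Gemini's own description of what it fetched (the interpreted finding)
        "interpreted_text": response.text,
        # Per-URL retrieval statuses, for cross-referencing with the raw track
        "url_context_metadata": response.candidates[0].url_context_metadata,
        # Token counts, including the tool-use token count
        "usage_metadata": response.usage_metadata,
    }
```

Keeping the three artifacts side by side makes it straightforward to compare Gemini’s self-report with the metadata afterwards.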
Results Summary
| Test | URLs Req | URLs OK | Tokens R1 / R2 / R3 | Result |
|---|---|---|---|---|
| `test_1_single_html` | 1 | 1 | 3,142 / 3,151 / 3,141 | Consistent across runs |
| `test_2_single_pdf` | 1 | 0 | 147 / 151 / 156 | `URL_RETRIEVAL_STATUS_ERROR` |
| `test_3_multi_url_5` | 5 | 5 | 27,564 / 27,579 / 27,572 | Consistent across runs |
| `test_4_multi_url_20` | 20 | 20 | 111,401 / 111,714 / 111,375 | Consistent across runs |
| `test_5_multi_url_21` | 21 | 0 | null | `400 INVALID_ARGUMENT` |
| `test_6_unsupported_youtube` | 1 | 1 | 1,288 / 1,291 / 1,570 | Succeeded, although docs say unsupported. Token variance in run 3 |
| `test_7_unsupported_google_doc` | 1 | 0 | null / 181 / 192 | Run 1: 429 (rate-limited); runs 2-3: `URL_RETRIEVAL_STATUS_ERROR` |
| `test_8_json_content` | 1 | 0 | null / 133 / null | Runs 1 and 3: 429 (quota exhausted); run 2: `URL_RETRIEVAL_STATUS_ERROR` |

`gemini-2.5-flash`, free tier; daily cap of 20 RPD reached after `test_7` run 3
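
The "consistent across runs" and "token variance" labels can be made concrete by computing the relative spread of the per-run token counts. The sketch below uses the figures from the table; the `relative_spread` helper and the choice of `(max - min) / min` as the measure are illustrative assumptions, not part of the original test harness.

```python
def relative_spread(counts):
    """Return (max - min) / min over the runs that produced a token count."""
    observed = [c for c in counts if c is not None]
    if len(observed) < 2:
        return None  # not enough successful runs to compare
    return (max(observed) - min(observed)) / min(observed)


# Token counts per run (R1, R2, R3) copied from the results table;
# None marks a run that returned no token count.
runs = {
    "test_1_single_html": [3142, 3151, 3141],
    "test_4_multi_url_20": [111401, 111714, 111375],
    "test_6_unsupported_youtube": [1288, 1291, 1570],
    "test_8_json_content": [None, 133, None],
}

for name, counts in runs.items():
    spread = relative_spread(counts)
    label = "n/a" if spread is None else f"{spread:.1%}"
    print(f"{name}: {label}")
```

With these inputs, `test_1` and `test_4` come out at about 0.3%, while `test_6` comes out at about 21.9%, which is the run 3 variance flagged in the table.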
Truncation Analysis
| Finding | Tests | Observation | Spec Detail |
|---|---|---|---|
| 20-URL limit is a hard limit | `test_5` | `400 INVALID_ARGUMENT`: "Number of urls to lookup exceeds the limit (21 > 20)". `url_context_metadata` empty, zero URL content tokens consumed. | The API layer enforces the limit before retrieval; there is no truncation or silent dropping |
| YouTube succeeds despite being documented as unsupported | `test_6` | `URL_RETRIEVAL_STATUS_SUCCESS`; tool tokens: 1,288 / 1,291 / 1,570. Run 3 returned ~22% more tokens, suggesting live-fetch vs. cache variation. | Docs don’t reflect current behavior on `gemini-2.5-flash` as of March 2026; token variance suggests cache vs. live-fetch switching |
| PDF retrieval failed consistently on a valid public PDF | `test_2` | `URL_RETRIEVAL_STATUS_ERROR`; tool tokens: 147 / 151 / 156 (minimal, consistent error response). PDF is a documented supported type. | PDF retrieval fails reliably for this W3C URL; follow-up needed with a different source before drawing a firm conclusion |
| Google Docs fail at the retrieval layer, not the API layer | `test_7` runs 2 & 3 | `URL_RETRIEVAL_STATUS_ERROR` with tool tokens 181 / 192. The request completes normally with no API-level error. Contrasts with `test_5`, which the API layer rejected. | Two failure modes: API-layer rejection (hard error, zero retrieval) vs. retrieval-layer failure (request completes, status in metadata) |
| JSON API endpoint failed retrieval | `test_8` run 2 only | `URL_RETRIEVAL_STATUS_ERROR`, 133 tool tokens. JSON is a documented supported type. The GitHub API requires auth headers the tool can’t supply; runs 1 and 3 hit quota before the test. | JSON support applies to public, unauthenticated endpoints. Endpoints requiring auth headers return `URL_RETRIEVAL_STATUS_ERROR` |
| Tool tokens dominate cost at scale | `test_1`, `test_3`, `test_4` | At 20 URLs, tool tokens (~111,400) are ~98.6% of total cost; tool token counts vary <1% between runs; see token scaling table below. | Use `tool_use_prompt_token_count` for cost estimates: it is stable, reproducible, and accounts for ~98.6% of cost at 20 URLs |
| `url_context_metadata` order is non-deterministic | `test_3`, `test_4` | Metadata order was shuffled relative to input order on every run; the shuffle pattern itself varied between runs | Match results by `retrieved_url` string, not array index |
| Gemini’s character count estimates vary within and across runs | `test_1`, `test_3` | `test_1`, same URL: 10,950 / 17,476 / 15,221 chars; `test_3`, same URL in multi-URL context: ~11,360 / ~20,400 / ~11,500. Variance is large and non-directional. | Interpreted character counts are not citable; `tool_use_prompt_token_count` varies <1% and is the only reproducible proxy for content size |
| Free tier imposes per-minute and per-day limits | `test_7`, `test_8` | Run 1 429 errors: `GenerateRequestsPerMinutePerProjectPerModel-FreeTier`, limit 5 RPM; run 3: `GenerateRequestsPerDayPerProjectPerModel-FreeTier`, limit 20 RPD; 3 runs × ~7 tests exhausted the daily quota. | Free tier: 5 RPM, 20 RPD on `gemini-2.5-flash`; running the interpreted and raw tracks on the same day exhausts the daily limit; plan across days or use a paid tier |
| Duplicate response in test 3 was non-reproducible | `test_3` | Run 1 produced the full results table twice in sequence; runs 2-3 didn’t; this is a non-deterministic model output artifact. | Treat as known non-determinism; raw metadata is the authoritative source for retrieval counts and statuses |
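
Two of the spec details above, the 20-URL hard limit and matching by `retrieved_url` instead of array index, can be sketched as small helpers. This is a minimal sketch under stated assumptions: the metadata entries are assumed to have been converted to plain dicts with `retrieved_url` and `url_retrieval_status` keys, the helper names are hypothetical, and the lookup assumes the returned URL string matches the requested one exactly.

```python
MAX_URLS_PER_REQUEST = 20  # limit observed in test_5 (21 URLs -> 400 INVALID_ARGUMENT)


def chunk_urls(urls, size=MAX_URLS_PER_REQUEST):
    """Split a URL list into batches that stay within the per-request limit."""
    return [urls[i:i + size] for i in range(0, len(urls), size)]


def statuses_by_url(requested_urls, url_metadata):
    """Map each requested URL to its retrieval status, ignoring metadata order.

    url_metadata is assumed to be a list of dicts with 'retrieved_url' and
    'url_retrieval_status' keys. URLs absent from the metadata map to None.
    """
    by_url = {m["retrieved_url"]: m["url_retrieval_status"] for m in url_metadata}
    return {url: by_url.get(url) for url in requested_urls}


# Example: metadata arrives in a different order than the request.
requested = ["https://example.com/a", "https://example.com/b"]
metadata = [
    {"retrieved_url": "https://example.com/b",
     "url_retrieval_status": "URL_RETRIEVAL_STATUS_ERROR"},
    {"retrieved_url": "https://example.com/a",
     "url_retrieval_status": "URL_RETRIEVAL_STATUS_SUCCESS"},
]
print(statuses_by_url(requested, metadata))
```

A `None` status in the result flags a requested URL with no metadata entry, which is worth recording separately from an explicit `URL_RETRIEVAL_STATUS_ERROR`.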
Token Scaling Details
| Test | URLs | Tool Tokens | Prompt Tokens | % Tool |
|---|---|---|---|---|
| `test_1_single_html` | 1 | ~3,145 avg | 65 | ~86.7% |
| `test_3_multi_url_5` | 5 | ~27,572 avg | 137 | ~96.6% |
| `test_4_multi_url_20` | 20 | ~111,497 avg | 417 | ~98.6% |
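
The averages in this table can be reproduced from the per-run tool token counts in the results summary. The sketch below also derives a rough tool-tokens-per-URL figure; that figure is simply the average divided by the URL count, and the `summarize` helper is illustrative, so it says nothing about how individual pages differ in size.

```python
def summarize(url_count, tool_tokens):
    """Return (average tool tokens, average tool tokens per URL), rounded."""
    avg = sum(tool_tokens) / len(tool_tokens)
    return round(avg), round(avg / url_count)


# (URL count, tool tokens for runs 1-3), copied from the results summary.
runs = {
    "test_1_single_html": (1, [3142, 3151, 3141]),
    "test_3_multi_url_5": (5, [27564, 27579, 27572]),
    "test_4_multi_url_20": (20, [111401, 111714, 111375]),
}

for name, (url_count, tool_tokens) in runs.items():
    avg, per_url = summarize(url_count, tool_tokens)
    print(f"{name}: avg={avg:,} per_url={per_url:,}")
```

For these inputs the averages match the table (3,145, 27,572 and 111,497), and the multi-URL tests work out to roughly 5,500 tool tokens per URL.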
Agent Ecosystem Testing