Key Findings for Gemini’s URL Context Tool, Raw
Test Workflow
- Call the Gemini API with the URL context tool enabled
- Give Gemini a minimal prompt, just enough to trigger URL retrieval
- Gemini fetches each URL via its pre-retrieval step, but isn’t asked to interpret, describe, or reflect on what it received
- Extract raw retrieval outcomes directly from url_context_metadata in the response object: retrieved_url and url_retrieval_status per URL
- Extract token accounting from usage_metadata: tool_use_prompt_token_count (URL content tokens) and prompt_token_count (text prompt tokens), recorded separately
- Run all analysis in Python: URL counts, status enum enumeration, success/failure rates, token breakdowns
- Ensure results are saved in google-gemini-url-context/results/raw/
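
The extraction steps above can be sketched roughly as follows. This is a minimal sketch, not the project's actual analysis script: it assumes the response has already been dumped to a plain dict, and the nesting (`candidates[0].url_context_metadata.url_metadata`) is an assumption about that dump. The function name `summarize_response`, the example URLs, and the token numbers in `sample` are hypothetical.

```python
from collections import Counter

SUCCESS = "URL_RETRIEVAL_STATUS_SUCCESS"


def summarize_response(response: dict) -> dict:
    """Pull raw retrieval outcomes and token accounting out of one response dict."""
    # Assumed nesting: candidates[0].url_context_metadata.url_metadata[*]
    candidates = response.get("candidates") or [{}]
    context = candidates[0].get("url_context_metadata") or {}
    entries = context.get("url_metadata") or []

    # One status per retrieved URL, keyed by the URL string itself.
    statuses = {e["retrieved_url"]: e["url_retrieval_status"] for e in entries}

    usage = response.get("usage_metadata") or {}
    return {
        "url_count": len(statuses),
        "urls_ok": sum(1 for s in statuses.values() if s == SUCCESS),
        "status_counts": dict(Counter(statuses.values())),
        # URL content tokens and text prompt tokens are kept separate.
        "tool_use_prompt_token_count": usage.get("tool_use_prompt_token_count"),
        "prompt_token_count": usage.get("prompt_token_count"),
    }


# Hypothetical response dump, for illustration only.
sample = {
    "candidates": [{
        "url_context_metadata": {
            "url_metadata": [
                {"retrieved_url": "https://example.com/a",
                 "url_retrieval_status": "URL_RETRIEVAL_STATUS_SUCCESS"},
                {"retrieved_url": "https://example.com/b.pdf",
                 "url_retrieval_status": "URL_RETRIEVAL_STATUS_ERROR"},
            ]
        }
    }],
    "usage_metadata": {"tool_use_prompt_token_count": 1234, "prompt_token_count": 56},
}
summary = summarize_response(sample)
```

A request rejected at the API layer carries no metadata at all, so the function falls back to empty containers and reports zero URLs and `None` token counts rather than raising.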
Results Summary
| Test | URLs | URLs OK | Tool Tokens (r4 / r5) | Result |
|---|---|---|---|---|
| test_1_single_html | 1 | 1 | 3,099 / 3,128 | Consistent across runs |
| test_2_single_pdf | 1 | 0 | 119 / 126 | URL_RETRIEVAL_STATUS_ERROR consistent |
| test_3_multi_url_5 | 5 | 5 | 27,508 / 27,506 | Consistent across runs |
| test_4_multi_url_20 | 20 | 20 | 111,326 / 111,326 | Consistent across runs |
| test_5_multi_url_21 | 21 | 0 | null | 400 INVALID_ARGUMENT consistent |
| test_6_unsupported_youtube | 1 | 1 | 1,584 / 1,570 | URL_RETRIEVAL_STATUS_SUCCESS; docs say unsupported |
| test_7_unsupported_google_doc | 1 | 0 | 162 / 219 | URL_RETRIEVAL_STATUS_ERROR consistent |
| test_8_json_content | 1 | r1-2: 1; r4-5: 0 | 116 / 112 | Non-deterministic: succeeded r1-2, failed r4-5 |
Five raw-track runs on gemini-2.5-flash: runs 1–3 on the free tier (daily cap exhausted after test 2 of run 3), runs 4–5 on the paid tier. Canonical results are from runs 4–5.
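
The "Result" labels in the table can be derived mechanically from per-run statuses. The sketch below is illustrative only: `classify_runs` and `relative_spread` are hypothetical helper names, and the spread formula `(max - min) / min` is an assumption about how run-to-run variation might be measured, not necessarily the definition used in this write-up. The status values for test_2 and test_8 and the test_6 token counts are taken from the findings reported here.

```python
def classify_runs(status_by_run: dict) -> str:
    """Label a test as consistent or non-deterministic across runs."""
    distinct = sorted(set(status_by_run.values()))
    if len(distinct) == 1:
        return f"{distinct[0]} consistent"
    return "Non-deterministic"


def relative_spread(values: list) -> float:
    """Run-to-run spread as (max - min) / min; assumed definition."""
    return (max(values) - min(values)) / min(values)


# Per-run statuses as reported for two of the tests.
test_2 = {"r4": "URL_RETRIEVAL_STATUS_ERROR", "r5": "URL_RETRIEVAL_STATUS_ERROR"}
test_8 = {
    "r1": "URL_RETRIEVAL_STATUS_SUCCESS",
    "r2": "URL_RETRIEVAL_STATUS_SUCCESS",
    "r4": "URL_RETRIEVAL_STATUS_ERROR",
    "r5": "URL_RETRIEVAL_STATUS_ERROR",
}

label_2 = classify_runs(test_2)
label_8 = classify_runs(test_8)

# Tool tokens for test_6 in r1, r4, r5.
youtube_spread = relative_spread([1525, 1584, 1570])
```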
Truncation Analysis
| # | Finding | Tests | Observation | Spec Detail |
|---|---|---|---|---|
| 1 | 20-URL limit is a hard limit | test_5 r3-5 | 400 INVALID_ARGUMENT: "Number of urls to lookup exceeds the limit (21 > 20)". Zero URL content tokens consumed. Reproduced on all clean runs. | Limit enforced at the API layer before retrieval. Not truncation or silent dropping. |
| 2 | YouTube succeeds despite being documented as unsupported | test_6 r1, r4-5 | URL_RETRIEVAL_STATUS_SUCCESS on all clean runs. Tool tokens: 1,525 / 1,584 / 1,570; variance <4%. | Documented limitation doesn’t reflect current behavior on gemini-2.5-flash as of March 2026. |
| 3 | PDF retrieval fails consistently on a valid public PDF | test_2 | URL_RETRIEVAL_STATUS_ERROR on every run. Tool tokens: 119–126 (minimal, consistent). PDF is a documented supported type. | PDF retrieval fails reliably for this W3C URL; a follow-up with a different source is needed before drawing a firm conclusion. |
| 4 | Google Docs fail at retrieval layer, not API layer | test_7 r1, r4-5 | URL_RETRIEVAL_STATUS_ERROR, tool tokens 156–219. Request completes normally. | Two failure modes: API-layer rejection (hard error, zero tokens, as in test_5) vs. retrieval-layer failure (request completes, status recorded in metadata). |
| 5 | JSON API endpoint retrieval is non-deterministic | test_8 | URL_RETRIEVAL_STATUS_SUCCESS in r1-2 (~2,490 tool tokens); URL_RETRIEVAL_STATUS_ERROR in r4-5 (112–116 tool tokens). No change in endpoint or prompt between runs. | Handling of application/json responses from this endpoint is unreliable; treat JSON API endpoints as non-deterministic until confirmed with a stable public endpoint. |
| 6 | Tool tokens dominate cost at scale | test_1, test_3, test_4 r4-5 | 20 URLs: 111,326 tool tokens in r4-5 (0% variance); 5 URLs: 27,506–27,508; 1 URL: 3,099–3,134. | tool_use_prompt_token_count is reproducible to <1% across runs and accounts for ~98.6% of total cost at 20 URLs; use it for cost estimation. |
| 7 | url_context_metadata order non-deterministic | test_3, test_4 r4-5 | Metadata order shuffled relative to input order on every run; pattern varies between runs. | Match results by retrieved_url string, not array index |
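
Findings 1 and 7 both suggest defensive handling on the client side. The sketch below shows one way that might look: batch URLs so no request exceeds the observed limit, and join results on the URL string instead of position. The helper names `chunk_urls` and `match_statuses` are hypothetical, the metadata entries are plain dicts with an assumed shape, and the example.com URLs are placeholders. It also assumes `retrieved_url` comes back identical to the requested string; if the API normalizes or redirects a URL, the lookup returns `None` and the entry needs manual review.

```python
MAX_URLS_PER_REQUEST = 20  # limit observed in test_5 (21 > 20 was rejected)


def chunk_urls(urls: list, size: int = MAX_URLS_PER_REQUEST) -> list:
    """Split a URL list into batches that stay within the per-request limit."""
    return [urls[i:i + size] for i in range(0, len(urls), size)]


def match_statuses(requested_urls: list, url_metadata: list) -> dict:
    """Map each requested URL to its status by string match, not array index."""
    by_url = {e["retrieved_url"]: e["url_retrieval_status"] for e in url_metadata}
    # None marks a requested URL with no exactly matching metadata entry.
    return {url: by_url.get(url) for url in requested_urls}


# 21 placeholder URLs become one batch of 20 and one batch of 1.
urls = [f"https://example.com/page-{i}" for i in range(21)]
batches = chunk_urls(urls)

# Metadata returned in a different order than the request.
requested = ["https://example.com/a", "https://example.com/b"]
shuffled_metadata = [
    {"retrieved_url": "https://example.com/b",
     "url_retrieval_status": "URL_RETRIEVAL_STATUS_ERROR"},
    {"retrieved_url": "https://example.com/a",
     "url_retrieval_status": "URL_RETRIEVAL_STATUS_SUCCESS"},
]
matched = match_statuses(requested, shuffled_metadata)
```

Keying on the URL string keeps the join correct regardless of how the metadata array is ordered on a given run.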
Token Scaling Details
| Test | URLs | Tool Tokens | % Total |
|---|---|---|---|
| test_1_single_html | 1 | ~3,114 avg | ~86% |
| test_3_multi_url_5 | 5 | ~27,507 avg | ~98.6% |
| test_4_multi_url_20 | 20 | ~111,326 avg | ~98.9% |
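
For rough cost estimation, the averages in the table can be turned into per-URL figures. The sketch below is a back-of-the-envelope calculation using only the table's averages; `estimate_tool_tokens` is a hypothetical helper, and the resulting numbers apply to these specific test pages. Token counts depend on page content, so they should not be read as a general per-URL rate.

```python
# Average tool tokens per test, keyed by URL count (from the table above).
avg_tool_tokens = {1: 3114, 5: 27507, 20: 111326}

# Average tool tokens per URL for each test.
per_url = {n: tokens / n for n, tokens in avg_tool_tokens.items()}


def estimate_tool_tokens(url_count: int, tokens_per_url: float) -> int:
    """Naive linear estimate; assumes pages of similar size to the sampled ones."""
    return round(url_count * tokens_per_url)


# Example: estimate a 10-URL request from the 20-URL per-URL average.
estimate_10 = estimate_tool_tokens(10, per_url[20])
```

The single-URL test averages fewer tokens per URL than the multi-URL tests, which is a reminder that the mix of pages matters more than the count alone.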