Agent Ecosystem Testing GitHub ↗

Key Findings for Gemini’s URL Context Tool, Gemini-interpreted


Gemini-interpreted Test Workflow:

1. Call the Gemini API with the URL context tool enabled
2. Give Gemini a detailed prompt asking it to describe what it retrieved:
   content length, structure, completeness, any failures
3. Gemini fetches each URL via its pre-retrieval step, then generates a
   response reflecting on what it received
4. Capture Gemini's full text response as the interpreted finding
5. Also capture `url_context_metadata` and `usage_metadata` for
   cross-referencing against the raw track
6. The gap between Gemini's self-report and the raw metadata is itself
   a finding, and discrepancies belong in the spec
7. Results stored in `gemini-url-context/results/gemini-interpreted/`

Results Summary

3 runs: gemini-2.5-flash, free tier daily cap 20 RPD reached after run 3 test 7

Test URLs Req URLs OK Tokens - r1/r2/r3 Result
test_1_single_html 1 1 3,142/3,151/3,141 Consistent across runs
test_2_single_pdf 1 0 147/151/156 URL_RETRIEVAL_STATUS_ERROR
test_3_multi_url_5 5 5 27,564/27,579/27,572 Consistent across runs
test_4_multi_url_20 20 20 111,401/111,714/111,375 Consistent across runs
test_5_multi_url_21 21 0 400 INVALID_ARGUMENT
test_6_unsupported_youtube 1 1 1,288/1,291/1,570 Succeeded - docs say unsupported. Token variance in run 3
test_7_unsupported_google_doc 1 0 — / 181/192 429 and URL_RETRIEVAL_STATUS_ERROR - r1 rate-limited, r2 & r3 confirm
test_8_json_content 1 0 — /133/— 429 and r2: URL_RETRIEVAL_STATUS_ERROR - r1 & r3 daily quota exhausted
# Finding Tests Observed Spec contribution
1 20-URL limit is a hard API error test_5 all 3 runs 400 INVALID_ARGUMENT: "Number of urls to lookup exceeds the limit (21 > 20)". url_context_metadata empty, zero URL content tokens consumed. Reproduced on all 3 runs. Limit is enforced at the API layer before retrieval. Not truncation or silent dropping.
2 YouTube succeeds despite being documented as unsupported test_6 all 3 runs URL_RETRIEVAL_STATUS_SUCCESS all 3 runs. Tool tokens: 1,288 / 1,291 / 1,570. Run 3 returned ~22% more tokens, suggesting live-fetch vs. cache variation. Documented limitation doesn’t reflect current behavior on gemini-2.5-flash as of March 2026. Token variance across runs suggests cache vs. live-fetch switching.
3 PDF retrieval failed consistently on a valid public PDF test_2 all 3 runs URL_RETRIEVAL_STATUS_ERROR all 3 runs. Tool tokens: 147 / 151 / 156 - minimal, consistent error response. PDF is a documented supported type. PDF retrieval fails reliably for this W3C URL. Follow-up needed with a different PDF source before drawing a firm conclusion.
4 Google Docs fail at the retrieval layer, not the API layer test_7 runs 2 & 3 URL_RETRIEVAL_STATUS_ERROR with tool tokens 181 / 192. Request completes normally - no API-level error. Contrasts with test 5 which rejected at the API layer. Two distinct failure modes exist: API-layer rejection; hard error, zero retrieval vs. retrieval-layer failure, request completes, status in metadata.
5 JSON API endpoint failed retrieval test_8 r2 only URL_RETRIEVAL_STATUS_ERROR, 133 tool tokens. JSON is a documented supported type. GitHub API requires auth headers the tool can’t supply; r1 & r3 hit quota before this test. JSON support applies to public, unauthenticated endpoints. Endpoints requiring auth headers will return URL_RETRIEVAL_STATUS_ERROR.
6 Tool tokens dominate cost at scale: stable across runs test_1, test_3, test_4 all 3 runs At 20 URLs, tool tokens ~111,400 = ~98.6% of total cost across all 3 runs. Tool token counts vary <1% between runs. See token scaling table below. Use tool_use_prompt_token_count for cost estimation; it’s stable, reproducible, and accounts for ~98.6% of cost at 20 URLs.
7 url_context_metadata order is non-deterministic test_3, test_4 all 3 runs Metadata order shuffled relative to input order on every run. Shuffle pattern itself varied between runs, not a stable reordering. Match results by retrieved_url string, not array index.
8 Gemini’s character count estimates vary within and across runs test_1 across 3 runs, test_3 across 3 runs Test 1 same URL: 10,950 / 17,476 / 15,221 chars. Test 3 same URL in multi-URL context: ~11,360 / ~20,400 / ~11,500. Variance is large and non-directional. Interpreted character counts aren’t citable. tool_use_prompt_token_count - variance <1%, is the only reproducible proxy for content size.
9 Free tier imposes both per-minute and per-day limits test_7 run 1, test_8 run 3 Two distinct 429 errors: GenerateRequestsPerMinutePerProjectPerModel-FreeTier limit: 5 RPM in r1; GenerateRequestsPerDayPerProjectPerModel-FreeTier limit: 20 RPD in r3. 3 runs × ~7 tests exhausted the daily quota. Free tier: 5 RPM and 20 RPD on gemini-2.5-flash. Running interpreted + raw tracks in the same day will exhaust the daily limit. Plan across days or use a paid tier.
10 Duplicate response in test 3 was non-reproducible test_3 r1 only r1 produced the full results table twice in sequence; r2 & r3 didn’t reproduce it. Non-deterministic model output artifact. Treat as known non-determinism. Raw metadata is the authoritative source for retrieval counts and statuses.

Token scaling detail - averages across 3 runs

Test URLs Tool tokens - avg Prompt tokens % tool
test_1_single_html 1 3,145 65 ~86.7%
test_3_multi_url_5 5 27,572 137 ~96.6%
test_4_multi_url_20 20 111,497 417 ~98.6%