Agent Ecosystem Testing

Gemini-interpreted vs Raw

Two Python scripts test the same URL context behaviors:

Gemini-interpreted captures what Gemini believes it retrieved: how much content it saw, whether a fetch felt complete, how it characterizes a failure. This is the model's self-report. The raw track captures what the API actually returned: exact url_retrieval_status codes, exact token counts from usage_metadata, exact URL counts in url_context_metadata. These are Python len() calls and dictionary lookups, not model estimates.

The gap between these two tracks is itself a finding. If Gemini reports a successful fetch in prose but the raw url_retrieval_status is URL_RETRIEVAL_STATUS_ERROR, that discrepancy belongs in the spec.
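A minimal sketch of that discrepancy check. The metadata shape mirrors the Gemini REST field names (url_context_metadata.url_metadata[] with url_retrieval_status); the URLs, status values, and the naive "does the prose say success" heuristic are illustrative, not the test suite's actual logic.

```python
def find_status_gaps(model_text: str, url_context_metadata: dict) -> list:
    """Return URLs where Gemini's prose claims success but the raw
    url_retrieval_status disagrees - the gap that belongs in the spec."""
    gaps = []
    claims_success = "success" in model_text.lower()  # crude self-report check
    for entry in url_context_metadata.get("url_metadata", []):
        status = entry.get("url_retrieval_status", "")
        if claims_success and status != "URL_RETRIEVAL_STATUS_SUCCESS":
            gaps.append((entry.get("retrieved_url"), status))
    return gaps

# Illustrative raw metadata: one successful fetch, one error.
raw = {
    "url_metadata": [
        {"retrieved_url": "https://example.com/a",
         "url_retrieval_status": "URL_RETRIEVAL_STATUS_SUCCESS"},
        {"retrieved_url": "https://example.com/b",
         "url_retrieval_status": "URL_RETRIEVAL_STATUS_ERROR"},
    ]
}

# Gemini's prose says everything worked; the raw status says otherwise.
print(find_status_gaps("The fetch was a success for both URLs.", raw))
```

If the list is non-empty, the model's self-report and the API's ground truth diverge for those URLs.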

|  | url_context_test.py | url_context_test_raw.py |
| --- | --- | --- |
| Measures | Gemini's interpretation of what it fetched | Raw metadata extracted directly from the response object |
| URL counts | Gemini estimates; may vary between runs | Python len() on url_context_metadata.url_metadata - exact |
| Retrieval status | Gemini's prose description of success/failure | url_retrieval_status string from the API - exact enum value |
| Token counts | Gemini may reference or ignore token usage | usage_metadata.tool_use_prompt_token_count - exact, reproducible |
| max_output_tokens | Not set - Gemini writes full assessments | Set to 128 - minimal output, metadata is the signal |
| Token cost per run | Higher - Gemini writes long self-assessments | Lower - minimal prompt, capped output |
| Best used for | Understanding what the model perceives it received | Citable measurements for the spec |

Gemini URL Context vs Claude Web Fetch

The Claude web fetch track tests web_fetch - a mid-generation tool call:

1. Claude decides when to fetch based on prompts and/or URL availability
2. Claude API retrieves the content
3. The content comes back as a tool result
4. Claude continues generation, interpreting the tool result
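The mid-generation flow above starts with the tool declaration in the request: Claude only fetches if it decides to call the tool while generating. A sketch of that request shape, assuming Anthropic's published web fetch tool type string; the model name, URL, and token cap are illustrative.

```python
import json

# Declaring web_fetch as a server-side tool; Claude may invoke it
# mid-generation, and the fetched page returns as a tool_result block.
request = {
    "model": "claude-sonnet-4-5",
    "max_tokens": 1024,
    "messages": [{"role": "user",
                  "content": "Fetch and summarize https://example.com/a"}],
    "tools": [{
        "type": "web_fetch_20250910",   # versioned tool type (assumption: current version)
        "name": "web_fetch",
        "max_content_tokens": 50_000,   # truncation boundary exposed as a parameter
    }],
}

print(json.dumps(request["tools"][0]["name"]))
```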

Gemini’s URL context tool completes retrieval before generation begins:

1. Gemini API attempts to fetch each URL from an internal index cache
2. If not cached, the API falls back to a live fetch
3. URL context tool injects retrieved content into the `gemini-2.5-flash` context window
4. `gemini-2.5-flash` generates a response from the pre-loaded content
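By contrast, the pre-generation flow above is enabled entirely by listing the url_context tool in the request: the API fetches every URL found in the prompt before generation starts, with no tool-call decision by the model. A sketch using the REST request shape; the prompt text and the raw track's 128-token cap follow the table above, while the URL is illustrative.

```python
import json

# Listing url_context makes the API resolve URLs in the prompt
# (cache first, then live fetch) before gemini-2.5-flash generates.
request = {
    "contents": [{"parts": [{"text": "Summarize https://example.com/a"}]}],
    "tools": [{"url_context": {}}],                # pre-generation retrieval
    "generationConfig": {"maxOutputTokens": 128},  # raw track: metadata is the signal
}

print(json.dumps(request["tools"]))
```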

This architecture makes the raw track cleaner to implement than the Claude track: url_context_metadata is a first-class response field rather than something extracted from tool-use blocks, and token accounting is already split in the response object between the text prompt (prompt_token_count) and the retrieved URL content (tool_use_prompt_token_count), so the two cost components arrive pre-separated.
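A sketch of that pre-separated cost split. The field names match the Gemini response's usage_metadata; the token numbers are illustrative.

```python
# usage_metadata as returned in the response object: the text prompt
# and the injected URL content are counted under separate fields.
usage_metadata = {
    "prompt_token_count": 42,             # text prompt only
    "tool_use_prompt_token_count": 3120,  # retrieved URL content
    "candidates_token_count": 128,        # generated output
}

prompt_cost = usage_metadata["prompt_token_count"]
retrieval_cost = usage_metadata["tool_use_prompt_token_count"]
total_input = prompt_cost + retrieval_cost  # no parsing of tool blocks needed

print(f"prompt={prompt_cost} retrieval={retrieval_cost} total_input={total_input}")
```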

Character Count Estimation Variation Hypothesis

Unlike Claude Code, where Haiku summarizes before the main model sees the content, Claude's web fetch and Gemini's URL context handle summarization in-model. In this test suite, gemini-2.5-flash does both the retrieval orchestration and the generation: it receives the retrieved content directly and summarizes in one step. Claude's web fetch similarly filters fetched content via code execution rather than delegating to a separate model, and it exposes truncation behavior through observable cutoffs and the max_content_tokens parameter. Gemini's URL context, by contrast, treats summarization as a use case rather than a pipeline stage, with no equivalent parameter or cutoff boundary, which makes it harder to test whether summarization is truncating content or simply rephrasing it.

Without a field exposing the content injected into the context window, run-to-run differences can't be definitively attributed; Gemini's two-step retrieval process means internal cache vs. live fetch could deliver different content across runs. But the raw token counts are stable (under 1% variance), which suggests the retrieved content was consistent, pointing to gemini-2.5-flash summarizing the same content differently across runs rather than receiving different content.
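A sketch of the stability check behind that under-1% claim: compare tool_use_prompt_token_count across runs and treat a near-zero relative spread as evidence the injected content did not change. The per-run counts here are illustrative, not measured values.

```python
# tool_use_prompt_token_count from repeated runs of the same prompt
# (illustrative numbers - substitute the raw track's recorded values).
counts = [3120, 3118, 3121, 3120]

mean = sum(counts) / len(counts)
spread = (max(counts) - min(counts)) / mean  # relative run-to-run variation

print(f"relative spread: {spread:.4%}")
if spread < 0.01:
    # Retrieval was consistent; output differences are summarization, not content.
    print("retrieved content stable across runs")
```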


Agent Docs Spec - Known Platform Limits

Both tracks agree on the following, which constitute the citable findings for the spec:

URL Limits

Retrieval Status

Token Accounting

Rate Limits - Free Tier, gemini-2.5-flash