GoogleLLMClient.jl


Julia client for Google's Gemini Developer API (generativelanguage.googleapis.com, not Vertex AI). A sibling to AnthropicClient.jl and GroqClient.jl, with the same public surface and Reply layout, built for long-running batch and pipeline workloads. The default model is gemini-3.1-flash-lite.

Features

  • chat / chat_async against :generateContent with HTTP keep-alive pooling and x-goog-api-key auth.
  • thinking_level (gemini-3.x) / thinking_budget (2.5) passthrough.
  • response_schema structured output via responseMimeType + responseSchema (the shape v1beta accepts for both gemini-3.x and 2.5).
  • Per-client sliding-window RPM semaphore shared across concurrent calls.
  • Per-reply token + USD cost accounting. output_tokens includes thinking tokens (thoughtsTokenCount), which Gemini bills as output.
  • Budget wrapper that throws BudgetExceeded on cap.
  • retry-after-aware 429 handling; bounded exponential backoff on 5xx.
  • Stub-friendly: body-building and reply-parsing are pure functions.
  • Base.show never prints the API key.
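
To make the sliding-window RPM idea in the list above concrete, here is a minimal sketch of such a limiter. It is not the package's internal implementation: the names SlidingWindow, try_acquire! and acquire! are hypothetical, and the 60-second window is an assumption. Taking the clock as an argument keeps the core function deterministic, so it can be tested without sleeping.

```julia
# Hypothetical sketch; not GoogleLLMClient.jl's internal implementation.
mutable struct SlidingWindow
    rpm::Int                  # max requests allowed per window
    window_s::Float64         # window length in seconds (assumed 60)
    stamps::Vector{Float64}   # start times of requests still inside the window
    lk::ReentrantLock         # guards `stamps` across concurrent tasks
end

SlidingWindow(rpm::Integer; window_s::Real = 60.0) =
    SlidingWindow(Int(rpm), Float64(window_s), Float64[], ReentrantLock())

# Returns 0.0 and records the request if a slot is free at time `now`;
# otherwise returns the seconds until the oldest request leaves the window.
function try_acquire!(w::SlidingWindow, now::Float64 = time())
    lock(w.lk) do
        # Drop timestamps that have slid out of the window.
        while !isempty(w.stamps) && now - first(w.stamps) >= w.window_s
            popfirst!(w.stamps)
        end
        if length(w.stamps) < w.rpm
            push!(w.stamps, now)
            return 0.0
        end
        return w.window_s - (now - first(w.stamps))
    end
end

# Blocking variant: sleep until a slot opens, then take it.
function acquire!(w::SlidingWindow)
    while true
        wait_s = try_acquire!(w)
        wait_s == 0.0 && return nothing
        sleep(wait_s)
    end
end
```

Because every concurrent task goes through the same SlidingWindow instance, they all draw from one shared request budget, which is the behaviour the feature list describes.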

Install

using Pkg
Pkg.add(url="https://github.com/PelehAI/GoogleLLMClient.jl")

Set your API key in the environment (either name works):

export GEMINI_API_KEY=...      # or GOOGLE_API_KEY

Quick start

using GoogleLLMClient

c = Client(
    api_key       = ENV["GEMINI_API_KEY"],
    model_default = "gemini-3.1-flash-lite",
    rpm           = 15,
)

reply = chat(c;
    system     = "You are a helpful assistant.",
    messages   = [(:user, "Say hi.")],
    max_tokens = 64,
)
@show reply.text reply.cost_usd reply.input_tokens reply.output_tokens

messages accepts Msg, (:user, "...") tuples, or :user => "..." pairs. Roles :user/:assistant map to Gemini's user/model. system becomes the request's systemInstruction.
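
The mapping described above can be sketched as a pure body-building function. This is an illustration only: ROLE_MAP, role_text, to_contents and build_body are hypothetical names, the Msg type is left out, and the JSON layout (contents / parts / systemInstruction) follows the public :generateContent request shape as I understand it.

```julia
# Hypothetical helpers; the package's internal names may differ.
const ROLE_MAP = Dict(:user => "user", :assistant => "model")

# Accept either a (role, text) tuple or a role => text pair.
role_text(m::Tuple{Symbol,<:AbstractString}) = m
role_text(m::Pair{Symbol,<:AbstractString}) = (m.first, m.second)

# Convert client-style messages into Gemini `contents` entries.
function to_contents(messages)
    map(messages) do m
        role, text = role_text(m)
        Dict("role" => ROLE_MAP[role], "parts" => [Dict("text" => text)])
    end
end

# Build the request body; `system` is optional and becomes systemInstruction.
function build_body(; system = nothing, messages)
    body = Dict{String,Any}("contents" => to_contents(messages))
    if system !== nothing
        body["systemInstruction"] = Dict("parts" => [Dict("text" => system)])
    end
    return body
end

body = build_body(system = "You are a helpful assistant.",
                  messages = [(:user, "Say hi."), :assistant => "Hi!"])
```

Keeping this step free of I/O is what makes a client stub-friendly: the body can be asserted on in tests without any network call.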

Thinking

Gemini-3.x uses a thinking level; 2.5 uses a token budget. The client picks the right field from the model id:

# gemini-3.x
chat(c; messages=[(:user, "")], max_tokens=512, thinking_level="minimal")  # or low/medium/high

# gemini-2.5-flash-lite
chat(c; model="gemini-2.5-flash-lite", messages=[(:user, "")],
        max_tokens=512, thinking_budget=512)

Thinking tokens are billed as output and are included in reply.output_tokens.
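
One way the field selection could work is a prefix check on the model id. The sketch below is an assumption, not the package's code: thinking_config is a hypothetical name, the "gemini-3" / "gemini-2.5" prefix test is a guess at the dispatch rule, and the thinkingLevel / thinkingBudget keys are the thinkingConfig field names from the public API as I understand them.

```julia
# Hypothetical sketch of choosing the thinking field from the model id.
function thinking_config(model::AbstractString;
                         thinking_level = nothing, thinking_budget = nothing)
    if startswith(model, "gemini-3") && thinking_level !== nothing
        # gemini-3.x: named level such as "minimal", "low", "medium", "high"
        return Dict{String,Any}("thinkingLevel" => thinking_level)
    elseif startswith(model, "gemini-2.5") && thinking_budget !== nothing
        # gemini-2.5: integer token budget
        return Dict{String,Any}("thinkingBudget" => thinking_budget)
    end
    # Setting does not apply to this model, or none was given: omit the field.
    return nothing
end

thinking_config("gemini-3.1-flash-lite"; thinking_level = "minimal")
thinking_config("gemini-2.5-flash-lite"; thinking_budget = 512)
```

In this sketch a mismatched setting (for example a level passed to a 2.5 model) is silently dropped; whether the real client drops it or raises an error is not stated here.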

Structured output (JSON)

Pass a JSON schema Dict. It is wired as responseMimeType + responseSchema — the shape v1beta's :generateContent accepts for both gemini-3.x and 2.5:

schema = Dict(
    "type" => "object",
    "properties" => Dict("steps" => Dict("type" => "array",
                                         "items" => Dict("type" => "string"))),
    "required" => ["steps"],
)

reply = chat(c;
    messages        = [(:user, "Outline a talk on caching.")],
    max_tokens      = 512,
    response_schema = schema,
)

Note: the nested responseFormat shape (Vertex / v1alpha) is rejected by v1beta with HTTP 400, so this client always uses the responseMimeType + responseSchema pair — for every model generation.
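
For reference, the flat pair described above would sit in the request's generationConfig roughly as follows. This is a sketch under assumptions: generation_config is a hypothetical helper name, and mapping max_tokens to maxOutputTokens is my reading of the API rather than something this README states.

```julia
# Hypothetical helper showing the flat responseMimeType + responseSchema shape.
function generation_config(; max_tokens::Integer, response_schema = nothing)
    cfg = Dict{String,Any}("maxOutputTokens" => max_tokens)
    if response_schema !== nothing
        # Both keys are siblings inside generationConfig, not nested
        # under a responseFormat object.
        cfg["responseMimeType"] = "application/json"
        cfg["responseSchema"] = response_schema
    end
    return cfg
end

schema = Dict(
    "type" => "object",
    "properties" => Dict("steps" => Dict("type" => "array",
                                         "items" => Dict("type" => "string"))),
    "required" => ["steps"],
)

cfg = generation_config(max_tokens = 512, response_schema = schema)
```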

Caching

Gemini does implicit caching automatically — there is no per-block marker. Hits appear as reply.cached_read_tokens (billed at the discounted cache-read rate); reply.cached_write_tokens is always 0. The cache flag on Msg/SystemPrompt exists only for parity with AnthropicClient.jl and is ignored.
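
As a worked illustration of how cache hits lower cost, the sketch below prices a reply from its token counts. The prices are placeholders, not real Gemini rates, and the function name cost_usd is only borrowed from the reply field. It also assumes that input_tokens excludes the cached tokens; if the client reports them inclusively, the cached count would need to be subtracted first.

```julia
# Placeholder USD prices per million tokens; NOT real Gemini pricing.
const PRICE = (input = 0.10, cached_read = 0.025, output = 0.40)

# Assumes `input_tokens` does not already include `cached_read_tokens`.
function cost_usd(; input_tokens, cached_read_tokens, output_tokens, price = PRICE)
    (input_tokens * price.input +
     cached_read_tokens * price.cached_read +
     output_tokens * price.output) / 1_000_000
end

# Same prompt size with and without an implicit cache hit.
cold = cost_usd(input_tokens = 10_000, cached_read_tokens = 0,     output_tokens = 500)
warm = cost_usd(input_tokens = 2_000,  cached_read_tokens = 8_000, output_tokens = 500)
```

With these placeholder rates the warm call is cheaper because 8,000 of its prompt tokens are charged at the cache-read rate instead of the full input rate.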

Concurrency, cost, budgets, stub mode

Identical to the sibling clients: chat_async shares one RPM budget; each Reply carries token counts and cost_usd; Budget(c; max_usd=…) enforces a cap; a keyless Client reports has_key(c) == false and chat throws (guard with has_key). See the GroqClient.jl / AnthropicClient.jl READMEs — the APIs match.
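
The budget cap can be pictured as a running total that is checked after each reply. The sketch below is a stand-alone illustration, not the package's Budget type: SpendCap, CapExceeded and charge! are hypothetical names chosen to avoid implying they match the real API.

```julia
# Hypothetical stand-in for a spend cap; not the package's Budget type.
struct CapExceeded <: Exception
    spent::Float64
    max_usd::Float64
end

mutable struct SpendCap
    max_usd::Float64
    spent::Float64
    lk::ReentrantLock   # lets concurrent tasks share one cap safely
end

SpendCap(max_usd::Real) = SpendCap(Float64(max_usd), 0.0, ReentrantLock())

# Add one reply's cost to the running total; throw once the cap is crossed.
function charge!(cap::SpendCap, cost_usd::Real)
    lock(cap.lk) do
        cap.spent += cost_usd
        cap.spent > cap.max_usd && throw(CapExceeded(cap.spent, cap.max_usd))
        return cap.spent
    end
end
```

Note that in this sketch the cost is charged after the call has been made, so the total can overshoot the cap by at most one reply's cost. Whether the real wrapper checks before or after a call is not stated in this README.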

Health & speed probes

has_key only tells you a key string is set, not that it works. Two live probes go further — both make minimal real calls (a few output tokens) and never throw:

hc = healthcheck(c)              # one minimal call, classified
hc.ok, hc.status                 # e.g. (true, :ok) or (false, :billing)

sp = speedtest(c; n = 5)         # n concurrent calls under the rpm cap
sp.throughput_rps, sp.latency_median_ms

healthcheck returns a HealthStatus whose status is one of :ok, :no_key, :auth, :quota, :billing, :bad_request, :server, :network, :error — enough for a dashboard to show green/red and say why. speedtest returns a SpeedResult (ok / rate-limited / failed counts, achieved throughput_rps, and min/median/max latency). Both short-circuit on a keyless client.
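
To show how such a classification might look, here is a simplified sketch that maps an HTTP status code to a subset of the symbols above. It is an assumption, not the package's logic: classify_status is a hypothetical name, and a real classifier would probably also inspect the error body, since a status code alone cannot separate :quota from :billing and some key problems may come back as HTTP 400.

```julia
# Hypothetical, status-code-only classifier; covers a subset of the symbols.
function classify_status(http_status::Integer)
    http_status in 200:299 && return :ok
    http_status in (401, 403) && return :auth
    http_status == 429 && return :quota
    http_status == 400 && return :bad_request
    http_status in 500:599 && return :server
    return :error   # anything this sketch does not recognise
end

# A dashboard could reduce the result to green/red plus a reason.
status = classify_status(429)
label = status === :ok ? "green" : "red ($(status))"
```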

Testing

julia --project=. -e 'using Pkg; Pkg.instantiate(); Pkg.test()'

All tests are pure-function / wiring-only — no live API calls.

Roadmap

  • Streaming (:streamGenerateContent)
  • Multimodal inputs (image / PDF parts)
  • Explicit caching (Caches API) for guaranteed cache savings
  • Tool use / function calling

Used by

  • peleh.ai — academic paper to slide deck.

License

MIT. See LICENSE.
