GoogleLLMClient.jl

Julia client for Google's Gemini Developer API (generativelanguage.googleapis.com — not Vertex AI). A sibling to AnthropicClient.jl and GroqClient.jl — same public surface and Reply layout — built for long-running batch and pipeline workloads. Defaults target gemini-3.1-flash-lite.

Features

chat / chat_async against :generateContent with HTTP keep-alive pooling and x-goog-api-key auth.
thinking_level (gemini-3.x) / thinking_budget (2.5) passthrough.
response_schema structured output via responseMimeType + responseSchema (the shape v1beta accepts for both gemini-3.x and 2.5).
Per-client sliding-window RPM semaphore shared across concurrent calls.
Per-reply token + USD cost accounting. output_tokens includes thinking tokens (thoughtsTokenCount), which Gemini bills as output.
Budget wrapper that throws BudgetExceeded on cap.
Retry-After-aware 429 handling; bounded exponential backoff on 5xx.
Stub-friendly: body-building and reply-parsing are pure functions.
Base.show never prints the API key.
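
As an illustration of the retry policy listed above, here is a minimal sketch of how a delay could be chosen. The function name `retry_delay`, its keyword arguments, and the default base and cap values are hypothetical and are not taken from the package; the actual implementation may differ.

```julia
# Hypothetical helper, not part of GoogleLLMClient.jl's documented API.
# Returns the number of seconds to wait before retrying, or `nothing`
# when the response should not be retried.
function retry_delay(status::Integer, attempt::Integer;
                     retry_after::Union{Nothing,Real} = nothing,
                     base::Real = 1.0, cap::Real = 30.0)
    # Bounded exponential backoff: base, 2*base, 4*base, ... capped at `cap`.
    backoff = min(Float64(cap), base * 2.0^(attempt - 1))
    if status == 429
        # Prefer the server's Retry-After value (in seconds) when present.
        return retry_after === nothing ? backoff : Float64(retry_after)
    elseif 500 <= status <= 599
        return backoff
    end
    return nothing  # other statuses are not retried in this sketch
end

retry_delay(429, 1; retry_after = 7)   # 7.0
retry_delay(503, 3)                    # 4.0
retry_delay(503, 10)                   # 30.0 (capped)
retry_delay(400, 1)                    # nothing
```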

Install

using Pkg
Pkg.add(url="https://github.com/PelehAI/GoogleLLMClient.jl")

Set your API key in the environment (either name works):

export GEMINI_API_KEY=...      # or GOOGLE_API_KEY

Quick start

using GoogleLLMClient

c = Client(
    api_key       = ENV["GEMINI_API_KEY"],
    model_default = "gemini-3.1-flash-lite",
    rpm           = 15,
)

reply = chat(c;
    system     = "You are a helpful assistant.",
    messages   = [(:user, "Say hi.")],
    max_tokens = 64,
)
@show reply.text reply.cost_usd reply.input_tokens reply.output_tokens

messages accepts Msg, (:user, "...") tuples, or :user => "..." pairs. Roles :user/:assistant map to Gemini's user/model. system becomes the request's systemInstruction.
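
To make the mapping concrete, the sketch below shows one way the tuple and pair forms could be normalized into the `contents` and `systemInstruction` fields of a `:generateContent` body. The names `ROLE_MAP`, `to_content`, and `build_body` are hypothetical; the package's own body-building function may be organized differently.

```julia
# Hypothetical sketch of the role mapping; the package's internals may differ.
const ROLE_MAP = Dict(:user => "user", :assistant => "model")

# Accept either a (role, text) tuple or a role => text pair.
to_content(m::Tuple{Symbol,AbstractString}) = to_content(m[1] => m[2])
function to_content(m::Pair{Symbol,<:AbstractString})
    role = get(ROLE_MAP, m.first) do
        throw(ArgumentError("unsupported role: $(m.first)"))
    end
    return Dict("role" => role, "parts" => [Dict("text" => m.second)])
end

# Build the request body; `system` becomes systemInstruction when given.
function build_body(; system = nothing, messages)
    body = Dict{String,Any}("contents" => [to_content(m) for m in messages])
    if system !== nothing
        body["systemInstruction"] = Dict("parts" => [Dict("text" => system)])
    end
    return body
end

body = build_body(system   = "You are a helpful assistant.",
                  messages = [(:user, "Say hi."), :assistant => "Hi!"])
body["contents"][2]["role"]   # "model"
```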

Thinking

gemini-3.x models take a thinking level; gemini-2.5 models take a token budget. The client picks the right field from the model id:

# gemini-3.x
chat(c; messages=[(:user, "…")], max_tokens=512, thinking_level="minimal")  # or low/medium/high

# gemini-2.5-flash-lite
chat(c; model="gemini-2.5-flash-lite", messages=[(:user, "…")],
        max_tokens=512, thinking_budget=512)

Thinking tokens are billed as output and are included in reply.output_tokens.
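
The selection by model id might look roughly like the sketch below. The helper name `thinking_config`, the prefix test on the model id, and the `thinkingLevel` / `thinkingBudget` field names are assumptions made for illustration, not a description of the package's code.

```julia
# Hypothetical helper; the prefix checks and field names are assumptions.
# Returns a Dict for the request's thinking configuration, or `nothing`
# when no applicable thinking option was supplied.
function thinking_config(model::AbstractString;
                         thinking_level = nothing, thinking_budget = nothing)
    if startswith(model, "gemini-3")
        thinking_level === nothing && return nothing
        return Dict("thinkingLevel" => thinking_level)
    elseif startswith(model, "gemini-2.5")
        thinking_budget === nothing && return nothing
        return Dict("thinkingBudget" => thinking_budget)
    end
    return nothing  # unknown model family: send no thinking field
end

thinking_config("gemini-3.1-flash-lite"; thinking_level = "minimal")
thinking_config("gemini-2.5-flash-lite"; thinking_budget = 512)
```

In this sketch an option that does not apply to the model family is silently dropped; a stricter variant could throw instead.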

Structured output (JSON)

Pass a JSON schema Dict. It is wired as responseMimeType + responseSchema — the shape v1beta's :generateContent accepts for both gemini-3.x and 2.5:

schema = Dict(
    "type" => "object",
    "properties" => Dict("steps" => Dict("type" => "array",
                                         "items" => Dict("type" => "string"))),
    "required" => ["steps"],
)

reply = chat(c;
    messages        = [(:user, "Outline a talk on caching.")],
    max_tokens      = 512,
    response_schema = schema,
)

Note: the nested responseFormat shape (Vertex / v1alpha) is rejected by v1beta with HTTP 400, so this client always uses the responseMimeType + responseSchema pair — for every model generation.
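
For reference, a minimal sketch of how the pair could be placed into the request's generation config. The function name `generation_config` is hypothetical, and putting `max_tokens` under `maxOutputTokens` is an assumption about how the client wires that option.

```julia
# Hypothetical sketch; the package's own body builder may differ.
function generation_config(; max_tokens, response_schema = nothing)
    cfg = Dict{String,Any}("maxOutputTokens" => max_tokens)
    if response_schema !== nothing
        # The flat pair accepted by v1beta, as described above.
        cfg["responseMimeType"] = "application/json"
        cfg["responseSchema"]   = response_schema
    end
    return cfg
end

schema = Dict(
    "type" => "object",
    "properties" => Dict("steps" => Dict("type" => "array",
                                         "items" => Dict("type" => "string"))),
    "required" => ["steps"],
)

cfg = generation_config(max_tokens = 512, response_schema = schema)
cfg["responseMimeType"]   # "application/json"
```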

Caching

Gemini does implicit caching automatically — there is no per-block marker. Hits appear as reply.cached_read_tokens (billed at the discounted cache-read rate); reply.cached_write_tokens is always 0. The cache flag on Msg/SystemPrompt exists only for parity with AnthropicClient.jl and is ignored.
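
A worked example of how cached reads and thinking tokens could enter the cost figure. The prices below are placeholders chosen for easy arithmetic, not real Gemini rates. The function name `cost_usd` is hypothetical, and the sketch assumes that `promptTokenCount` in the usage metadata already includes the cached tokens.

```julia
# Placeholder prices in USD per million tokens (NOT real rates).
const PRICES = (input = 1.00, cached = 0.25, output = 4.00)

# Hypothetical sketch of per-reply cost accounting from usage metadata.
function cost_usd(usage::AbstractDict; prices = PRICES)
    prompt   = get(usage, "promptTokenCount", 0)
    cached   = get(usage, "cachedContentTokenCount", 0)
    answer   = get(usage, "candidatesTokenCount", 0)
    thoughts = get(usage, "thoughtsTokenCount", 0)
    fresh  = prompt - cached      # assumes the prompt count includes cached tokens
    output = answer + thoughts    # thinking tokens are billed as output
    total  = fresh * prices.input + cached * prices.cached + output * prices.output
    return total / 1_000_000
end

usage = Dict("promptTokenCount" => 10_000, "cachedContentTokenCount" => 8_000,
             "candidatesTokenCount" => 300, "thoughtsTokenCount" => 200)
cost_usd(usage)   # 2_000 fresh + 8_000 cached + 500 output tokens
```

With these placeholder prices each of the three terms contributes 2000 micro-dollars, so the call comes to about 0.006 USD.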

Concurrency, cost, budgets, stub mode

These behave as in the sibling clients. chat_async calls share one RPM budget per client; each Reply carries token counts and cost_usd; Budget(c; max_usd=…) enforces a spending cap and throws BudgetExceeded when it is reached; a Client built without a key reports has_key(c) == false, and chat on it throws, so guard calls with has_key. See the GroqClient.jl and AnthropicClient.jl READMEs for details; the APIs match.
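
To illustrate what a shared RPM budget means, here is a minimal sliding-window gate. It is a sketch of the general technique only: the type `SlidingWindow` and the functions `try_acquire!` and `acquire!` are hypothetical names, and the package's own limiter may be built differently.

```julia
# Hypothetical sketch of a sliding-window RPM gate shared by concurrent tasks.
struct SlidingWindow
    rpm::Int
    stamps::Vector{Float64}   # start times (seconds) of calls in the last 60 s
    lk::ReentrantLock
end
SlidingWindow(rpm::Integer) = SlidingWindow(rpm, Float64[], ReentrantLock())

# Records the call and returns 0.0 if a slot is free at time `now`;
# otherwise returns the seconds until the oldest call leaves the window.
function try_acquire!(w::SlidingWindow, now::Float64 = time())
    lock(w.lk) do
        filter!(t -> now - t < 60.0, w.stamps)   # drop calls older than 60 s
        if length(w.stamps) < w.rpm
            push!(w.stamps, now)
            return 0.0
        end
        return 60.0 - (now - first(w.stamps))
    end
end

# Block the calling task until a slot is free.
function acquire!(w::SlidingWindow)
    delay = try_acquire!(w)
    while delay > 0
        sleep(delay)
        delay = try_acquire!(w)
    end
    return nothing
end

w = SlidingWindow(2)
try_acquire!(w, 0.0)    # 0.0  (slot taken)
try_acquire!(w, 1.0)    # 0.0  (slot taken)
try_acquire!(w, 2.0)    # 58.0 (window full until t = 60)
```

Passing the clock value explicitly keeps `try_acquire!` deterministic, which suits the wiring-only style of testing this README describes.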

Health & speed probes

has_key only tells you a key string is set, not that it works. Two live probes go further — both make minimal real calls (a few output tokens) and never throw:

hc = healthcheck(c)              # one minimal call, classified
hc.ok, hc.status                 # e.g. (true, :ok) or (false, :billing)

sp = speedtest(c; n = 5)         # n concurrent calls under the rpm cap
sp.throughput_rps, sp.latency_median_ms

healthcheck returns a HealthStatus whose status is one of :ok, :no_key, :auth, :quota, :billing, :bad_request, :server, :network, :error — enough for a dashboard to show green/red and say why. speedtest returns a SpeedResult (ok / rate-limited / failed counts, achieved throughput_rps, and min/median/max latency). Both short-circuit on a keyless client.
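
A dashboard could turn the status symbol into a one-line message along these lines. The helper `status_line` and the hint texts are hypothetical; the hints are interpretations of the status names listed above rather than documented meanings.

```julia
# Hypothetical dashboard helper; hint texts are interpretations of the
# status names, not documented meanings.
const STATUS_HINTS = Dict(
    :ok          => "healthy",
    :no_key      => "no API key set (GEMINI_API_KEY or GOOGLE_API_KEY)",
    :auth        => "key rejected",
    :quota       => "rate or quota limit hit",
    :billing     => "billing problem on the account",
    :bad_request => "request rejected as malformed",
    :server      => "server-side error",
    :network     => "could not reach the API",
    :error       => "unclassified error",
)

# `ok` and `status` mirror the hc.ok and hc.status fields shown above.
function status_line(ok::Bool, status::Symbol)
    light = ok ? "GREEN" : "RED"
    hint  = get(STATUS_HINTS, status, "unrecognized status")
    return "$light $status: $hint"
end

status_line(true, :ok)         # "GREEN ok: healthy"
status_line(false, :billing)   # "RED billing: billing problem on the account"
```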

Testing

julia --project=. -e 'using Pkg; Pkg.instantiate(); Pkg.test()'

All tests are pure-function / wiring-only — no live API calls.

Roadmap

Streaming (:streamGenerateContent)
Multimodal inputs (image / PDF parts)
Explicit caching (Caches API) for guaranteed cache savings
Tool use / function calling

Used by

peleh.ai — academic paper to slide deck.

License

MIT. See LICENSE.
