Julia client for Google's Gemini Developer API
(generativelanguage.googleapis.com — not Vertex AI). A sibling to
AnthropicClient.jl and
GroqClient.jl — same public
surface and Reply layout — built for long-running batch and pipeline
workloads. Defaults target gemini-3.1-flash-lite.
- chat / chat_async against :generateContent with HTTP keep-alive pooling and x-goog-api-key auth.
- thinking_level (gemini-3.x) / thinking_budget (2.5) passthrough.
- response_schema structured output via responseMimeType + responseSchema (the shape v1beta accepts for both gemini-3.x and 2.5).
- Per-client sliding-window RPM semaphore shared across concurrent calls.
- Per-reply token + USD cost accounting. output_tokens includes thinking tokens (thoughtsTokenCount), which Gemini bills as output.
- Budget wrapper that throws BudgetExceeded on cap.
- retry-after-aware 429 handling; bounded exponential backoff on 5xx.
- Stub-friendly: body-building and reply-parsing are pure functions.
- Base.show never prints the API key.
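
To make the retry behaviour in the list above concrete, here is a minimal sketch of a bounded exponential backoff that honours a retry-after value. It is not the package's code: `backoff_delay`, the 0.5 s base, and the 30 s cap are all assumptions chosen for illustration.

```julia
# Hypothetical helper, not part of GoogleLLMClient: returns the number of
# seconds to wait before retry number `attempt` (1-based).
function backoff_delay(attempt::Integer; base::Float64 = 0.5, cap::Float64 = 30.0,
                       retry_after::Union{Nothing,Float64} = nothing)
    # A server-supplied retry-after (seconds) takes priority, still bounded by `cap`.
    retry_after !== nothing && return min(retry_after, cap)
    # Otherwise double the delay on every attempt, bounded by `cap`.
    return min(base * 2.0^(attempt - 1), cap)
end

delays = [backoff_delay(n) for n in 1:8]
```

A production version would probably add random jitter so that concurrent callers do not retry in lockstep; the sketch leaves it out to keep the arithmetic visible.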
    using Pkg
    Pkg.add(url="https://github.com/PelehAI/GoogleLLMClient.jl")

Set your API key in the environment (either name works):

    export GEMINI_API_KEY=... # or GOOGLE_API_KEY

    using GoogleLLMClient

    c = Client(
        api_key = ENV["GEMINI_API_KEY"],
        model_default = "gemini-3.1-flash-lite",
        rpm = 15,
    )

    reply = chat(c;
        system = "You are a helpful assistant.",
        messages = [(:user, "Say hi.")],
        max_tokens = 64,
    )
    @show reply.text reply.cost_usd reply.input_tokens reply.output_tokens

messages accepts Msg, (:user, "...") tuples, or :user => "..." pairs.
Roles :user/:assistant map to Gemini's user/model. system becomes the
request's systemInstruction.
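
As a rough illustration of that mapping, the sketch below turns tuples and pairs into the `contents` / `systemInstruction` layout of a `:generateContent` request. `ROLE_MAP`, `role_and_text`, and `build_body` are hypothetical names written for this example; the package's own body-builder may be organised differently, and `Msg` is left out because its fields are not described here.

```julia
# Hypothetical normalizer for illustration only.
const ROLE_MAP = Dict(:user => "user", :assistant => "model")

# Accept both (:user, "text") tuples and :user => "text" pairs.
role_and_text(m::Tuple{Symbol,AbstractString}) = (m[1], m[2])
role_and_text(m::Pair{Symbol,<:AbstractString}) = (m.first, m.second)

function build_body(messages; system::Union{Nothing,AbstractString} = nothing)
    # Each message becomes one entry with a Gemini role and a single text part.
    contents = map(messages) do m
        role, text = role_and_text(m)
        Dict("role" => ROLE_MAP[role], "parts" => [Dict("text" => text)])
    end
    body = Dict{String,Any}("contents" => contents)
    # The system prompt travels separately from the conversation turns.
    if system !== nothing
        body["systemInstruction"] = Dict("parts" => [Dict("text" => system)])
    end
    return body
end

body = build_body([(:user, "Say hi."), :assistant => "Hi!"]; system = "Be brief.")
```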
Gemini-3.x uses a thinking level; 2.5 uses a token budget. The client picks the right field from the model id:

    # gemini-3.x
    chat(c; messages=[(:user, "…")], max_tokens=512, thinking_level="minimal") # or low/medium/high

    # gemini-2.5-flash-lite
    chat(c; model="gemini-2.5-flash-lite", messages=[(:user, "…")],
         max_tokens=512, thinking_budget=512)

Thinking tokens are billed as output and are included in reply.output_tokens.
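
One way to picture the field selection is a small pure function keyed on the model id. This is a sketch under assumptions: `thinking_config` is a hypothetical name, the `startswith` prefix test is only a guess at how generations could be told apart, and `thinkingLevel` / `thinkingBudget` are the REST field names as best understood here.

```julia
# Hypothetical selector, not the package's implementation.
function thinking_config(model::AbstractString;
                         thinking_level = nothing, thinking_budget = nothing)
    if startswith(model, "gemini-3")
        # gemini-3.x: a named level; the budget keyword is ignored.
        thinking_level === nothing && return Dict{String,Any}()
        return Dict{String,Any}("thinkingLevel" => thinking_level)
    else
        # 2.5 and earlier: a token budget; the level keyword is ignored.
        thinking_budget === nothing && return Dict{String,Any}()
        return Dict{String,Any}("thinkingBudget" => thinking_budget)
    end
end

cfg3  = thinking_config("gemini-3.1-flash-lite"; thinking_level = "minimal")
cfg25 = thinking_config("gemini-2.5-flash-lite"; thinking_budget = 512)
```

In this sketch a mismatched keyword is dropped silently; a stricter client could raise an error instead.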
Pass a JSON schema Dict. It is wired as responseMimeType +
responseSchema — the shape v1beta's :generateContent accepts for both
gemini-3.x and 2.5:

    schema = Dict(
        "type" => "object",
        "properties" => Dict("steps" => Dict("type" => "array",
                                             "items" => Dict("type" => "string"))),
        "required" => ["steps"],
    )

    reply = chat(c;
        messages = [(:user, "Outline a talk on caching.")],
        max_tokens = 512,
        response_schema = schema,
    )

Note: the nested responseFormat shape (Vertex / v1alpha) is rejected by v1beta with HTTP 400, so this client always uses the responseMimeType + responseSchema pair for every model generation.
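
For orientation, the flat pair might sit in the request's `generationConfig` roughly as below. `structured_generation_config` is a hypothetical helper written for this example, not a function the package is known to export.

```julia
# Hypothetical builder showing the flat responseMimeType + responseSchema pair.
function structured_generation_config(schema::AbstractDict; max_tokens::Integer)
    return Dict{String,Any}(
        "maxOutputTokens"  => max_tokens,
        "responseMimeType" => "application/json",  # ask for JSON output
        "responseSchema"   => schema,              # schema sits beside it, not nested
    )
end

example_schema = Dict(
    "type" => "object",
    "properties" => Dict("steps" => Dict("type" => "array",
                                         "items" => Dict("type" => "string"))),
    "required" => ["steps"],
)

gen_cfg = structured_generation_config(example_schema; max_tokens = 512)
```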
Gemini does implicit caching automatically — there is no per-block marker.
Hits appear as reply.cached_read_tokens (billed at the discounted cache-read
rate); reply.cached_write_tokens is always 0. The cache flag on
Msg/SystemPrompt exists only for parity with AnthropicClient.jl and is
ignored.
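
The effect of cache hits on cost can be sketched with a little arithmetic. Everything here is illustrative: `Prices` and `cost_usd` are hypothetical names, the per-million-token prices are made-up placeholders rather than Gemini's actual rates, and the function takes the uncached input count explicitly instead of assuming how reply.input_tokens relates to reply.cached_read_tokens.

```julia
# Placeholder prices in USD per million tokens; NOT real Gemini pricing.
Base.@kwdef struct Prices
    input::Float64 = 0.10
    cached_read::Float64 = 0.025
    output::Float64 = 0.40
end

# Cache reads are charged at their own (lower) rate; output is assumed to
# already include thinking tokens.
function cost_usd(p::Prices; uncached_input::Integer, cached_read::Integer, output::Integer)
    total = uncached_input * p.input + cached_read * p.cached_read + output * p.output
    return total / 1_000_000
end

with_cache    = cost_usd(Prices(); uncached_input = 1_000, cached_read = 4_000, output = 500)
without_cache = cost_usd(Prices(); uncached_input = 5_000, cached_read = 0, output = 500)
```

With these placeholder rates the same 5,000-token prompt costs less when 4,000 of its tokens are served from the implicit cache.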
Identical to the sibling clients: chat_async shares one RPM budget; each
Reply carries token counts and cost_usd; Budget(c; max_usd=…) enforces a
cap; a keyless Client reports has_key(c) == false and chat throws (guard
with has_key). See the GroqClient.jl / AnthropicClient.jl READMEs — the
APIs match.
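
The idea of a single RPM budget shared by concurrent calls can be sketched as a lock-protected sliding window. This is not the package's limiter: `SlidingWindow`, `try_acquire!`, and `acquire!` are hypothetical names, and the real client may track its window differently.

```julia
# Hypothetical limiter: at most `rpm` acquisitions in any `window`-second span.
struct SlidingWindow
    rpm::Int
    window::Float64
    stamps::Vector{Float64}   # times of recent acquisitions, oldest first
    lock::ReentrantLock       # guards `stamps` across tasks
end
SlidingWindow(rpm; window = 60.0) = SlidingWindow(rpm, window, Float64[], ReentrantLock())

# Returns 0.0 and records the call if a slot is free; otherwise returns the
# number of seconds until the oldest recorded call leaves the window.
function try_acquire!(sw::SlidingWindow, now::Float64 = time())
    lock(sw.lock) do
        filter!(t -> now - t < sw.window, sw.stamps)
        if length(sw.stamps) < sw.rpm
            push!(sw.stamps, now)
            return 0.0
        end
        return sw.window - (now - first(sw.stamps))
    end
end

# Blocking variant: sleep until a slot opens, then take it.
function acquire!(sw::SlidingWindow)
    while true
        wait_s = try_acquire!(sw)
        wait_s == 0.0 && return
        sleep(wait_s)
    end
end
```

Every task calling `acquire!` on the same `SlidingWindow` draws from one budget, which is the property the shared per-client limiter is described as having.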
has_key only tells you a key string is set, not that it works. Two live
probes go further — both make minimal real calls (a few output tokens) and
never throw:

    hc = healthcheck(c)          # one minimal call, classified
    hc.ok, hc.status             # e.g. (true, :ok) or (false, :billing)

    sp = speedtest(c; n = 5)     # n concurrent calls under the rpm cap
    sp.throughput_rps, sp.latency_median_ms

healthcheck returns a HealthStatus whose status is one of :ok,
:no_key, :auth, :quota, :billing, :bad_request, :server,
:network, :error — enough for a dashboard to show green/red and say why.
speedtest returns a SpeedResult (ok / rate-limited / failed counts, achieved
throughput_rps, and min/median/max latency). Both short-circuit on a keyless
client.
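
A dashboard could consume the status symbols along these lines. The sketch is self-contained and does not call the package; `STATUS_HINTS` and `dashboard_cell` are hypothetical names, and the hint wording is one reading of what each symbol means rather than text the package provides.

```julia
# Hypothetical mapping from status symbol to an operator-facing hint.
const STATUS_HINTS = Dict(
    :ok          => "healthy",
    :no_key      => "no API key configured",
    :auth        => "key was rejected",
    :quota       => "rate or quota limit hit",
    :billing     => "billing problem on the account",
    :bad_request => "request rejected as malformed",
    :server      => "provider-side error",
    :network     => "could not reach the API",
    :error       => "unclassified failure",
)

# Green only for :ok; anything else is red with the reason attached.
function dashboard_cell(status::Symbol)
    color = status === :ok ? :green : :red
    return (color = color, hint = get(STATUS_HINTS, status, "unrecognised status"))
end

cell = dashboard_cell(:billing)
```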

    julia --project=. -e 'using Pkg; Pkg.instantiate(); Pkg.test()'

All tests are pure-function / wiring-only, with no live API calls.
- Streaming (:streamGenerateContent)
- Multimodal inputs (image / PDF parts)
- Explicit caching (Caches API) for guaranteed cache savings
- Tool use / function calling
- peleh.ai — academic paper to slide deck.
MIT. See LICENSE.