A self-hosted, natural-language personal calendar. You talk to a Telegram bot (text or voice); an LLM agent translates your message into calendar operations through an MCP (Model Context Protocol) server backed by Supabase; a Next.js PWA gives you a Today/Week/Month view of the same data with offline support and push notifications.
    Telegram (text / voice)
             │
             ▼
    ┌─────────────────┐   tool calls   ┌──────────────────┐
    │ agent/          │ ─────────────► │ mcp-server/      │
    │ grammY bot      │   (MCP HTTP)   │ 6 calendar tools │
    │ OpenAI LLM      │ ◄───────────── │ stdio + HTTP     │
    │ Whisper (voice) │    results     └────────┬─────────┘
    └─────────────────┘                         │ service role
                                                ▼
                                       ┌──────────────────┐
                                       │ Supabase         │
                                       │ Postgres + RLS   │
                                       │ Auth + Realtime  │
                                       └────────┬─────────┘
                                                │ anon key + RLS
                                                ▼
                                       ┌──────────────────┐
                                       │ pwa/             │
                                       │ Next.js viewer   │
                                       │ offline + push   │
                                       └──────────────────┘
| Path | Purpose |
|---|---|
| `mcp-server/` | Calendar MCP server (Node/TypeScript). Exposes 6 tools over stdio and HTTP transports; owns all database writes. |
| `agent/` | Telegram agent: grammY long-polling bot + OpenAI tool-calling loop + Whisper voice transcription. Talks to the MCP server over HTTP. |
| `pwa/` | Next.js calendar viewer (Today/Week/Month), installable PWA with an IndexedDB cache, Supabase Realtime sync, soft delete, and web push. |
| `supabase/migrations/` | Database schema: events, tool invocation log, RLS policies, views, realtime, push subscriptions. |
| `evals/` | Regression eval harness for the agent (30 fixture cases, mock mode + optional LLM-as-judge). |
| `scripts/` | Operator CLIs: MCP smoke tester, prompt-dev REPL, OpenAI smoke suite, eval cleanup. |
| `docker/` | Dockerfiles (mcp, agent, evals) + local/eval compose stacks. |
| `deploy/jetson/` | Production deployment to a Jetson Nano: compose, systemd units, deploy/rollback/health-check scripts. |
| `docs/OPERATIONS.md` | Operations runbook for the production appliance. |
- You message the bot on Telegram, either as text or as a voice note (transcribed via OpenAI Whisper). Access is restricted to an allowlist of Telegram user IDs.
- The agent resolves intent and time. A system prompt (`agent/src/system-prompt.ts`) instructs the LLM to convert relative expressions ("tomorrow at 3", "next Thursday") into absolute ISO 8601 timestamps in your timezone, ask for clarification only when information is genuinely missing, and otherwise call tools directly with zero confirmation round-trips.
- Tools execute against the MCP server. Six tools cover the calendar surface: `add_event`, `list_events`, `update_event`, `delete_event`, `find_free_slot`, `check_availability`. Updates and deletes accept either a UUID or a fuzzy title reference ("my dentist appointment"); ambiguity is returned to the LLM, which asks you which one you meant. Deletes are soft (a `deleted_at` timestamp), and every tool invocation is logged to an audit table.
- The PWA reads the same data with the Supabase anon key under row-level security. It caches events in IndexedDB (Dexie) for offline use, subscribes to Supabase Realtime for live updates, and can deliver web push notifications.
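The UUID-or-fuzzy-title behavior described above can be illustrated with a minimal sketch. It is not the code in `mcp-server/`: the `CalendarEvent` and `Resolution` shapes, the filler-word list, and the keyword-matching rule are all assumptions made for illustration. The point is only that the resolver returns ambiguity as structured data instead of guessing.

```typescript
// Hypothetical shapes; the real tool contracts live in mcp-server/.
interface CalendarEvent {
  id: string;
  title: string;
}

type Resolution =
  | { kind: "match"; event: CalendarEvent }
  | { kind: "ambiguous"; candidates: CalendarEvent[] }
  | { kind: "not_found" };

const UUID_RE =
  /^[0-9a-f]{8}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{12}$/i;

// Assumed filler words, dropped so "my dentist appointment" can match "Dentist".
const FILLER = new Set(["my", "the", "a", "an", "appointment", "event"]);

function keywords(text: string): string[] {
  return text
    .toLowerCase()
    .split(/[\s.,!?'"]+/)
    .filter((word) => word.length > 0 && !FILLER.has(word));
}

function resolveEventRef(ref: string, events: CalendarEvent[]): Resolution {
  // Exact path: the reference is a UUID.
  if (UUID_RE.test(ref)) {
    const event = events.find((e) => e.id.toLowerCase() === ref.toLowerCase());
    return event ? { kind: "match", event } : { kind: "not_found" };
  }
  // Fuzzy path: every keyword of the reference must appear in the title.
  const wanted = keywords(ref);
  if (wanted.length === 0) return { kind: "not_found" };
  const candidates = events.filter((e) => {
    const title = keywords(e.title);
    return wanted.every((word) => title.includes(word));
  });
  if (candidates.length === 1) return { kind: "match", event: candidates[0] };
  if (candidates.length > 1) return { kind: "ambiguous", candidates };
  return { kind: "not_found" };
}
```

Returning an `ambiguous` result with the candidate list leaves the follow-up question to the model, which matches the flow described in the list above.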
The agent mirrors your language — write in English, Turkish, or anything the model handles, and it replies in kind.
LLM behavior here is treated as a production system, not a prompt pasted into an API call. The practices below are what keep a non-deterministic component shippable.
The system prompt lives in version control (agent/src/system-prompt.ts) and is snapshot-tested: any change fails CI until the snapshot is explicitly regenerated (vitest -u), so prompt drift is always a reviewed diff, never an accident. On top of the snapshot, behavioral assertions pin each policy individually — the past-time guard, the single-number-hour table, the did-you-mean fallback, the "never claim success unless the tool succeeded" rule. Removing a rule fails a named test, not a vibe check.
    edit system-prompt.ts
         │
         ▼
    pnpm chat          interactive REPL replicating the production pipeline
         │             (real OpenAI + isolated MCP container + real DB,
         │              /reload re-reads the prompt without a rebuild)
         ▼
    vitest -u          review + accept the snapshot diff deliberately
         │
         ▼
    pnpm eval:ci       30-case regression suite, mock mode (CI, no LLM cost)
         │
         ▼
    pnpm smoke:openai  live end-to-end scenarios against the real model
         │
         ▼
    deploy
The eval fixtures (evals/fixtures/regression-cases.json) encode behaviors that once regressed in real usage — typo normalization, ambiguous bare hours, past-time clarification, fuzzy delete ambiguity — so a prompt "improvement" cannot silently reintroduce an old failure.
- Deterministic tests: the OpenAI SDK is mocked; the tool-calling loop, session handling, and Telegram surface are covered by 160+ unit/integration tests. The MCP tool contracts are tested against a real Postgres.
- Mock-mode evals (CI): the eval runner replays each fixture and scores tool selection and arguments against the `tool_invocations` audit table, not against fragile response text. Deterministic, free, runs on every push (`.github/workflows/eval.yml`).
- Live smoke: `pnpm smoke:openai` runs real-model scenarios end-to-end before deploys. Model changes are gated by it: the production model was chosen after repeated all-green runs, not by benchmark folklore. An LLM-as-judge stage for response quality (separate Anthropic judge, spend isolated via `EVAL_OPENAI_API_KEY`) is scaffolded behind explicit env gates.
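As a rough illustration of scoring against audit rows rather than response text, the sketch below compares a fixture's expected tool calls with the invocations that were recorded. The `ExpectedCall` and `ToolInvocation` shapes and the `scoreCase` name are hypothetical; the actual fixture format in `evals/fixtures/regression-cases.json` and the real runner may differ.

```typescript
// Hypothetical shapes for illustration only.
interface ToolInvocation {
  tool: string;
  input: Record<string, unknown>;
}

interface ExpectedCall {
  tool: string;
  // Only the arguments listed here are checked (subset match).
  args: Record<string, unknown>;
}

interface Score {
  passed: boolean;
  failures: string[];
}

function scoreCase(expected: ExpectedCall[], actual: ToolInvocation[]): Score {
  const failures: string[] = [];
  if (expected.length !== actual.length) {
    failures.push(
      `expected ${expected.length} tool call(s), saw ${actual.length}`,
    );
  }
  expected.forEach((exp, index) => {
    const got = actual[index];
    if (!got) return; // already reported by the length check
    if (got.tool !== exp.tool) {
      failures.push(`call ${index}: expected ${exp.tool}, saw ${got.tool}`);
      return;
    }
    for (const [key, value] of Object.entries(exp.args)) {
      // JSON comparison is enough for flat arguments; it is key-order
      // sensitive for nested objects.
      if (JSON.stringify(got.input[key]) !== JSON.stringify(value)) {
        failures.push(`call ${index}: argument "${key}" does not match`);
      }
    }
  });
  return { passed: failures.length === 0, failures };
}
```

Because the comparison only looks at tool names and arguments, a rephrased reply from the model does not change the score, which is what keeps this kind of check deterministic.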
The LLM can only emit tool calls; it holds no credentials and writes nothing directly. The MCP server is the narrow waist: every input is Zod-validated, fuzzy references are resolved in code (ambiguity is returned to the model as structured data, which then asks the user), deletes are soft, and past-time writes are rejected server-side regardless of what the prompt resolved. Every tool invocation is audited to a tool_invocations table (input, output, error, duration, model) — which doubles as the ground truth the eval harness scores against.
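A minimal sketch of the two server-side guarantees named above, the past-time guard and the audit trail, might look like the following. It uses hand-written checks and an in-memory array as stand-ins for the Zod schemas and the `tool_invocations` table; `audited`, `validateAddEvent`, and the `AuditRow` field names are hypothetical.

```typescript
// In-memory stand-in for the tool_invocations table (assumed field names).
interface AuditRow {
  tool: string;
  input: unknown;
  output: unknown;
  error: string | null;
  durationMs: number;
}

const auditLog: AuditRow[] = [];

// Wraps a tool handler so every call is recorded, whether it succeeds or throws.
function audited<I, O>(tool: string, handler: (input: I) => O): (input: I) => O {
  return (input: I): O => {
    const started = Date.now();
    try {
      const output = handler(input);
      auditLog.push({ tool, input, output, error: null, durationMs: Date.now() - started });
      return output;
    } catch (err) {
      const message = err instanceof Error ? err.message : String(err);
      auditLog.push({ tool, input, output: null, error: message, durationMs: Date.now() - started });
      throw err;
    }
  };
}

interface AddEventInput {
  title: string;
  start_at: string; // ISO 8601 timestamp with offset
}

// Hand-rolled validation standing in for a schema library.
function validateAddEvent(input: AddEventInput, now: Date): AddEventInput {
  if (input.title.trim() === "") throw new Error("title must not be empty");
  const start = new Date(input.start_at);
  if (Number.isNaN(start.getTime())) {
    throw new Error("start_at must be an ISO 8601 timestamp");
  }
  // Rejected here regardless of what the prompt resolved.
  if (start.getTime() < now.getTime()) throw new Error("start_at is in the past");
  return input;
}
```

Wrapping the handler rather than calling the logger inside each tool means a failed validation still leaves an audit row, so rejected calls remain visible to anything that reads the log.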
The orchestration is ~50 lines of explicit code over the vanilla OpenAI SDK — no agent framework. The tool-call loop is capped (OPENAI_MAX_TOOL_ITERATIONS), MCP calls time out at 30s, per-user history is truncated pair-preservingly (SESSION_MAX_MESSAGES) so context can't grow unboundedly, and the Telegram allowlist rejects unknown users before any model call. Failure behavior is designed, not emergent: when information is complete the agent acts with zero confirmation round-trips; when it's missing the agent asks instead of inventing; timeouts produce an honest "couldn't reach the server" rather than a hallucinated success.
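One plausible reading of pair-preserving truncation is that history is only ever cut at the start of a user turn, so a tool result is never kept without the assistant tool call that produced it. The sketch below implements that reading; it is an assumption about the intent, not the agent's actual code, and `ChatMessage` and `truncateHistory` are hypothetical names.

```typescript
type Role = "system" | "user" | "assistant" | "tool";

interface ChatMessage {
  role: Role;
  content: string;
}

// Keeps at most maxMessages of the most recent history, but only cuts at a
// user message so no turn (user -> tool calls -> tool results -> reply) is split.
function truncateHistory(
  messages: ChatMessage[],
  maxMessages: number,
): ChatMessage[] {
  if (messages.length <= maxMessages) return messages;
  let start = messages.length - maxMessages;
  // Advance to the next user message. The result may be shorter than
  // maxMessages, and is empty if no user message remains in the window.
  while (start < messages.length && messages[start].role !== "user") {
    start++;
  }
  return messages.slice(start);
}
```

The trade-off in this sketch is that the kept window can be smaller than the cap, which is usually preferable to sending the model a tool result with no matching call.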
Every write is tagged with its origin (MCP_SOURCE: telegram / pwa / smoke / eval / import), and test traffic runs under dedicated user UUIDs — evals and smoke runs can never touch production calendar data, and eval artifacts are cleaned up by pnpm eval:cleanup.
- Node.js ≥ 22, pnpm ≥ 9
- A Supabase project (cloud — no local Postgres needed)
- A Telegram bot token from @BotFather
- An OpenAI API key (chat completions + Whisper)
- Optional: Docker (for the containerized stack and the prompt-dev REPL), a Vercel account (for hosting the PWA)
Clone the repository and install dependencies:

    git clone <this repo>
    cd calendar-assistant
    pnpm install

- Create a Supabase project and note the project ref.
- Apply the migrations in order: open the Dashboard SQL Editor (`https://supabase.com/dashboard/project/<your-ref>/sql/new`) and run each file in `supabase/migrations/` (they are timestamp-ordered). Alternatively, link the Supabase CLI (`supabase init`, `supabase link --project-ref <ref>`) and use `pnpm db:push`.
- Create your user: in Authentication, create a user (email magic link is what the PWA login uses). Copy the user's UUID; this is your `USER_ID`. All calendar rows are scoped to it.

Then create your environment file:

    cp .env.example .env

Fill in `.env` at the repo root. Every variable is documented inline in `.env.example`; summary:
| Variable | Used by | Purpose |
|---|---|---|
| `SUPABASE_URL` | mcp-server, agent | Project API URL (`https://<ref>.supabase.co`). |
| `SUPABASE_ANON_KEY` | tests, PWA | Public anon key. |
| `SUPABASE_SERVICE_ROLE_KEY` | mcp-server | Server-side key; bypasses RLS. Never expose to the PWA. |
| `USER_ID` | mcp-server | The auth user UUID all events belong to. |
| `USER_TIMEZONE` | mcp-server, agent | IANA timezone for natural-language time resolution (default `Europe/Istanbul`). |
| `TELEGRAM_BOT_TOKEN` | agent | Token from @BotFather. |
| `TELEGRAM_ALLOW_USER_IDS` | agent | Comma-separated numeric Telegram user IDs allowed to use the bot. |
| `OPENAI_API_KEY` | agent | Chat completions + Whisper transcription. |
| `OPENAI_MODEL`, `OPENAI_TEMPERATURE`, `OPENAI_MAX_TOOL_ITERATIONS` | agent | Model knobs (defaults: `gpt-4.1-mini`, `0.3`, `8`). |
| `MCP_URL`, `MCP_REQUEST_TIMEOUT_MS` | agent | MCP endpoint (default `http://127.0.0.1:3001/mcp`; compose overrides to `http://mcp:3001/mcp`). |
| `SESSION_MAX_MESSAGES`, `LOG_LEVEL` | agent | Chat-history cap per user; pino log level. |
| `NEXT_PUBLIC_SUPABASE_URL`, `NEXT_PUBLIC_SUPABASE_ANON_KEY` | PWA | Browser-safe mirrors of the Supabase URL/anon key. |
| `NEXT_PUBLIC_VAPID_PUBLIC_KEY`, `VAPID_PRIVATE_KEY`, `VAPID_SUBJECT` | PWA | Web push keys; generate with `npx --yes web-push generate-vapid-keys --json`. |
| `ALLOW_CLOUD_TESTS` | tests | Safety latch: integration tests refuse to run against a non-local `SUPABASE_URL` unless set to `1`. |
| `EVAL_*` | evals | Eval harness mode, isolated API keys, judge toggle; see Evals. |
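The `ALLOW_CLOUD_TESTS` latch in the table can be pictured as a guard that runs before any integration test touches the database. The sketch below is an assumption about how such a check could work, not the project's test setup: the function names are hypothetical, and "local" is taken to mean `localhost`, `127.0.0.1`, or `[::1]`.

```typescript
// Assumed definition of "local": loopback hosts only.
const LOCAL_URL_RE = /^https?:\/\/(localhost|127\.0\.0\.1|\[::1\])(:\d+)?(\/|$)/i;

function isLocalSupabaseUrl(url: string): boolean {
  return LOCAL_URL_RE.test(url);
}

// Throws unless the target is local or the latch is explicitly set to "1".
// The environment is passed in so the check is easy to test in isolation.
function assertSafeTestTarget(env: Record<string, string | undefined>): void {
  const url = env.SUPABASE_URL;
  if (!url) throw new Error("SUPABASE_URL is not set");
  if (!isLocalSupabaseUrl(url) && env.ALLOW_CLOUD_TESTS !== "1") {
    throw new Error(
      "Refusing to run integration tests against a non-local SUPABASE_URL; " +
        "set ALLOW_CLOUD_TESTS=1 to opt in.",
    );
  }
}
```

A guard like this would typically be called once from a global test setup file, for example with the process environment as its argument.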
The full per-variable reference for the agent lives in agent/README.md; PWA-specific setup (magic-link auth config, Vercel flow) lives in pwa/README.md.
Build the MCP server and start its HTTP transport:

    pnpm --filter @personal-calendar/mcp-server build
    node mcp-server/dist/transport/http.js   # HTTP transport on :3001

Verify it end-to-end without any LLM using the smoke CLI:

    pnpm smoke list-tools
    pnpm smoke call add_event '{"title":"test","start_at":"2026-06-05T18:00:00+03:00"}'
    pnpm smoke call list_events '{"range_start":"2026-06-01T00:00:00+03:00","range_end":"2026-06-30T23:59:59+03:00"}'
    pnpm smoke call delete_event '{"event_ref":"test","hard_delete":true}'

Run the agent locally (node):

    pnpm --filter @personal-calendar/agent build
    node agent/dist/index.js

Or run both services as containers:

    docker compose -f docker/docker-compose.local.yml up --build

Telegram allows exactly one long-poll client per bot token. If the stack also runs somewhere else (e.g. the production appliance), stop one before starting the other, or the bot becomes flaky with 409 conflicts.
Message your bot: "dentist tomorrow at 14:00" → Added: … ✓. Send a voice note saying the same thing — it goes through Whisper and lands in the same pipeline.
Start the PWA dev server:

    pnpm dev   # next dev on :3000

Log in with the email of the Supabase auth user you created (magic link). Deploy with Vercel (`pnpm vercel deploy`; the CLI is a workspace devDependency), and set the `NEXT_PUBLIC_*` and VAPID variables in the Vercel project. Details: `pwa/README.md`.
Create the bot. Talk to @BotFather → /newbot → copy the token into TELEGRAM_BOT_TOKEN.
Restrict access. The agent hard-rejects anyone not in TELEGRAM_ALLOW_USER_IDS. Get your numeric ID by messaging a bot like @userinfobot, then set e.g. TELEGRAM_ALLOW_USER_IDS=123456789 (comma-separate multiple IDs).
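For illustration, an allowlist check of this kind can be as small as the sketch below. The function names are hypothetical and the agent's real parsing may differ; the sketch only assumes the comma-separated numeric format described above.

```typescript
// Parses a comma-separated list such as "123456789, 987654321".
// Telegram user IDs fit comfortably in a JavaScript number.
function parseAllowlist(raw: string | undefined): Set<number> {
  const ids = new Set<number>();
  for (const part of (raw ?? "").split(",")) {
    const trimmed = part.trim();
    if (trimmed === "") continue; // tolerate trailing commas and blanks
    if (!/^\d+$/.test(trimmed)) {
      throw new Error(`invalid Telegram user ID: ${trimmed}`);
    }
    ids.add(Number(trimmed));
  }
  return ids;
}

// An update without a sender ID is treated as not allowed.
function isAllowed(userId: number | undefined, allowlist: Set<number>): boolean {
  return userId !== undefined && allowlist.has(userId);
}
```

Failing loudly on a malformed entry at startup is a deliberate choice in this sketch: a typo in the variable then stops the bot instead of silently locking out (or letting in) the wrong user.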
Customize behavior. The agent's entire personality and decision policy live in one file: `agent/src/system-prompt.ts`. It defines the decision procedure (when to act vs. when to ask), time-resolution rules (e.g. a bare "at 3" is assumed to be 15:00, while "at 9" triggers a "9 am or 9 pm?" question), typo handling, and response formatting. The prompt is snapshot-tested: after editing it, run

    pnpm --filter @personal-calendar/agent test -- -u   # review the snapshot diff deliberately

Iterate on the prompt without redeploying. `pnpm chat` starts a REPL that replicates the production pipeline (real OpenAI + a dedicated smoke MCP container on :3003 + real Supabase, scoped to a reserved smoke user and cleaned up on exit):

    pnpm chat
    # /reload   re-read agent/src/system-prompt.ts (no rebuild)
    # /reset    clear conversation history
    # /quit     exit

    # Batch mode for scripted scenarios
    printf 'add dentist tomorrow at 14:00\nwhat do I have this week\n' | pnpm chat --json
    pnpm chat --message "meeting Friday at 10:00" --json
    pnpm chat --keep --message "..."   # reuse the container across runs
    pnpm chat:down                     # stop it

Do not run `pnpm chat` and `pnpm smoke:openai` simultaneously: they share the smoke user, port 3003, and container name.
Tune the model. OPENAI_MODEL, OPENAI_TEMPERATURE, and OPENAI_MAX_TOOL_ITERATIONS are plain env vars — no code changes needed to try a different model.
Change the timezone. Set USER_TIMEZONE for the agent/MCP server. Note the system prompt's examples assume UTC+3 (Europe/Istanbul); if you move far away, adjust the prompt's datetime section to match.
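When checking whether the prompt's UTC+3 examples still fit your `USER_TIMEZONE`, it can help to see what offset an IANA zone actually has at a given instant. The sketch below uses only the built-in `Intl.DateTimeFormat`; `offsetMinutes` and `toIsoInZone` are hypothetical helper names, not functions from this repository.

```typescript
// Offset of timeZone from UTC at the given instant, in minutes (180 for UTC+3).
function offsetMinutes(instant: Date, timeZone: string): number {
  const parts = new Intl.DateTimeFormat("en-US", {
    timeZone,
    hour12: false,
    year: "numeric",
    month: "2-digit",
    day: "2-digit",
    hour: "2-digit",
    minute: "2-digit",
    second: "2-digit",
  }).formatToParts(instant);
  const get = (type: string): number =>
    Number(parts.find((p) => p.type === type)?.value ?? "0");
  // Read the zone's wall-clock fields as if they were UTC; the difference
  // from the real instant is the offset. Some runtimes print midnight as 24.
  const wallClockAsUtc = Date.UTC(
    get("year"),
    get("month") - 1,
    get("day"),
    get("hour") % 24,
    get("minute"),
    get("second"),
  );
  return Math.round((wallClockAsUtc - instant.getTime()) / 60000);
}

// Formats an instant as ISO 8601 local time with an explicit offset.
function toIsoInZone(instant: Date, timeZone: string): string {
  const offset = offsetMinutes(instant, timeZone);
  const shifted = new Date(instant.getTime() + offset * 60000);
  const sign = offset < 0 ? "-" : "+";
  const abs = Math.abs(offset);
  const pad = (n: number): string => String(n).padStart(2, "0");
  const local = shifted.toISOString().slice(0, 19);
  return `${local}${sign}${pad(Math.floor(abs / 60))}:${pad(abs % 60)}`;
}
```

For example, `toIsoInZone(new Date("2026-06-05T15:00:00Z"), "Europe/Istanbul")` yields `2026-06-05T18:00:00+03:00`, the same shape as the timestamps used in the smoke CLI examples.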
Run the test suites:

    pnpm -r test            # all workspaces (agent, mcp-server, pwa, evals)
    pnpm run test:scripts   # script-layer unit tests

- `mcp-server` integration tests run against your cloud Supabase and require `ALLOW_CLOUD_TESTS=1` in `.env` (they use isolated fixtures and clean up after themselves).
- `agent` tests are fully mocked (no network) and enforce coverage thresholds.
- `evals/tests/runner.test.ts` and `pnpm smoke:openai` need Docker (they spin up an MCP container).
evals/ is a regression harness for agent behavior: 30 fixture conversations (evals/fixtures/regression-cases.json) covering add/list/update/delete/availability, typos, ambiguous hours, and past-time handling.
Run the harness:

    pnpm eval:ci        # mock mode: validates tool selection + arguments, no LLM cost
    pnpm eval:cleanup   # remove eval artifacts from the database

Mock mode is the default (`EVAL_MODE=mock`) and is what CI runs (`.github/workflows/eval.yml`). Full mode (real LLM + Claude-based response judge, gated behind `EVAL_OPENAI_API_KEY` / `ANTHROPIC_API_KEY` / `EVAL_JUDGE_ENABLED=1`) is scaffolded, but the judge client is intentionally still a stub.
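The env gating described above could be resolved along the lines of the sketch below. It is only a guess at the shape of such logic: `resolveEvalConfig` is a hypothetical name, and `full` as the non-mock value of `EVAL_MODE` is an assumption, since only `mock` is documented here.

```typescript
type EvalMode = "mock" | "full";

interface EvalConfig {
  mode: EvalMode;
  judgeEnabled: boolean;
}

// The environment is passed in explicitly so the gating is easy to test.
function resolveEvalConfig(env: Record<string, string | undefined>): EvalConfig {
  // Anything other than the assumed "full" value falls back to mock mode.
  const mode: EvalMode = env.EVAL_MODE === "full" ? "full" : "mock";
  if (mode === "mock") return { mode, judgeEnabled: false };

  // Full mode spends real money, so it requires the isolated eval key.
  if (!env.EVAL_OPENAI_API_KEY) {
    throw new Error("full mode requires EVAL_OPENAI_API_KEY");
  }
  // The judge needs both an explicit opt-in and its own API key.
  const judgeEnabled =
    env.EVAL_JUDGE_ENABLED === "1" && Boolean(env.ANTHROPIC_API_KEY);
  return { mode, judgeEnabled };
}
```

Defaulting to mock on any unrecognized value keeps an unset or mistyped variable from triggering paid model calls.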
The reference deployment runs 24/7 on a Jetson Nano under a systemd-managed Docker Compose stack:
- `agent` (long-poll + OpenAI + MCP HTTP + Whisper) and `mcp` containers on a private bridge network.
- Survives reboots (~200s from `sudo reboot` to fully active), restarts on crash, log rotation capped at 10MB×3 per container, weekly Docker prune.
- Images are built natively on the device (arm64); no registry required.
Setup history and runbook: deploy/jetson/README.md. Day-2 operations (health checks, recovery, backup posture): docs/OPERATIONS.md. Any Docker host works the same way via docker/docker-compose.local.yml — Jetson is just where this instance lives.