Spawnfile/calendar-assistant

Personal Calendar Assistant

A self-hosted, natural-language personal calendar. You talk to a Telegram bot (text or voice); an LLM agent translates your message into calendar operations through an MCP (Model Context Protocol) server backed by Supabase; a Next.js PWA gives you a Today/Week/Month view of the same data with offline support and push notifications.

 Telegram (text / voice)
        │
        ▼
 ┌─────────────────┐   tool calls    ┌──────────────────┐
 │  agent/          │ ─────────────► │  mcp-server/      │
 │  grammY bot      │   (MCP HTTP)   │  6 calendar tools │
 │  OpenAI LLM      │ ◄───────────── │  stdio + HTTP     │
 │  Whisper (voice) │    results     └────────┬─────────┘
 └─────────────────┘                          │ service role
                                              ▼
                                     ┌──────────────────┐
                                     │  Supabase         │
                                     │  Postgres + RLS   │
                                     │  Auth + Realtime  │
                                     └────────┬─────────┘
                                              │ anon key + RLS
                                              ▼
                                     ┌──────────────────┐
                                     │  pwa/             │
                                     │  Next.js viewer   │
                                     │  offline + push   │
                                     └──────────────────┘

Repository layout

| Path | Purpose |
| --- | --- |
| mcp-server/ | Calendar MCP server (Node/TypeScript). Exposes 6 tools over stdio and HTTP transports; owns all database writes. |
| agent/ | Telegram agent: grammY long-polling bot + OpenAI tool-calling loop + Whisper voice transcription. Talks to the MCP server over HTTP. |
| pwa/ | Next.js calendar viewer (Today/Week/Month), installable PWA with an IndexedDB cache, Supabase Realtime sync, soft delete, and web push. |
| supabase/migrations/ | Database schema: events, tool invocation log, RLS policies, views, realtime, push subscriptions. |
| evals/ | Regression eval harness for the agent (30 fixture cases, mock mode + optional LLM-as-judge). |
| scripts/ | Operator CLIs: MCP smoke tester, prompt-dev REPL, OpenAI smoke suite, eval cleanup. |
| docker/ | Dockerfiles (mcp, agent, evals) + local/eval compose stacks. |
| deploy/jetson/ | Production deployment to a Jetson Nano: compose, systemd units, deploy/rollback/health-check scripts. |
| docs/OPERATIONS.md | Operations runbook for the production appliance. |

How it works

  1. You message the bot on Telegram — text, or a voice note (transcribed via OpenAI Whisper). Access is restricted to an allowlist of Telegram user IDs.
  2. The agent resolves intent and time. A system prompt (agent/src/system-prompt.ts) instructs the LLM to convert relative expressions ("tomorrow at 3", "next Thursday") into absolute ISO 8601 timestamps in your timezone, ask for clarification only when information is genuinely missing, and otherwise call tools directly with zero confirmation round-trips.
  3. Tools execute against the MCP server. Six tools cover the calendar surface: add_event, list_events, update_event, delete_event, find_free_slot, check_availability. Updates and deletes accept either a UUID or a fuzzy title reference ("my dentist appointment"); ambiguity is returned to the LLM, which asks you which one you meant. Deletes are soft (a deleted_at timestamp), and every tool invocation is logged to an audit table.
  4. The PWA reads the same data with the Supabase anon key under row-level security. It caches events in IndexedDB (Dexie) for offline use, subscribes to Supabase Realtime for live updates, and can deliver web push notifications.
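The reference resolution described in step 3 can be pictured roughly as follows. This is a minimal sketch, not the server's actual code: the `CalendarEvent` shape, the `resolveEventRef` name, and the case-insensitive substring match are all assumptions made for illustration, and the real matching may be considerably fuzzier.

```typescript
// Hypothetical event shape; only the fields this sketch needs.
interface CalendarEvent {
  id: string;
  title: string;
  deleted_at: string | null; // soft-delete marker: non-null means deleted
}

type Resolution =
  | { kind: "match"; event: CalendarEvent }
  | { kind: "ambiguous"; candidates: CalendarEvent[] }
  | { kind: "not_found" };

const UUID_RE = /^[0-9a-f]{8}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{12}$/i;

// Resolve a UUID or a title fragment to exactly one live event,
// or report the ambiguity as data so the caller can ask the user.
function resolveEventRef(ref: string, events: CalendarEvent[]): Resolution {
  const live = events.filter((e) => e.deleted_at === null);
  const needle = ref.trim().toLowerCase();
  if (needle === "") return { kind: "not_found" };

  if (UUID_RE.test(needle)) {
    const hit = live.find((e) => e.id.toLowerCase() === needle);
    return hit ? { kind: "match", event: hit } : { kind: "not_found" };
  }

  // Assumption: plain substring matching stands in for real fuzzy matching.
  const candidates = live.filter((e) => e.title.toLowerCase().includes(needle));
  const [first, ...rest] = candidates;
  if (first === undefined) return { kind: "not_found" };
  return rest.length === 0
    ? { kind: "match", event: first }
    : { kind: "ambiguous", candidates };
}
```

Returning a tagged union rather than throwing keeps the ambiguous case a normal result, which is what lets the LLM turn it into a "which one did you mean?" question.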

The agent mirrors your language — write in English, Turkish, or anything the model handles, and it replies in kind.

Prompt & LLM behavior management

LLM behavior here is treated as a production system, not a prompt pasted into an API call. The practices below are what keep a non-deterministic component shippable.

The prompt is a tested artifact

The system prompt lives in version control (agent/src/system-prompt.ts) and is snapshot-tested: any change fails CI until the snapshot is explicitly regenerated (vitest -u), so prompt drift is always a reviewed diff, never an accident. On top of the snapshot, behavioral assertions pin each policy individually — the past-time guard, the single-number-hour table, the did-you-mean fallback, the "never claim success unless the tool succeeded" rule. Removing a rule fails a named test, not a vibe check.
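A behavioral assertion of this kind can be as small as checking that each policy still leaves a recognizable trace in the prompt text. The sketch below is illustrative only: `POLICY_MARKERS`, the marker phrases, and `missingPolicies` are hypothetical names, and the project's real tests run under vitest against the actual prompt.

```typescript
// Hypothetical mapping from a policy name to a phrase the prompt must contain.
const POLICY_MARKERS: Record<string, RegExp> = {
  "past-time guard": /in the past/i,
  "no false success": /never claim success/i,
};

// Return the names of policies whose marker is absent from the prompt.
// An empty result means every pinned policy is still present.
function missingPolicies(
  prompt: string,
  markers: Record<string, RegExp>,
): string[] {
  return Object.entries(markers)
    .filter(([, pattern]) => !pattern.test(prompt))
    .map(([name]) => name);
}
```

Reporting the missing policy by name is what makes a failure read as "the past-time guard was removed" rather than as an opaque snapshot mismatch.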

Prompt changes go through a pipeline

 edit system-prompt.ts
   │
   ▼
 pnpm chat            interactive REPL replicating the production pipeline
   │                  (real OpenAI + isolated MCP container + real DB,
   │                   /reload re-reads the prompt without a rebuild)
   ▼
 vitest -u            review + accept the snapshot diff deliberately
   │
   ▼
 pnpm eval:ci         30-case regression suite, mock mode (CI, no LLM cost)
   │
   ▼
 pnpm smoke:openai    live end-to-end scenarios against the real model
   │
   ▼
 deploy

The eval fixtures (evals/fixtures/regression-cases.json) encode behaviors that once regressed in real usage — typo normalization, ambiguous bare hours, past-time clarification, fuzzy delete ambiguity — so a prompt "improvement" cannot silently reintroduce an old failure.

Three layers of verification for non-deterministic behavior

  1. Deterministic tests — the OpenAI SDK is mocked; the tool-calling loop, session handling, and Telegram surface are covered by 160+ unit/integration tests. The MCP tool contracts are tested against a real Postgres.
  2. Mock-mode evals (CI) — the eval runner replays each fixture and scores tool selection and arguments against the tool_invocations audit table, not against fragile response text. Deterministic, free, runs on every push (.github/workflows/eval.yml).
  3. Live smoke: pnpm smoke:openai runs real-model scenarios end-to-end before deploys. Model changes are gated by it: the production model was chosen after repeated all-green runs, not by benchmark folklore. An LLM-as-judge stage for response quality (separate Anthropic judge, spend isolated via EVAL_OPENAI_API_KEY) is scaffolded behind explicit env gates.
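Scoring against the audit table rather than response text might look like the sketch below. The `ExpectedCall` and `AuditRow` shapes and the `scoreCase` function are assumptions for illustration; the real fixture format and runner may differ. Here expected arguments are treated as a subset: every expected key must match the logged input, while extra logged keys are ignored.

```typescript
// Hypothetical shapes: what a fixture expects vs. what the audit table logged.
interface ExpectedCall {
  tool: string;
  args: Record<string, unknown>;
}
interface AuditRow {
  tool: string;
  input: Record<string, unknown>;
  error: string | null;
}

// Every expected key must equal the logged value (compared as JSON).
function argsMatch(
  expected: Record<string, unknown>,
  actual: Record<string, unknown>,
): boolean {
  return Object.keys(expected).every(
    (key) => JSON.stringify(expected[key]) === JSON.stringify(actual[key]),
  );
}

// Compare tool selection and arguments in call order; collect named failures.
function scoreCase(
  expected: ExpectedCall[],
  audit: AuditRow[],
): { pass: boolean; failures: string[] } {
  const failures: string[] = [];
  if (expected.length !== audit.length) {
    failures.push(`expected ${expected.length} tool calls, saw ${audit.length}`);
  }
  expected.forEach((want, i) => {
    const got = audit[i];
    if (!got) return; // already reported by the length check
    if (got.tool !== want.tool) {
      failures.push(`call ${i}: expected ${want.tool}, saw ${got.tool}`);
    } else if (!argsMatch(want.args, got.input)) {
      failures.push(`call ${i}: arguments differ for ${want.tool}`);
    }
  });
  return { pass: failures.length === 0, failures };
}
```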

The model is never trusted with the database

The LLM can only emit tool calls; it holds no credentials and writes nothing directly. The MCP server is the narrow waist: every input is Zod-validated, fuzzy references are resolved in code (ambiguity is returned to the model as structured data, which then asks the user), deletes are soft, and past-time writes are rejected server-side regardless of what the prompt resolved. Every tool invocation is audited to a tool_invocations table (input, output, error, duration, model) — which doubles as the ground truth the eval harness scores against.
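The server-side past-time rejection can be sketched as a plain validation step that runs no matter what timestamp the model produced. This is an assumption-laden illustration, not the server's code: `validateStartAt` and its error strings are hypothetical, and the actual server validates inputs with Zod.

```typescript
type Validation =
  | { ok: true; startAt: Date }
  | { ok: false; error: string };

// ISO 8601 date-time with seconds and an explicit offset ("Z" or +hh:mm).
const ISO_WITH_OFFSET =
  /^\d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2}(\.\d+)?(Z|[+-]\d{2}:\d{2})$/;

// Reject malformed or past timestamps before anything touches the database.
// `now` is injected so the check is deterministic in tests.
function validateStartAt(raw: string, now: Date): Validation {
  if (!ISO_WITH_OFFSET.test(raw)) {
    return { ok: false, error: "start_at must be ISO 8601 with a UTC offset" };
  }
  const startAt = new Date(raw);
  if (Number.isNaN(startAt.getTime())) {
    return { ok: false, error: "start_at is not a valid date" };
  }
  if (startAt.getTime() < now.getTime()) {
    return { ok: false, error: "start_at is in the past" };
  }
  return { ok: true, startAt };
}
```

Requiring an explicit offset removes any dependence on the server's local timezone when comparing against the current time.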

Bounded loop, bounded blast radius

The orchestration is ~50 lines of explicit code over the vanilla OpenAI SDK — no agent framework. The tool-call loop is capped (OPENAI_MAX_TOOL_ITERATIONS), MCP calls time out at 30s, per-user history is truncated pair-preservingly (SESSION_MAX_MESSAGES) so context can't grow unboundedly, and the Telegram allowlist rejects unknown users before any model call. Failure behavior is designed, not emergent: when information is complete the agent acts with zero confirmation round-trips; when it's missing the agent asks instead of inventing; timeouts produce an honest "couldn't reach the server" rather than a hallucinated success.
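One way to read "pair-preserving" truncation: when old messages are dropped, the kept window must not begin in the middle of an exchange, for example on a tool result whose originating assistant tool call was cut off. The sketch below is a simplified illustration under that reading; `ChatMessage` and `truncateHistory` are hypothetical, and the system prompt is assumed to be prepended separately.

```typescript
// Simplified message shape; the real one follows the OpenAI chat format.
interface ChatMessage {
  role: "user" | "assistant" | "tool";
  content: string;
}

// Keep at most `maxMessages` of the most recent messages, then advance the
// cut point to the next user turn so no exchange is split in half.
// The result may therefore be shorter than `maxMessages`.
function truncateHistory(
  history: ChatMessage[],
  maxMessages: number,
): ChatMessage[] {
  let start = Math.max(0, history.length - maxMessages);
  while (start < history.length && history[start]?.role !== "user") {
    start++;
  }
  return history.slice(start);
}
```

Cutting only at user turns is a conservative choice: it can drop a few more messages than strictly necessary, but the window handed to the model always starts with a complete exchange.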

Traffic isolation

Every write is tagged with its origin (MCP_SOURCE: telegram / pwa / smoke / eval / import), and test traffic runs under dedicated user UUIDs — evals and smoke runs can never touch production calendar data, and eval artifacts are cleaned up by pnpm eval:cleanup.
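A sketch of how origin tagging and test-user isolation could be enforced in code. The five source values come from the list above; everything else (`parseSource`, `assertIsolated`, and treating `smoke` and `eval` as the test sources) is an assumption for illustration.

```typescript
const SOURCES = ["telegram", "pwa", "smoke", "eval", "import"] as const;
type Source = (typeof SOURCES)[number];

// Assumption: these two origins are the ones that carry test traffic.
const TEST_SOURCES: ReadonlySet<Source> = new Set<Source>(["smoke", "eval"]);

// Accept only the known origin tags; anything else is a configuration error.
function parseSource(raw: string | undefined): Source {
  const hit = SOURCES.find((s) => s === raw);
  if (!hit) {
    throw new Error(`MCP_SOURCE must be one of ${SOURCES.join(", ")}`);
  }
  return hit;
}

// Refuse to run test traffic under the production user's UUID.
function assertIsolated(
  source: Source,
  userId: string,
  productionUserId: string,
): void {
  if (TEST_SOURCES.has(source) && userId === productionUserId) {
    throw new Error(`${source} traffic must not run as the production user`);
  }
}
```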

Getting started

Prerequisites

  • Node.js ≥ 22, pnpm ≥ 9
  • A Supabase project (cloud — no local Postgres needed)
  • A Telegram bot token from @BotFather
  • An OpenAI API key (chat completions + Whisper)
  • Optional: Docker (for the containerized stack and the prompt-dev REPL), a Vercel account (for hosting the PWA)

1. Install

git clone <this repo>
cd calendar-assistant
pnpm install

2. Set up Supabase

  1. Create a Supabase project and note the project ref.
  2. Apply the migrations in order: open the Dashboard SQL Editor (https://supabase.com/dashboard/project/<your-ref>/sql/new) and run each file in supabase/migrations/ (they are timestamp-ordered). Alternatively, link the Supabase CLI (supabase init, supabase link --project-ref <ref>) and use pnpm db:push.
  3. Create your user: in Authentication, create a user (email magic link is what the PWA login uses). Copy the user's UUID — this is your USER_ID. All calendar rows are scoped to it.

3. Configure the environment

cp .env.example .env

Fill in .env at the repo root. Every variable is documented inline in .env.example; summary:

| Variable | Used by | Purpose |
| --- | --- | --- |
| SUPABASE_URL | mcp-server, agent | Project API URL (https://<ref>.supabase.co). |
| SUPABASE_ANON_KEY | tests, PWA | Public anon key. |
| SUPABASE_SERVICE_ROLE_KEY | mcp-server | Server-side key; bypasses RLS. Never expose to the PWA. |
| USER_ID | mcp-server | The auth user UUID all events belong to. |
| USER_TIMEZONE | mcp-server, agent | IANA timezone for natural-language time resolution (default Europe/Istanbul). |
| TELEGRAM_BOT_TOKEN | agent | Token from @BotFather. |
| TELEGRAM_ALLOW_USER_IDS | agent | Comma-separated numeric Telegram user IDs allowed to use the bot. |
| OPENAI_API_KEY | agent | Chat completions + Whisper transcription. |
| OPENAI_MODEL, OPENAI_TEMPERATURE, OPENAI_MAX_TOOL_ITERATIONS | agent | Model knobs (defaults: gpt-4.1-mini, 0.3, 8). |
| MCP_URL, MCP_REQUEST_TIMEOUT_MS | agent | MCP endpoint (default http://127.0.0.1:3001/mcp; compose overrides to http://mcp:3001/mcp). |
| SESSION_MAX_MESSAGES, LOG_LEVEL | agent | Chat-history cap per user; pino log level. |
| NEXT_PUBLIC_SUPABASE_URL, NEXT_PUBLIC_SUPABASE_ANON_KEY | PWA | Browser-safe mirrors of the Supabase URL/anon key. |
| NEXT_PUBLIC_VAPID_PUBLIC_KEY, VAPID_PRIVATE_KEY, VAPID_SUBJECT | PWA | Web push keys; generate with npx --yes web-push generate-vapid-keys --json. |
| ALLOW_CLOUD_TESTS | tests | Safety latch: integration tests refuse to run against a non-local SUPABASE_URL unless set to 1. |
| EVAL_* | evals | Eval harness mode, isolated API keys, judge toggle; see Evals. |
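To make the defaults above concrete, here is a minimal sketch of how the agent-side variables could be read and validated. It is not the agent's real loader: `AgentConfig`, `loadAgentConfig`, and the helper names are hypothetical, and only the variable names and default values are taken from the table.

```typescript
type Env = Record<string, string | undefined>;

// Hypothetical config shape covering a subset of the agent variables.
interface AgentConfig {
  openaiApiKey: string;
  model: string;
  temperature: number;
  maxToolIterations: number;
  mcpUrl: string;
  userTimezone: string;
}

// Fail fast when a mandatory variable is missing or empty.
function required(env: Env, name: string): string {
  const value = env[name];
  if (!value) throw new Error(`${name} is required`);
  return value;
}

// Parse a numeric variable, falling back to the default when unset.
function numberFrom(env: Env, name: string, fallback: number): number {
  const raw = env[name];
  if (raw === undefined || raw.trim() === "") return fallback;
  const value = Number(raw);
  if (!Number.isFinite(value)) {
    throw new Error(`${name} must be a number, got "${raw}"`);
  }
  return value;
}

function loadAgentConfig(env: Env): AgentConfig {
  return {
    openaiApiKey: required(env, "OPENAI_API_KEY"),
    model: env["OPENAI_MODEL"] || "gpt-4.1-mini",
    temperature: numberFrom(env, "OPENAI_TEMPERATURE", 0.3),
    maxToolIterations: numberFrom(env, "OPENAI_MAX_TOOL_ITERATIONS", 8),
    mcpUrl: env["MCP_URL"] || "http://127.0.0.1:3001/mcp",
    userTimezone: env["USER_TIMEZONE"] || "Europe/Istanbul",
  };
}
```

In a Node process this would typically be called once at startup with `process.env`, so a typo in `.env` surfaces as a startup error rather than as odd model behavior later.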

The full per-variable reference for the agent lives in agent/README.md; PWA-specific setup (magic-link auth config, Vercel flow) lives in pwa/README.md.

4. Run the MCP server

pnpm --filter @personal-calendar/mcp-server build
node mcp-server/dist/transport/http.js   # HTTP transport on :3001

Verify it end-to-end without any LLM using the smoke CLI:

pnpm smoke list-tools
pnpm smoke call add_event '{"title":"test","start_at":"2026-06-05T18:00:00+03:00"}'
pnpm smoke call list_events '{"range_start":"2026-06-01T00:00:00+03:00","range_end":"2026-06-30T23:59:59+03:00"}'
pnpm smoke call delete_event '{"event_ref":"test","hard_delete":true}'

5. Run the Telegram agent

Local (node):

pnpm --filter @personal-calendar/agent build
node agent/dist/index.js

Or run both services as containers:

docker compose -f docker/docker-compose.local.yml up --build

Telegram allows exactly one long-poll client per bot token. If the stack also runs somewhere else (e.g. the production appliance), stop one before starting the other, or the bot becomes flaky with 409 conflicts.

Message your bot: "dentist tomorrow at 14:00" → Added: … ✓. Send a voice note saying the same thing — it goes through Whisper and lands in the same pipeline.

6. Run the PWA

pnpm dev          # next dev on :3000

Log in with the email of the Supabase auth user you created (magic link). Deploy with Vercel (pnpm vercel deploy — the CLI is a workspace devDependency); set the NEXT_PUBLIC_* and VAPID variables in the Vercel project. Details: pwa/README.md.

Telegram bot setup & customization

Create the bot. Talk to @BotFather → /newbot → copy the token into TELEGRAM_BOT_TOKEN.

Restrict access. The agent hard-rejects anyone not in TELEGRAM_ALLOW_USER_IDS. Get your numeric ID by messaging a bot like @userinfobot, then set e.g. TELEGRAM_ALLOW_USER_IDS=123456789 (comma-separate multiple IDs).
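The allowlist check can be sketched as below. `parseAllowlist` and `isAllowed` are hypothetical helpers written for illustration; the real agent performs the check inside its grammY bot before any model call.

```typescript
// Parse "123456789, 42" into a set of numeric Telegram user IDs.
// Non-numeric entries are a configuration error; an empty string yields an
// empty set, which allows nobody (fail closed).
function parseAllowlist(raw: string): Set<number> {
  const parts = raw
    .split(",")
    .map((part) => part.trim())
    .filter((part) => part !== "");
  const ids = parts.map((part) => {
    if (!/^\d+$/.test(part)) {
      throw new Error(
        `TELEGRAM_ALLOW_USER_IDS contains a non-numeric ID: "${part}"`,
      );
    }
    return Number(part);
  });
  return new Set(ids);
}

// Updates without a sender ID are rejected along with unknown users.
function isAllowed(
  allowlist: ReadonlySet<number>,
  userId: number | undefined,
): boolean {
  return userId !== undefined && allowlist.has(userId);
}
```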

Customize behavior. The agent's entire personality and decision policy live in one file: agent/src/system-prompt.ts. It defines the decision procedure (when to act vs. when to ask), time-resolution rules (e.g. a bare "at 3" is assumed to be 15:00, while "at 9" triggers a "9 am or 9 pm?" question), typo handling, and response formatting. The prompt is snapshot-tested: after editing it, run

pnpm --filter @personal-calendar/agent test -- -u   # review the snapshot diff deliberately

Iterate on the prompt without redeploying. pnpm chat starts a REPL that replicates the production pipeline (real OpenAI + a dedicated smoke MCP container on :3003 + real Supabase, scoped to a reserved smoke user and cleaned up on exit):

pnpm chat
#   /reload  re-read agent/src/system-prompt.ts (no rebuild)
#   /reset   clear conversation history
#   /quit    exit

# Batch mode for scripted scenarios
printf 'add dentist tomorrow at 14:00\nwhat do I have this week\n' | pnpm chat --json
pnpm chat --message "meeting Friday at 10:00" --json
pnpm chat --keep --message "..."   # reuse the container across runs
pnpm chat:down                     # stop it

Do not run pnpm chat and pnpm smoke:openai simultaneously — they share the smoke user, port 3003, and container name.

Tune the model. OPENAI_MODEL, OPENAI_TEMPERATURE, and OPENAI_MAX_TOOL_ITERATIONS are plain env vars — no code changes needed to try a different model.

Change the timezone. Set USER_TIMEZONE for the agent/MCP server. Note the system prompt's examples assume UTC+3 (Europe/Istanbul); if you move far away, adjust the prompt's datetime section to match.
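To check whether your zone's offset differs from the +03:00 the prompt examples assume, something like the following can help. It is a small standalone sketch using the built-in `Intl.DateTimeFormat`; `utcOffsetMinutes` and `formatOffset` are hypothetical helpers, not part of the repository.

```typescript
// Offset of `timeZone` from UTC, in minutes, at the instant `at`.
// Works by formatting the instant in the target zone and re-reading the
// wall-clock fields as if they were UTC.
function utcOffsetMinutes(timeZone: string, at: Date): number {
  const parts = new Intl.DateTimeFormat("en-US", {
    timeZone,
    hour12: false,
    year: "numeric",
    month: "2-digit",
    day: "2-digit",
    hour: "2-digit",
    minute: "2-digit",
    second: "2-digit",
  }).formatToParts(at);
  const get = (type: string): number =>
    Number(parts.find((part) => part.type === type)?.value);
  const wallClockAsUtc = Date.UTC(
    get("year"),
    get("month") - 1,
    get("day"),
    get("hour") % 24, // some engines render midnight as "24"
    get("minute"),
    get("second"),
  );
  return Math.round((wallClockAsUtc - at.getTime()) / 60000);
}

// Render an offset in minutes as an ISO 8601 suffix such as "+03:00".
function formatOffset(minutes: number): string {
  const sign = minutes < 0 ? "-" : "+";
  const abs = Math.abs(minutes);
  const hh = String(Math.floor(abs / 60)).padStart(2, "0");
  const mm = String(abs % 60).padStart(2, "0");
  return `${sign}${hh}:${mm}`;
}

const istanbul = formatOffset(
  utcOffsetMinutes("Europe/Istanbul", new Date("2026-06-05T12:00:00Z")),
);
console.log(istanbul); // "+03:00"
```

If your own USER_TIMEZONE prints something other than +03:00, the prompt's datetime examples are the part to revisit.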

Testing

pnpm -r test            # all workspaces (agent, mcp-server, pwa, evals)
pnpm run test:scripts   # script-layer unit tests

  • mcp-server integration tests run against your cloud Supabase and require ALLOW_CLOUD_TESTS=1 in .env (they use isolated fixtures and clean up after themselves).
  • agent tests are fully mocked (no network) and enforce coverage thresholds.
  • evals/tests/runner.test.ts and pnpm smoke:openai need Docker (they spin up an MCP container).
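The ALLOW_CLOUD_TESTS latch mentioned in the first bullet could be implemented along these lines. This is a sketch under assumptions: `cloudTestsAllowed` is a hypothetical name, and treating only `localhost` and `127.0.0.1` as "local" is a guess at what the real check considers non-local.

```typescript
// Assumption: these hostnames are what counts as a local Supabase.
const LOCAL_HOSTS: ReadonlySet<string> = new Set(["localhost", "127.0.0.1"]);

// Integration tests may run when the database is local, or when the
// operator has explicitly opted in with ALLOW_CLOUD_TESTS=1.
function cloudTestsAllowed(
  supabaseUrl: string,
  allowCloudTests: string | undefined,
): boolean {
  const host = new URL(supabaseUrl).hostname;
  return LOCAL_HOSTS.has(host) || allowCloudTests === "1";
}
```

A test setup file could call this once and abort the run with a clear message when it returns false, rather than letting individual tests fail midway.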

Evals

evals/ is a regression harness for agent behavior: 30 fixture conversations (evals/fixtures/regression-cases.json) covering add/list/update/delete/availability, typos, ambiguous hours, and past-time handling.

pnpm eval:ci        # mock mode — validates tool selection + arguments, no LLM cost
pnpm eval:cleanup   # remove eval artifacts from the database

Mock mode is the default (EVAL_MODE=mock) and is what CI runs (.github/workflows/eval.yml). Full mode (real LLM + Claude-based response judge, gated behind EVAL_OPENAI_API_KEY / ANTHROPIC_API_KEY / EVAL_JUDGE_ENABLED=1) is scaffolded but the judge client is intentionally still a stub.

Production deployment

The reference deployment runs 24/7 on a Jetson Nano under a systemd-managed Docker Compose stack:

  • agent (long-poll + OpenAI + MCP HTTP + Whisper) and mcp containers on a private bridge network.
  • Survives reboots (~200s from sudo reboot to fully active), restarts on crash, log rotation capped at 10MB×3 per container, weekly Docker prune.
  • Images are built natively on the device (arm64) — no registry required.

Setup history and runbook: deploy/jetson/README.md. Day-2 operations (health checks, recovery, backup posture): docs/OPERATIONS.md. Any Docker host works the same way via docker/docker-compose.local.yml — Jetson is just where this instance lives.
