feat(memory): recall and store memory around chat completions#190
feat(memory): recall and store memory around chat completions#190dpoulopoulos wants to merge 3 commits into
Conversation
WalkthroughAdds opt-in persistent memory to the gateway's platform-mode chat completions path. A new ChangesPlatform Memory Integration
Estimated code review effort🎯 4 (Complex) | ⏱️ ~50 minutes Possibly related PRs
Suggested reviewers
🚥 Pre-merge checks | ✅ 4 | ❌ 1❌ Failed checks (1 warning)
✅ Passed checks (4 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches📝 Generate docstrings
🧪 Generate unit tests (beta)
✨ Simplify code
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
There was a problem hiding this comment.
Actionable comments posted: 3
🧹 Nitpick comments (1)
tests/unit/test_memory_platform.py (1)
25-80: 🧹 Nitpick | 🔵 Trivial | ⚡ Quick winAdd regression tests for malformed timeout config values.
Please add tests where
memory_recall_timeout_ms/memory_remember_timeout_msare non-numeric, so best-effort semantics stay protected against config drift.Also applies to: 86-120
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@tests/unit/test_memory_platform.py` around lines 25 - 80, Add new regression test cases to verify that non-numeric values for `memory_recall_timeout_ms` and `memory_remember_timeout_ms` configuration parameters are handled gracefully. Create test functions that pass malformed timeout values (such as strings or None) through the _config() helper function and verify that the _recall_platform_memory function still executes successfully without raising exceptions, following the same pattern as the existing test functions like test_recall_non_200_yields_no_facts and test_recall_timeout_is_swallowed to ensure best-effort semantics are protected against configuration drift.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@src/gateway/api/routes/_pipeline.py`:
- Around line 1461-1462: The asyncio.create_task call for on_settled_text is
fire-and-forget and lacks exception handling, which causes unhandled task
exceptions to leak runtime warnings. Add a done callback to the task created by
asyncio.create_task that suppresses any exceptions raised by on_settled_text,
ensuring memory persistence failures remain silent and non-disruptive. Reference
the asyncio.create_task line with on_settled_text and add a callback using
add_done_callback that catches and silently ignores any exceptions from the
completed task.
In `@src/gateway/api/routes/_platform.py`:
- Around line 602-603: The timeout parsing on line 602 and line 639 using
int(config.platform.get("memory_recall_timeout_ms", 2000)) can raise ValueError
exceptions if the config value is not a valid integer, breaking the "never block
chat" contract. For both occurrences where memory_recall_timeout_ms is parsed,
wrap the int() conversion in a try-except block to safely handle invalid input,
catching ValueError and returning the default fallback value of 2000 when
parsing fails. This ensures misconfiguration does not cause chat functionality
to break.
In `@src/gateway/api/routes/chat.py`:
- Around line 361-364: The recall operation in the memory check block lacks
error handling, which means any exception raised by _recall_platform_memory
(such as invalid timeout parsing in its config handling) will cause the entire
request to fail. Wrap the _recall_platform_memory call and the subsequent
inject_memory_facts call in a try-except block so that any exceptions are caught
and logged, allowing the request to proceed without memory injection rather than
breaking the completion entirely. This ensures memory recall failures remain
best-effort and do not disrupt provider dispatch.
---
Nitpick comments:
In `@tests/unit/test_memory_platform.py`:
- Around line 25-80: Add new regression test cases to verify that non-numeric
values for `memory_recall_timeout_ms` and `memory_remember_timeout_ms`
configuration parameters are handled gracefully. Create test functions that pass
malformed timeout values (such as strings or None) through the _config() helper
function and verify that the _recall_platform_memory function still executes
successfully without raising exceptions, following the same pattern as the
existing test functions like test_recall_non_200_yields_no_facts and
test_recall_timeout_is_swallowed to ensure best-effort semantics are protected
against configuration drift.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
Run ID: eee16f2c-1ddb-4b1b-a076-e50f57f07112
📒 Files selected for processing (8)
src/gateway/api/routes/_helpers.pysrc/gateway/api/routes/_pipeline.pysrc/gateway/api/routes/_platform.pysrc/gateway/api/routes/chat.pysrc/gateway/core/config.pytests/unit/test_memory_inject.pytests/unit/test_memory_platform.pytests/unit/test_stream_memory_capture.py
dpoulopoulos
left a comment
There was a problem hiding this comment.
Adds opt-in persistent memory to platform-mode chat completions: a new memory_enabled flag (off by default) gates a best-effort recall-before-dispatch + remember-after flow. Message helpers inject recalled facts into the system prompt and build the turn to store; two platform clients (_recall_platform_memory, _remember_platform_memory) talk to /gateway/memory/{recall,remember}; a streaming wrapper accumulates assistant text and fires a fire-and-forget remember on normal completion. Both streaming and non-streaming paths are wired. Standalone mode is unaffected.
0aee46f to
f98dbd5
Compare
|
Thanks for the careful review. Pushed fixups addressing the best-effort gaps:
Two points I left as-is, with reasoning:
ruff, mypy (strict), and the full unit suite are green. |
dpoulopoulos
left a comment
There was a problem hiding this comment.
Adds opt-in, best-effort persistent memory to platform-mode chat completions, gated by a new memory_enabled config flag (OTARI_MEMORY_ENABLED, default off). Before dispatch the gateway recalls facts from the platform (/gateway/memory/recall) and injects them into the system prompt; after a completion it stores the new turn fire-and-forget (/gateway/memory/remember) for both non-streaming (background task) and streaming (a capture wrapper that accumulates assistant text and persists only on normal completion). Standalone mode and the /messages and /responses endpoints are unaffected. This is a re-review after the author addressed the previous round of feedback.
Prior feedback status: resolved. All actionable items from the previous review (CodeRabbit's 3 plus the prior [major]/[minor] review's 5) are addressed in the current code: the narrow (TimeoutException, NetworkError) catch was broadened to httpx.HTTPError in both recall and remember; a _coerce_timeout_ms helper guards timeout parsing; the recall call site in chat.py is wrapped in a best-effort try/except; the streaming create_task now has an exception-swallowing done-callback; and regression tests for malformed timeouts, generic httpx errors, and a raising settled-callback were added. Two low-priority items the prior review explicitly flagged as by-design/low-risk remain open (recall latency, multimodal-system-content edge case). Verified: ruff, mypy (strict), and all 32 new unit tests pass. No web UI exists for this backend feature, so Playwright verification is not applicable.
| memory_user_token = ctx.user_token if (config.memory_enabled and ctx.platform_mode) else None | ||
| if memory_user_token: | ||
| try: | ||
| recalled = await _recall_platform_memory(config, memory_user_token, latest_user_text(request.messages)) |
There was a problem hiding this comment.
[minor] Recall is awaited synchronously before dispatch, so under a slow/degraded memory service it adds up to memory_recall_timeout_ms (default 2000ms) to time-to-first-token on every memory-enabled platform request. This is inherent to recall-before-dispatch (the facts must be in the prompt before the call) and is a reasonable trade-off, but the operator-facing latency budget isn't documented anywhere: memory_recall_timeout_ms / memory_remember_timeout_ms are read from the config.platform dict but aren't surfaced in the memory_enabled field description, config.example.yml, or the docs. Worth documenting the latency budget (and perhaps a more aggressive recall default than 2s) so operators understand the cost of enabling it. Carried over from the prior review; non-blocking.
| # Memory recall (platform mode, best-effort). Inject recalled facts into the system | ||
| # message before dispatch so every downstream path (stream/non-stream, tool/no-tool) | ||
| # sees them. A failure or a disabled workspace yields no facts and never blocks chat. | ||
| memory_user_token = ctx.user_token if (config.memory_enabled and ctx.platform_mode) else None |
There was a problem hiding this comment.
[minor] No integration test exercises the wired memory flow in chat_completions. The unit tests cover the helpers, the platform client, and the streaming-capture wrapper in isolation, but the actual wiring here, gating on config.memory_enabled and ctx.platform_mode, recall then inject_memory_facts, the best-effort try/except, _schedule_memory_remember on the non-streaming path (line 474), and on_memory_settled on the streaming path (line 405), is untested end to end. An integration test in tests/integration/test_platform_mode_chat.py (memory off = no platform memory calls; memory on = recall injected and remember fired) would lock in this behavior and catch wiring regressions the unit tests can't see.
| out = list(messages) | ||
| if out and isinstance(out[0], dict) and out[0].get("role") == "system": | ||
| existing = out[0].get("content") or "" | ||
| out[0] = {**out[0], "content": f"{existing}\n\n{block}" if existing else block} |
There was a problem hiding this comment.
[nit] When the first message is a system message whose content is a list of content parts (a multimodal system prompt), f"{existing}\n\n{block}" stringifies the list via its Python repr, corrupting the system content. System content is almost always a plain string, and this exactly mirrors the existing inject_purpose_hints helper (which has the same edge case), so it's low-risk and consistent, just flagging the shared gap. Carried over from the prior review.
| completed = True | ||
| finally: | ||
| if completed and parts: | ||
| task = asyncio.create_task(on_settled_text("".join(parts))) |
There was a problem hiding this comment.
[nit] asyncio.create_task(...) keeps no strong reference; only add_done_callback is attached. Per the asyncio docs the event loop holds only a weak reference to tasks, so a detached task can in theory be garbage-collected before it finishes. In practice the remember runs on the next loop tick and this mirrors the existing _report_platform_usage create_task pattern, so it's consistent with the codebase and low-risk. The exception-leak concern from the prior review is fully resolved by _swallow_memory_task_exception; this is just the remaining GC subtlety. Optional: hold the task in a set and discard it in the done-callback.
A master switch (default off) for persistent memory in platform mode. When off, the gateway makes no memory calls and behavior is unchanged; per-workspace enablement is still controlled by the platform. Refs #189 Signed-off-by: Dimitris Poulopoulos <dimitris.a.poulopoulos@gmail.com>
…pers _recall_platform_memory / _remember_platform_memory call the platform's /gateway/memory/* endpoints, modeled on the usage reporter: they swallow timeouts and errors so a slow or unavailable memory service never breaks a chat. inject_memory_facts prepends recalled facts to the system message (mirroring inject_purpose_hints); build_remember_messages assembles the minimal new turn to store. Helpers are unit-tested and not yet wired into the request path. Refs #189 Signed-off-by: Dimitris Poulopoulos <dimitris.a.poulopoulos@gmail.com>
Gated by OTARI_MEMORY_ENABLED in platform mode: recall runs once before dispatch and injects facts into the system message (covering every path, streaming and not), and the new turn is stored fire-and-forget afterwards. Non-streaming uses a background task; streaming uses _stream_with_memory_capture to accumulate the assistant text and remember only on normal completion (skipped on disconnect). The capture wrapper is unit-tested. No behavior change when the flag is off. Closes #189 Signed-off-by: Dimitris Poulopoulos <dimitris.a.poulopoulos@gmail.com>
f98dbd5 to
7c5fb36
Compare
|
Pushed fixups for the second-round feedback (folded into their original commits):
Left as-is (both flagged low-risk/by-design in review):
ruff, mypy (strict), and the memory + config test suites are green. |
There was a problem hiding this comment.
Actionable comments posted: 1
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@src/gateway/core/config.py`:
- Around line 410-415: The PLATFORM_MEMORY_RECALL_TIMEOUT_MS and
PLATFORM_MEMORY_REMEMBER_TIMEOUT_MS configuration entries are registered with
int casting, which causes invalid environment values to raise an exception
during config load and prevent the best-effort fallback behavior in
_coerce_timeout_ms from working. Change the type casting for these two entries
from int to str to defer validation, allowing the configuration to load
successfully and let _coerce_timeout_ms handle the validation and fallback to
defaults when encountering non-numeric values.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
Run ID: 17d7ddae-5fb9-4684-b2c5-6dfddb10ecd6
📒 Files selected for processing (10)
src/gateway/api/routes/_helpers.pysrc/gateway/api/routes/_pipeline.pysrc/gateway/api/routes/_platform.pysrc/gateway/api/routes/chat.pysrc/gateway/core/config.pytests/integration/test_config_env_loading.pytests/integration/test_platform_mode_chat.pytests/unit/test_memory_inject.pytests/unit/test_memory_platform.pytests/unit/test_stream_memory_capture.py
🚧 Files skipped from review as they are similar to previous changes (6)
- tests/unit/test_memory_platform.py
- tests/unit/test_memory_inject.py
- src/gateway/api/routes/_helpers.py
- src/gateway/api/routes/_platform.py
- src/gateway/api/routes/chat.py
- src/gateway/api/routes/_pipeline.py
| # Best-effort memory timeouts (platform mode). Recall is on the hot path | ||
| # (added to time-to-first-token); remember is fire-and-forget after the | ||
| # completion. Both fall back to their defaults on a missing or invalid | ||
| # value (see _coerce_timeout_ms in routes/_platform.py). | ||
| "PLATFORM_MEMORY_RECALL_TIMEOUT_MS": ("memory_recall_timeout_ms", int), | ||
| "PLATFORM_MEMORY_REMEMBER_TIMEOUT_MS": ("memory_remember_timeout_ms", int), |
There was a problem hiding this comment.
Handle invalid PLATFORM_MEMORY_*_TIMEOUT_MS values without crashing startup.
These vars are registered with int casting here, so a non-numeric env value raises during config load (Line 431) and prevents startup. That bypasses the intended best-effort fallback behavior in _coerce_timeout_ms(...).
Suggested fix
@@
- for env_name, (field_name, caster) in env_mappings.items():
+ memory_timeout_envs = {
+ "PLATFORM_MEMORY_RECALL_TIMEOUT_MS",
+ "PLATFORM_MEMORY_REMEMBER_TIMEOUT_MS",
+ }
+ for env_name, (field_name, caster) in env_mappings.items():
value = os.getenv(env_name)
if value is None or value == "":
continue
- platform[field_name] = caster(value)
+ try:
+ platform[field_name] = caster(value)
+ except ValueError:
+ if env_name in memory_timeout_envs:
+ # Let downstream _coerce_timeout_ms apply defaults.
+ platform[field_name] = value
+ continue
+ raise🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@src/gateway/core/config.py` around lines 410 - 415, The
PLATFORM_MEMORY_RECALL_TIMEOUT_MS and PLATFORM_MEMORY_REMEMBER_TIMEOUT_MS
configuration entries are registered with int casting, which causes invalid
environment values to raise an exception during config load and prevent the
best-effort fallback behavior in _coerce_timeout_ms from working. Change the
type casting for these two entries from int to str to defer validation, allowing
the configuration to load successfully and let _coerce_timeout_ms handle the
validation and fallback to defaults when encountering non-numeric values.
Description
Wires persistent memory into chat completions in platform mode: the gateway recalls relevant remembered context before the model answers and stores durable new facts afterward. It is off by default (
OTARI_MEMORY_ENABLED), best-effort, and covers both streaming and non-streaming completions.The point is that assistants built on the gateway feel continuous across sessions instead of stateless. The platform owns configuration, storage, privacy, and accounting (companion PR
mozilla-ai/otari-ai#1148); the gateway just brackets each completion with recall + remember.What's in it:
OTARI_MEMORY_ENABLEDmaster switch (default off → no behavior change for existing deployments).No memory behavior or added latency in standalone mode.
PR Type
Relevant issues
Closes #189
Checklist
tests/unit,tests/integration).make lint,make typecheck,make test).uv run python scripts/generate_openapi.py). (No contract change: memory is internal, the spec is unchanged.)AI Usage
AI Model/Tool used: Claude Code (Claude Opus 4.8)
Any additional AI details you'd like to share:
The memory integration was implemented end-to-end with Claude Code; the human author directed the work and reviews the result.
Overview
This pull request adds persistent memory functionality to the gateway's chat completion endpoints when operating in platform mode. The feature enables assistants to recall relevant facts from prior conversations and store new information for future sessions, improving the user experience by eliminating the need to repeat preferences and context across separate conversations.
What Changed
New Capability: Persistent Conversation Memory
Configuration & Control
OTARI_MEMORY_ENABLEDto control the feature globally (defaults to off)PLATFORM_MEMORY_RECALL_TIMEOUT_MS(2000ms default) andPLATFORM_MEMORY_REMEMBER_TIMEOUT_MS(10000ms default)Implementation Details
Code Changes
Benefits
Testing
Comprehensive unit and integration tests verify: