How good is Omi's memory? We measured it on the standard long-term-memory benchmarks — LoCoMo, LongMemEval, and BEAM — using the same evaluation protocol the memory vendors use, and published everything here: scores, per-question results, and the exact harness code.
| System | LoCoMo score (LLM judge) | Source |
|---|---|---|
| Zep | 58.4% | corrected replication |
| mem0 (paper, reproducible) | 66.9% | mem0 paper, Table 1 |
| Full-context baseline | 72.9% | mem0 paper |
| Letta (filesystem) | 74.0% | Letta blog |
| Omi | 86.6% | this repo, results/locomo_full_1540.json |
| mem0 Platform | 92.5% | vendor-reported, closed platform |
Per category (Omi): single-hop 90.5% · multi-hop 85.1% · temporal 84.7% · open-domain 63.5%.
LongMemEval (oracle variant, 36-question subset): 83.3% overall. Per category: single-session-user 100% · single-session-assistant 100% ·
knowledge-update 83% · multi-session 83% · temporal-reasoning 83% · preference 50%.
Results: results/longmemeval_oracle_36.json
BEAM: 55–62.5% depending on answering mode (62.5% via Omi's conversational agent, 55% via the
short-answer QA head). For reference, BEAM's LIGHT baseline scores ~34% at 1M scale, and the SOTA
system (Hindsight) scores ~73% at 100K. Results: results/beam_100k_40.json
The protocol mirrors how mem0/Zep/supermemory publish their numbers (a thin QA head over the memory system's retrieval API — not the consumer chat UI):
- Ingestion: every benchmark conversation is pushed through Omi's real open-source pipeline (`process_conversation`): LLM structuring, memory extraction, conversation vectors, and Omi's verbatim transcript-chunk index (BasedHardware/omi#7832).
- Retrieval: Omi's REST API, `POST /v1/tools/memories/search` (k=20) + `POST /v1/tools/conversations/search` (k=15) + `POST /v1/tools/conversations/search-chunks` (k=20, merged across 3 query reformulations).
- Answering: `gpt-4.1`, temp 0, with mem0's official LoCoMo answer prompt (answers of at most 5-6 words, relative→absolute date conversion), verbatim from `mem0ai/mem0/evaluation/prompts.py`.
- Judging: `gpt-4o-mini`, temp 0, with mem0's verbatim `ACCURACY_PROMPT` LLM judge (the de facto standard all vendors report with). LoCoMo category 5 (adversarial) is excluded, exactly as in mem0's own eval code.
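The "merged across 3 query reformulations" step in the retrieval bullet could look roughly like the sketch below. The source does not say how the harness merges results, so this shows one plausible approach (keep the best score per chunk, then take the top k). The function name `merge_hits` and the `id` and `score` fields are hypothetical and not taken from Omi's API.

```python
from typing import Dict, Iterable, List


def merge_hits(result_lists: Iterable[List[dict]], k: int = 20) -> List[dict]:
    """Merge hits from several query reformulations.

    Assumption: each hit is a dict with a unique "id" and a numeric "score"
    where higher is better. These field names are illustrative only.
    """
    best: Dict[str, dict] = {}
    for hits in result_lists:
        for hit in hits:
            prev = best.get(hit["id"])
            # Keep only the highest-scoring copy of each chunk.
            if prev is None or hit["score"] > prev["score"]:
                best[hit["id"]] = hit
    ranked = sorted(best.values(), key=lambda h: h["score"], reverse=True)
    return ranked[:k]


# Three made-up result lists, one per reformulated query.
runs = [
    [{"id": "c1", "score": 0.91}, {"id": "c2", "score": 0.40}],
    [{"id": "c2", "score": 0.75}, {"id": "c3", "score": 0.62}],
    [{"id": "c1", "score": 0.88}],
]
merged = merge_hits(runs, k=20)
print([h["id"] for h in merged])
```

Deduplicating before truncation means a chunk that matches all three reformulations uses only one of the k slots.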
Everything is in harness/: the QA head, judge, and ingestion scripts. Results
JSONs contain every question, gold answer, generated answer, and verdict.
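Because the result files hold a verdict for every question, the headline numbers can in principle be recomputed independently. The sketch below shows the idea on in-memory records. The field names `category` and `verdict` and the label `"CORRECT"` are assumptions for illustration; check the actual JSON schema in results/ before relying on them.

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple


def accuracy_by_category(
    records: Iterable[dict], excluded: Tuple[int, ...] = (5,)
) -> Tuple[float, Dict[int, float]]:
    """Return (overall accuracy, per-category accuracy).

    Assumption: each record has an integer "category" and a "verdict"
    string; both names are hypothetical. Category 5 (adversarial) is
    skipped by default, matching the exclusion described above.
    """
    totals: Dict[int, int] = defaultdict(int)
    correct: Dict[int, int] = defaultdict(int)
    for rec in records:
        cat = rec["category"]
        if cat in excluded:
            continue
        totals[cat] += 1
        if rec["verdict"] == "CORRECT":
            correct[cat] += 1
    per_category = {cat: correct[cat] / totals[cat] for cat in totals}
    n = sum(totals.values())
    overall = sum(correct.values()) / n if n else 0.0
    return overall, per_category


# Made-up records, not real benchmark output.
sample = [
    {"category": 1, "verdict": "CORRECT"},
    {"category": 1, "verdict": "WRONG"},
    {"category": 2, "verdict": "CORRECT"},
    {"category": 5, "verdict": "CORRECT"},  # excluded from scoring
]
overall, per_category = accuracy_by_category(sample)
print(round(overall, 3), per_category)
```

Note that the overall score is computed over questions, not as an average of category scores, so larger categories weigh more.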
- Vendor numbers vary by protocol. mem0's 92.5% comes from their closed managed platform (top_k=200, vendor-reported); their reproducible paper number is 66.9%. supermemory's ~90%+ claims are on LongMemEval (answer accuracy), which is a different benchmark from LoCoMo.
- Ingestion granularity. Omi ingests per-session conversations (its native unit); mem0/Zep ingest per message-pair. Per-session retrieval is somewhat easier.
- Dev/held-out split. Retrieval/prompt settings were tuned on a 75-question subset (89.3% there) and validated frozen on 120 unseen questions (77.5%) before the full run (86.6%). The four conversations never touched during tuning score 83–89%.
- LongMemEval is the oracle variant (evidence sessions only, no distractor haystack) on a 36-question stratified subset, not the full `_s` 500-question haystack. The `_s` number would be lower; treat 83.3% as an upper-bound-style measurement.
- Single run, temp 0 (a repeat run on the tuning subset reproduced the score exactly).
- Judge leniency can move scores by ±15–25 points across papers; we used mem0's lenient judge verbatim so numbers line up with their column.
Omi historically embedded only conversation summaries, so exact details (dates, names, numbers) were semantically unfindable. The single biggest lever was indexing verbatim transcript chunks (dated overlapping windows of the raw transcript) — shipped flag-gated in omi#7832. On the same protocol, LoCoMo went ~51% → 86.6% and LongMemEval 53% → 83.3% with that layer plus a stronger answerer and multi-query retrieval. Ingestion cost is embeddings-only: ~$0.01/user/month for a heavy voice user.
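To make "dated overlapping windows of the raw transcript" concrete, here is a minimal sketch of such a chunker. It is not the code shipped in omi#7832: the function name, the window and overlap sizes, and the `[date]` prefix format are all assumptions chosen for illustration.

```python
from typing import List


def chunk_transcript(
    lines: List[str], date: str, window: int = 4, overlap: int = 1
) -> List[str]:
    """Split transcript lines into overlapping windows, each prefixed with a date.

    Assumption: window and overlap are counted in transcript lines, and the
    date is prepended as plain text so it is embedded with the chunk.
    """
    if window <= 0 or not 0 <= overlap < window:
        raise ValueError("need window > 0 and 0 <= overlap < window")
    step = window - overlap
    chunks: List[str] = []
    for start in range(0, len(lines), step):
        piece = lines[start:start + window]
        chunks.append(f"[{date}] " + "\n".join(piece))
        if start + window >= len(lines):
            break  # the last window already reached the end of the transcript
    return chunks


# Made-up six-line transcript.
transcript = [f"Speaker: line {i}" for i in range(6)]
chunks = chunk_transcript(transcript, date="2023-05-08", window=4, overlap=1)
for chunk in chunks:
    print(chunk, end="\n---\n")
```

The overlap keeps a detail that straddles a window boundary retrievable from at least one chunk, and the date prefix gives the answerer an anchor for converting relative dates to absolute ones.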
- LoCoMo (Snap Research, CC BY-NC 4.0)
- LongMemEval (Wu et al., ICLR 2025)
- BEAM ("Beyond a Million Tokens", ICLR 2026)
Result files quote benchmark questions/answers for verifiability under the datasets' research licenses; no Omi user data is included (all runs used synthetic benchmark accounts).