BasedHardware/omi-memory-benchmarks

Omi Memory Benchmarks

How good is Omi's memory? We measured it on the standard long-term-memory benchmarks — LoCoMo, LongMemEval, and BEAM — using the same evaluation protocol the memory vendors use, and published everything here: scores, per-question results, and the exact harness code.

Headline results

LoCoMo — full benchmark (all 10 conversations, 1,540 questions)

| System | Score | Source |
| --- | --- | --- |
| Zep | 58.4% | corrected replication |
| mem0 (paper, reproducible) | 66.9% | mem0 paper, Table 1 |
| Full-context baseline | 72.9% | mem0 paper |
| Letta (filesystem) | 74.0% | Letta blog |
| Omi | 86.6% | this repo, results/locomo_full_1540.json |
| mem0 Platform | 92.5% | vendor-reported, closed platform |

Per category (Omi): single-hop 90.5% · multi-hop 85.1% · temporal 84.7% · open-domain 63.5%.

LongMemEval (oracle variant, stratified 36-question subset)

83.3% overall: single-session-user 100% · single-session-assistant 100% · knowledge-update 83% · multi-session 83% · temporal-reasoning 83% · preference 50%. Results: results/longmemeval_oracle_36.json

BEAM-100K (40 questions, rubric grading)

55–62.5% depending on answering mode (62.5% via Omi's conversational agent, 55% via the short-answer QA head). For reference, BEAM's LIGHT baseline scores ~34% at 1M scale, and the SOTA (Hindsight) scores ~73% at 100K. Results: results/beam_100k_40.json

How it was measured

The protocol mirrors how mem0/Zep/supermemory publish their numbers (a thin QA head over the memory system's retrieval API — not the consumer chat UI):

  1. Ingestion: every benchmark conversation is pushed through Omi's real open-source pipeline (process_conversation): LLM structuring, memory extraction, conversation vectors, and Omi's verbatim transcript-chunk index (BasedHardware/omi#7832).
  2. Retrieval: Omi's REST API, via POST /v1/tools/memories/search (k=20) + POST /v1/tools/conversations/search (k=15) + POST /v1/tools/conversations/search-chunks (k=20, merged across 3 query reformulations).
  3. Answering: gpt-4.1, temp 0, with mem0's official LoCoMo answer prompt (answers of at most 5-6 words, relative→absolute date conversion), verbatim from mem0ai/mem0/evaluation/prompts.py.
  4. Judging: gpt-4o-mini, temp 0, using mem0's ACCURACY_PROMPT LLM judge verbatim (the de facto standard all vendors report with). LoCoMo category 5 (adversarial) is excluded, exactly as in mem0's own eval code.
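The retrieval step above can be sketched roughly as follows. This is an illustration, not the harness code: only the endpoint paths and k values come from the list above. The `search` callable, the hit fields `id` and `score`, the choice to send the original question to the memory and conversation endpoints, and the keep-the-best-score merge rule are all assumptions made for the example.

```python
from typing import Any, Callable, Dict, List

Hit = Dict[str, Any]  # hypothetical hit shape: {"id": ..., "score": ..., "text": ...}
SearchFn = Callable[[str, str, int], List[Hit]]  # (endpoint path, query, k) -> hits

# Endpoint paths as listed in the protocol.
MEMORIES = "/v1/tools/memories/search"
CONVERSATIONS = "/v1/tools/conversations/search"
CHUNKS = "/v1/tools/conversations/search-chunks"


def merge_hits(hit_lists: List[List[Hit]], k: int) -> List[Hit]:
    """Deduplicate hits by id, keep the best score per id, return the top k.

    The merge rule is an assumption; the README only says results are
    "merged across 3 query reformulations".
    """
    best: Dict[str, Hit] = {}
    for hits in hit_lists:
        for hit in hits:
            key = str(hit["id"])
            if key not in best or hit["score"] > best[key]["score"]:
                best[key] = hit
    return sorted(best.values(), key=lambda h: h["score"], reverse=True)[:k]


def retrieve(question: str, chunk_queries: List[str], search: SearchFn) -> Dict[str, List[Hit]]:
    """Collect context from the three endpoints for one benchmark question."""
    return {
        "memories": search(MEMORIES, question, 20),
        "conversations": search(CONVERSATIONS, question, 15),
        # One chunk search per reformulated query, then a single merged top-20.
        "chunks": merge_hits([search(CHUNKS, q, 20) for q in chunk_queries], k=20),
    }
```

Passing `search` in as a callable keeps the sketch independent of any HTTP client or authentication scheme, neither of which is described here.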

Everything is in harness/: the QA head, judge, and ingestion scripts. Results JSONs contain every question, gold answer, generated answer, and verdict.
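Since the results files hold a verdict per question, the headline and per-category numbers can in principle be recomputed from them. The sketch below assumes a schema that this README does not document: a top-level JSON list of records, each with a `category` field and a `verdict` field whose value is `"CORRECT"` for a passing answer. Adjust the field names to match the actual files.

```python
import json
from collections import defaultdict
from typing import Any, Dict, Iterable, List, Tuple


def load_records(path: str) -> List[Dict[str, Any]]:
    """Load result records; assumes the file is a JSON list (hypothetical schema)."""
    with open(path, encoding="utf-8") as f:
        return json.load(f)


def score(
    records: Iterable[Dict[str, Any]],
    excluded_categories: frozenset = frozenset({5}),  # LoCoMo category 5 = adversarial
) -> Tuple[float, Dict[Any, float]]:
    """Return (overall accuracy, accuracy per category), skipping excluded categories."""
    counts: Dict[Any, List[int]] = defaultdict(lambda: [0, 0])  # category -> [correct, total]
    for record in records:
        category = record["category"]
        if category in excluded_categories:
            continue
        counts[category][0] += 1 if record["verdict"] == "CORRECT" else 0
        counts[category][1] += 1
    correct = sum(c for c, _ in counts.values())
    total = sum(t for _, t in counts.values())
    overall = correct / total if total else 0.0
    per_category = {cat: c / t for cat, (c, t) in counts.items()}
    return overall, per_category
```

A call such as `score(load_records("results/locomo_full_1540.json"))` would then give an overall figure to compare against the table, provided the assumed schema holds.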

Honest caveats (read before quoting)

  • Vendor numbers vary by protocol. mem0's 92.5% comes from their closed managed platform (top_k=200, vendor-reported); their reproducible paper number is 66.9%. supermemory's ~90%+ claims are on LongMemEval (answer accuracy), a different benchmark from LoCoMo.
  • Ingestion granularity. Omi ingests per-session conversations (its native unit); mem0/Zep ingest per message-pair. Per-session retrieval is somewhat easier.
  • Dev/held-out split. Retrieval/prompt settings were tuned on a 75-question subset (89.3% there) and validated frozen on 120 unseen questions (77.5%) before the full run (86.6%). The four conversations never touched during tuning score 83–89%.
  • LongMemEval is the oracle variant (evidence sessions only, no distractor haystack) on a 36-question stratified subset — not the full _s 500-question haystack. The _s number would be lower; treat 83.3% as an upper-bound-style measurement.
  • Single run, temp 0 (a repeat run on the tuning subset reproduced the score exactly).
  • Judge leniency can shift scores by 15–25 points across papers; we used mem0's lenient judge verbatim so our numbers line up with their column.
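To get a feel for how much weight a single run on a given number of questions can bear, one option is a Wilson score interval on the accuracy. This is a generic statistical sketch, not something the harness computes. It captures question-sampling noise only, not judge leniency or protocol differences. The correct-answer counts used in the usage lines are back-derived from the rounded percentages above and are therefore approximate.

```python
import math
from typing import Tuple


def wilson_interval(correct: int, total: int, z: float = 1.96) -> Tuple[float, float]:
    """Wilson score interval for a binomial proportion (z=1.96 is roughly 95%)."""
    if total <= 0:
        raise ValueError("total must be positive")
    p = correct / total
    denom = 1 + z * z / total
    center = (p + z * z / (2 * total)) / denom
    margin = z * math.sqrt(p * (1 - p) / total + z * z / (4 * total * total)) / denom
    return center - margin, center + margin


# Approximate counts derived from the reported percentages (assumption).
locomo = wilson_interval(round(0.866 * 1540), 1540)  # full LoCoMo run
longmemeval = wilson_interval(30, 36)                # 83.3% of 36 questions
print("LoCoMo:", locomo)
print("LongMemEval:", longmemeval)
```

The interval for the 36-question subset is far wider than the one for the 1,540-question run, which is one more reason to read the LongMemEval figure as indicative rather than precise.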

What made the difference

Omi historically embedded only conversation summaries, so exact details (dates, names, numbers) were semantically unfindable. The single biggest lever was indexing verbatim transcript chunks (dated overlapping windows of the raw transcript) — shipped flag-gated in omi#7832. On the same protocol, LoCoMo went ~51% → 86.6% and LongMemEval 53% → 83.3% with that layer plus a stronger answerer and multi-query retrieval. Ingestion cost is embeddings-only: ~$0.01/user/month for a heavy voice user.
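The idea of "dated overlapping windows of the raw transcript" can be illustrated with a small chunker. This is a sketch of the general technique, not the implementation in omi#7832: the `Chunk` type, the window and overlap sizes, the segment-based windowing, and the bracketed date prefix are all hypothetical choices made for the example.

```python
from dataclasses import dataclass
from datetime import date
from typing import List


@dataclass
class Chunk:
    conversation_date: date
    start: int  # index of the first transcript segment in this window
    text: str   # verbatim window text, prefixed with the conversation date


def chunk_transcript(
    segments: List[str],
    conversation_date: date,
    window: int = 6,   # segments per chunk (hypothetical default)
    overlap: int = 2,  # segments shared with the previous chunk (hypothetical default)
) -> List[Chunk]:
    """Split a transcript into overlapping, date-stamped windows of raw text."""
    if window <= 0 or not 0 <= overlap < window:
        raise ValueError("need window > 0 and 0 <= overlap < window")
    step = window - overlap
    chunks: List[Chunk] = []
    for start in range(0, len(segments), step):
        piece = segments[start:start + window]
        # The date prefix keeps the date inside the embedded text itself.
        text = f"[{conversation_date.isoformat()}] " + " ".join(piece)
        chunks.append(Chunk(conversation_date, start, text))
        if start + window >= len(segments):
            break  # the last window already reaches the end of the transcript
    return chunks
```

Because each chunk carries the raw wording plus a date, a query about an exact name, number, or day can match text that a conversation summary would have dropped.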

Datasets

  • LoCoMo (Snap Research, CC BY-NC 4.0)
  • LongMemEval (Wu et al., ICLR 2025)
  • BEAM ("Beyond a Million Tokens", ICLR 2026)

Result files quote benchmark questions/answers for verifiability under the datasets' research licenses; no Omi user data is included (all runs used synthetic benchmark accounts).
