Omi Memory Benchmarks

How good is Omi's memory? We measured it on the standard long-term-memory benchmarks — LoCoMo, LongMemEval, and BEAM — using the same evaluation protocol the memory vendors use, and published everything here: scores, per-question results, and the exact harness code.

Headline results

LoCoMo — full benchmark (all 10 conversations, 1,540 questions)

System	Score	Source
Zep	58.4%	corrected replication
mem0 (paper, reproducible)	66.9%	mem0 paper, Table 1
Full-context baseline	72.9%	mem0 paper
Letta (filesystem)	74.0%	Letta blog
Omi	86.6%	this repo, `results/locomo_full_1540.json`
mem0 Platform	92.5%	vendor-reported, closed platform

Per category (Omi): single-hop 90.5% · multi-hop 85.1% · temporal 84.7% · open-domain 63.5%.

LongMemEval (oracle variant, stratified 36-question subset)

83.3% overall: single-session-user 100% · single-session-assistant 100% · knowledge-update 83% · multi-session 83% · temporal-reasoning 83% · preference 50%. Results: `results/longmemeval_oracle_36.json`
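
The overall figure is consistent with an even split of six questions per type. A quick arithmetic check, assuming the stratified subset holds 6 questions in each of the 6 types (the text does not state this explicitly) and that each rounded 83% entry means 5 of 6 correct:

```python
from fractions import Fraction

# Assumed correct-answer counts out of 6 per question type. These are inferred
# from the rounded percentages in the text, not read from the results file.
correct_by_type = {
    "single-session-user": 6,
    "single-session-assistant": 6,
    "knowledge-update": 5,
    "multi-session": 5,
    "temporal-reasoning": 5,
    "preference": 3,
}
PER_TYPE = 6  # assumption: 36 questions split evenly over 6 types

total_correct = sum(correct_by_type.values())
total_questions = PER_TYPE * len(correct_by_type)
overall = Fraction(total_correct, total_questions)

print(f"{total_correct}/{total_questions} = {float(overall):.1%}")
```

Under those assumptions the per-type scores and the 83.3% headline agree; the results file remains the authoritative source for the actual counts.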

BEAM-100K (40 questions, rubric grading)

55%–62.5% depending on answering mode (62.5% via Omi's conversational agent, 55% via the short-answer QA head). For reference, BEAM's LIGHT baseline scores ~34% at 1M scale, and the SOTA (Hindsight) scores ~73% at 100K. Results: `results/beam_100k_40.json`

How it was measured

The protocol mirrors how mem0, Zep, and supermemory publish their numbers, using a thin QA head over the memory system's retrieval API rather than the consumer chat UI:

Ingestion: every benchmark conversation is pushed through Omi's real open-source pipeline (process_conversation): LLM structuring, memory extraction, conversation vectors, and Omi's verbatim transcript-chunk index (BasedHardware/omi#7832).
Retrieval: Omi's REST API, combining POST /v1/tools/memories/search (k=20), POST /v1/tools/conversations/search (k=15), and POST /v1/tools/conversations/search-chunks (k=20, merged across 3 query reformulations).
Answering: gpt-4.1, temp 0, with mem0's official LoCoMo answer prompt (answers of at most 5-6 words, relative→absolute date conversion), taken verbatim from mem0ai/mem0/evaluation/prompts.py.
Judging: gpt-4o-mini, temp 0, with mem0's ACCURACY_PROMPT LLM judge used verbatim (the de facto standard that vendors report with). LoCoMo category 5 (adversarial) is excluded, exactly as in mem0's own eval code.
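
The "merged across 3 query reformulations" step in retrieval can be pictured with a small sketch. This is not the harness code: the hit shape (`id`, `text`, `score`), the dedupe-by-id rule, and the keep-the-highest-score rule are all assumptions made for illustration, and `fake_chunk_search` is a hypothetical stub standing in for a call to the search-chunks endpoint.

```python
from typing import Any, Callable, Dict, Iterable, List

Hit = Dict[str, Any]  # hypothetical shape: {"id": str, "text": str, "score": float}
SearchFn = Callable[[str, int], List[Hit]]


def merge_across_queries(search: SearchFn, queries: Iterable[str], k: int = 20) -> List[Hit]:
    """Run every query reformulation, dedupe hits by id keeping the best score, return the top k."""
    best: Dict[str, Hit] = {}
    for query in queries:
        for hit in search(query, k):
            hit_id = str(hit["id"])
            if hit_id not in best or hit["score"] > best[hit_id]["score"]:
                best[hit_id] = hit
    ranked = sorted(best.values(), key=lambda h: h["score"], reverse=True)
    return ranked[:k]


def fake_chunk_search(query: str, k: int) -> List[Hit]:
    """Hypothetical stub; a real run would POST the query to the retrieval API instead."""
    index = {
        "when did Ana move": [
            {"id": "c1", "text": "2023-05-07: Ana moved to Lisbon", "score": 0.71},
        ],
        "Ana relocation date": [
            {"id": "c1", "text": "2023-05-07: Ana moved to Lisbon", "score": 0.83},
            {"id": "c2", "text": "2023-05-09: Ana unpacked boxes", "score": 0.40},
        ],
    }
    return index.get(query, [])[:k]


reformulations = ["when did Ana move", "Ana relocation date", "Ana Lisbon"]
merged = merge_across_queries(fake_chunk_search, reformulations, k=20)
print([(h["id"], h["score"]) for h in merged])
```

Passing the search function in as a parameter keeps the merge logic testable without network access; swapping in a real HTTP client would not change `merge_across_queries`.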

Everything is in harness/: the QA head, judge, and ingestion scripts. Results JSONs contain every question, gold answer, generated answer, and verdict.
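
Because the results files carry a verdict for every question, the headline numbers can in principle be recomputed independently. The sketch below shows one way to do that, assuming a hypothetical record schema with a numeric `category` field and a `verdict` field whose value is `"CORRECT"` for a passing answer; the actual field names in the results JSON files may differ.

```python
import json
from collections import defaultdict
from typing import Dict, Iterable, Tuple


def accuracy_by_category(records: Iterable[dict], excluded: Tuple[int, ...] = (5,)) -> Dict[str, float]:
    """Return per-category and overall accuracy, skipping excluded categories."""
    correct: Dict[int, int] = defaultdict(int)
    total: Dict[int, int] = defaultdict(int)
    for rec in records:
        category = rec["category"]  # hypothetical field name
        if category in excluded:
            continue  # e.g. LoCoMo category 5 (adversarial)
        total[category] += 1
        if rec["verdict"] == "CORRECT":  # hypothetical field name and label
            correct[category] += 1
    if not total:
        raise ValueError("no scorable records")
    scores = {f"category_{c}": correct[c] / total[c] for c in sorted(total)}
    scores["overall"] = sum(correct.values()) / sum(total.values())
    return scores


# Tiny inline sample in the assumed schema; a real check would json.load a results file.
sample = json.loads("""
[
  {"category": 1, "verdict": "CORRECT"},
  {"category": 1, "verdict": "WRONG"},
  {"category": 2, "verdict": "CORRECT"},
  {"category": 5, "verdict": "WRONG"}
]
""")
scores = accuracy_by_category(sample)
print(scores)
```

Note that the overall score is computed over questions, not as a mean of category scores, so larger categories weigh more.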

Honest caveats (read before quoting)

Vendor numbers vary by protocol. mem0's 92.5% is from their closed managed platform (top_k=200, vendor-reported); their reproducible paper number is 66.9%. supermemory's ~90%+ claims are on LongMemEval (answer accuracy), a different benchmark from LoCoMo.
Ingestion granularity. Omi ingests per-session conversations (its native unit); mem0/Zep ingest per message-pair. Per-session retrieval is a somewhat easier task, which may favor Omi in this comparison.
Dev/held-out split. Retrieval/prompt settings were tuned on a 75-question subset (89.3% there) and validated frozen on 120 unseen questions (77.5%) before the full run (86.6%). The four conversations never touched during tuning score 83–89%.
LongMemEval is the oracle variant (evidence sessions only, no distractor haystack) on a 36-question stratified subset, not the full 500-question _s haystack. The _s number would be lower; treat 83.3% as an upper-bound-style measurement.
Single run, temp 0 (a repeat run on the tuning subset reproduced the score exactly).
Judge leniency shifts scores by ±15–25 points across papers; we used mem0's lenient judge verbatim so that our numbers line up with their column.
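
The subset-size caveats above can be made more concrete with a confidence interval. The sketch below computes a Wilson score interval for a proportion. The correct-answer counts are back-calculated from the reported percentages (1,334 of 1,540 and 30 of 36); they are assumptions, not values read from the results files, and the interval covers only sampling noise, not protocol or judge differences.

```python
import math
from typing import Tuple


def wilson_interval(correct: int, total: int, z: float = 1.96) -> Tuple[float, float]:
    """Wilson score interval for a binomial proportion (z=1.96 is roughly 95%)."""
    if total <= 0 or not 0 <= correct <= total:
        raise ValueError("need 0 <= correct <= total and total > 0")
    p = correct / total
    denom = 1 + z * z / total
    centre = (p + z * z / (2 * total)) / denom
    half = z * math.sqrt(p * (1 - p) / total + z * z / (4 * total * total)) / denom
    return centre - half, centre + half


# Assumed counts, inferred from the percentages quoted in this README.
runs = {
    "LoCoMo full (assumed 1334/1540)": (1334, 1540),
    "LongMemEval oracle subset (assumed 30/36)": (30, 36),
}
intervals = {name: wilson_interval(c, n) for name, (c, n) in runs.items()}
for name, (low, high) in intervals.items():
    print(f"{name}: {low:.1%} to {high:.1%}")
```

The interval for the 36-question subset is far wider than the one for the full LoCoMo run, which is one more reason to read the small-subset scores as indicative rather than precise.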

What made the difference

Omi historically embedded only conversation summaries, so exact details (dates, names, numbers) were semantically unfindable. The single biggest lever was indexing verbatim transcript chunks (dated overlapping windows of the raw transcript) — shipped flag-gated in omi#7832. On the same protocol, LoCoMo went ~51% → 86.6% and LongMemEval 53% → 83.3% with that layer plus a stronger answerer and multi-query retrieval. Ingestion cost is embeddings-only: ~$0.01/user/month for a heavy voice user.
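
To illustrate what "dated overlapping windows of the raw transcript" could look like, here is a minimal sketch. It is not the implementation from omi#7832: the word-based window, the window and overlap sizes, the bracketed date prefix, and the `Chunk` type are all assumptions chosen for illustration.

```python
from dataclasses import dataclass
from datetime import date
from typing import List


@dataclass
class Chunk:
    conversation_date: date
    start_word: int  # offset of the first word of this window in the transcript
    text: str        # date-prefixed so the embedded text carries the timestamp


def chunk_transcript(transcript: str, conversation_date: date,
                     window: int = 120, overlap: int = 30) -> List[Chunk]:
    """Split a transcript into overlapping word windows, each prefixed with its date."""
    if window <= 0 or not 0 <= overlap < window:
        raise ValueError("need window > 0 and 0 <= overlap < window")
    words = transcript.split()
    step = window - overlap
    chunks: List[Chunk] = []
    for start in range(0, len(words), step):
        piece = " ".join(words[start:start + window])
        chunks.append(Chunk(conversation_date, start, f"[{conversation_date.isoformat()}] {piece}"))
        if start + window >= len(words):
            break  # this window already reaches the end of the transcript
    return chunks


demo_words = " ".join(f"w{i}" for i in range(10))
demo_chunks = chunk_transcript(demo_words, date(2023, 5, 7), window=4, overlap=2)
for chunk in demo_chunks:
    print(chunk.start_word, chunk.text)
```

Keeping the raw words (rather than a summary) in each chunk is what lets exact dates, names, and numbers survive into the index, and the overlap reduces the chance that a fact is split across a window boundary.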

Datasets

LoCoMo (Snap Research, CC BY-NC 4.0)
LongMemEval (Wu et al., ICLR 2025)
BEAM ("Beyond a Million Tokens", ICLR 2026)

Result files quote benchmark questions/answers for verifiability under the datasets' research licenses; no Omi user data is included (all runs used synthetic benchmark accounts).

