PaperDB

Faceted decomposition and embedding pipeline for structured research-paper retrieval. Ingests CS/ML papers from arXiv + Semantic Scholar, decomposes them into a fixed set of orthogonal facets via LLM extraction, and stores the result in PostgreSQL + pgvector with HNSW indexes on per-facet embeddings. Supports composable multi-axis queries that single-vector embedding search cannot express.

Status

Course project for NYU CDS Data Engineering, Spring 2026. See paperdb_proposal.pdf for the full design.

Architecture at a glance

arXiv + S2 + PDFs ──▶ staging_papers ──▶ classify(archetype)
                                    │
                                    ▼
                              extract facets (universal + archetype-specific)
                                    │
                                    ▼
                              embed per-facet text
                                    │
                                    ▼
                              load into papers + extension table
                                    │
                                    ▼
                              composable_search() / FastAPI / CLI

Universal facets live on the papers core table; archetype-specific facets live on one of empirical_ml_facets, theory_facets, systems_facets, dataset_facets. Each paper has exactly one extension row, determined by its primary archetype. See docs/architecture.md and docs/archetypes.md.
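The core/extension split can be pictured with a small routing sketch. This is illustrative only and not the code in load/: the four table names and Archetype.EMPIRICAL_ML come from this README, while the other enum members, the split_facets helper, and the choice of which facet counts as universal are assumptions made for the example.

```python
from enum import Enum


class Archetype(str, Enum):
    # Only EMPIRICAL_ML appears in this README; the other members are assumed.
    EMPIRICAL_ML = "empirical_ml"
    THEORY = "theory"
    SYSTEMS = "systems"
    DATASET = "dataset"


# One extension table per archetype, so each paper gets exactly one extension row.
EXTENSION_TABLE = {
    Archetype.EMPIRICAL_ML: "empirical_ml_facets",
    Archetype.THEORY: "theory_facets",
    Archetype.SYSTEMS: "systems_facets",
    Archetype.DATASET: "dataset_facets",
}


def split_facets(archetype, facets, universal_columns):
    """Hypothetical helper: split facets into (core_row, extension_table, extension_row)."""
    core = {k: v for k, v in facets.items() if k in universal_columns}
    ext = {k: v for k, v in facets.items() if k not in universal_columns}
    return core, EXTENSION_TABLE[archetype], ext


core, table, ext = split_facets(
    Archetype.THEORY,
    {"problem_statement": "...", "proof_technique": "..."},
    universal_columns={"problem_statement"},  # assumed to be universal for this example
)
print(table)  # theory_facets
```

The authoritative list of universal vs archetype-specific facets is in docs/archetypes.md.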

Why per-facet embeddings (the load-bearing design choice)

Existing paper-search systems take one of two approaches:

Keyword search (Google Scholar, PubMed) is brittle to paraphrasing — "transformer-based machine translation" and "encoder-decoder attention model for MT" miss each other entirely.
Single-vector embeddings (SPECTER, OpenAlex) embed the entire abstract into one vector. That vector is a blurry average of method, domain, theory, and compute, so a paper about transformers for vision and a paper about transformers for NLP land near each other because both contain "transformer," even though only one of them matches a query about "vision."

PaperDB instead embeds each facet separately, into its own vector column with its own HNSW index:

method_summary_embedding lives in a space where similarity ≈ "same method"
datasets_embedding lives in a space where similarity ≈ "same evaluation data"
problem_statement_embedding captures problem similarity
proof_technique_embedding captures how a theorem is proved, separately from what is proved (main_result_embedding)
...and so on for every facet listed in docs/archetypes.md

This is what unlocks the proposal's headline query — "same method as paper X, but applied to a different domain":

from paperdb.query import FacetQuery, StructuredFilter, composable_search
from paperdb.models import Archetype

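# X is a placeholder for the ID of the reference paper.
results = composable_search(
    queries=[
        # Rank candidates by similarity to X's method_summary embedding.
        FacetQuery(facet="method_summary", query_paper_id=X, weight=2.0),
    ],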
    filt=StructuredFilter(
        archetype=Archetype.EMPIRICAL_ML,
        domain_any=["bio"],   # explicitly *not* X's domain
    ),
    k=20,
)

A single-vector index cannot separate those two axes: method and domain are already mixed into one vector, so neither can be matched or excluded on its own. The lift this buys over the single-vector baseline is exactly what paperdb eval-retrieval measures.
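A toy example may make the argument concrete. The vectors below are made up (two dimensions per axis, not real embeddings), and concatenating the two axes is only a stand-in for a whole-abstract embedding; the point is that once both axes share one vector, "same method, other domain" and "other method, same domain" can score identically.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Toy 2-d facet vectors, made up for illustration (not real embeddings).
MOE, CNN = [1.0, 0.0], [0.0, 1.0]  # method axis
NLP, BIO = [1.0, 0.0], [0.0, 1.0]  # domain axis

# paper -> (method vector, domain vector)
papers = {
    "X": (MOE, NLP),  # the query paper
    "A": (MOE, BIO),  # same method, different domain (the paper we want)
    "B": (CNN, NLP),  # different method, same domain
}


def single_vector(method, domain):
    # Stand-in for a whole-abstract embedding: both axes share one vector.
    return method + domain


x_method, x_domain = papers["X"]
for name in ("A", "B"):
    method, domain = papers[name]
    blended = cosine(single_vector(x_method, x_domain), single_vector(method, domain))
    per_facet = cosine(x_method, method)
    print(f"{name}: single-vector={blended:.2f}  method-facet={per_facet:.2f}")
```

Under the blended vector, A and B tie at 0.50 against X. The method facet alone scores A at 1.00 and B at 0.00, which is the separation the headline query relies on.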

Capability matrix

Capability	Keyword (BM25)	Single-vector (SPECTER)	PaperDB per-facet
Find papers about transformers	✓	✓	✓
Find papers about transformers in vision	brittle	✓ (blurry)	✓ (precise)
Same method as paper X, different domain	✗	✗	✓
Theory papers using PAC-Bayes proofs	partial	✗	✓ (`proof_technique_embedding` + filter)
Combine vector similarity with `WHERE year > 2022`	✗	✗ (separate index)	✓ (one SQL query)
Sub-100ms latency at 100k papers	✓	✓	✓ (HNSW per facet)

Where embeddings show up in the code

Embeddings appear at four points in the pipeline:

Compute — embed/backends.py wraps BGE-M3 (local, default) or OpenAI text-embedding-3-*. Both produce L2-normalized 1024-dim vectors so cosine distance behaves well. embed/facet_embeddings.py:embed_extracted runs each non-null facet text through the embedder and returns {column_name: vector}.
Store — every embedding column is vector(1024) in src/paperdb/sql/, wrapped in an HNSW index with cosine distance (USING hnsw (col vector_cosine_ops) WITH (m = 16, ef_construction = 64)).
Query — query/engine.py holds a FACET_REGISTRY mapping facet names → (table, embedding_column, archetype). Each FacetQuery resolves to one ANN scan against one HNSW index. The <=> operator is pgvector's cosine distance; 1 - (col <=> qvec) is the similarity the engine ranks by.
Evaluate — evaluate/facet_quality.py uses cosine similarity of embeddings to score free-text facets like problem_statement or method_summary, since string equality is a poor metric there.
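The Query step above can be sketched as a registry lookup followed by SQL generation. This is a hedged illustration, not the contents of query/engine.py: the FacetSpec type, the build_ann_sql helper, the paper_id column name, and the placement of method_summary on the papers table are all assumptions. Only the <=> operator, the 1 - distance similarity, and the (table, embedding_column, archetype) registry shape come from the description above.

```python
from typing import NamedTuple, Optional


class FacetSpec(NamedTuple):
    table: str
    embedding_column: str
    archetype: Optional[str]  # None for universal facets on the core table


# Hypothetical entries; the real FACET_REGISTRY may look different.
FACET_REGISTRY = {
    "method_summary": FacetSpec("papers", "method_summary_embedding", None),
    "proof_technique": FacetSpec("theory_facets", "proof_technique_embedding", "theory"),
}


def build_ann_sql(facet: str, k: int) -> str:
    """Build one ANN scan for one facet; the query vector is bound as %(qvec)s."""
    spec = FACET_REGISTRY[facet]  # identifiers come from the registry, never from user input
    col = spec.embedding_column
    return (
        f"SELECT paper_id, 1 - ({col} <=> %(qvec)s::vector) AS similarity "
        f"FROM {spec.table} "
        # Ordering by raw distance (ascending) is the form pgvector's HNSW index can serve.
        f"ORDER BY {col} <=> %(qvec)s::vector "
        f"LIMIT {int(k)}"
    )


print(build_ann_sql("proof_technique", k=5))
```

Table and column names are interpolated only because they come from a fixed registry; the query vector itself stays a bound parameter.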

Embed vs SQL: not every facet gets a vector

The schema embeds only facets where semantic similarity is meaningful as a query target: method_summary yes, code_url no; datasets yes (you want fuzzy matching across "MS-MARCO" / "MS MARCO"), metrics no (you would query that with SQL: WHERE (metric->>'value')::float > 0.85, where the cast is needed because ->> returns text). That split, embed-and-ANN vs store-and-SQL-filter, is the point of the relational + vector hybrid: embeddings for fuzzy semantic axes, SQL for crisp structured ones. Most queries use both at once via composable_search(queries=[...], filt=...).
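How the two halves meet can be sketched in plain Python. The scoring rule here, a weighted mean of per-facet similarities over the papers that pass the filter, is an assumption chosen for illustration; the README does not state how composable_search actually combines facets. The combine function, the paper IDs, and the numbers are all hypothetical.

```python
def combine(per_facet_hits, weights, allowed_ids):
    """Weighted mean of per-facet similarities over papers that pass the filter.

    per_facet_hits: {facet: {paper_id: similarity}}
    weights:        {facet: weight}
    allowed_ids:    paper ids that satisfied the structured (SQL-style) filter
    """
    total_weight = sum(weights[f] for f in per_facet_hits)
    scores = {}
    for facet, hits in per_facet_hits.items():
        for pid, sim in hits.items():
            if pid in allowed_ids:
                scores[pid] = scores.get(pid, 0.0) + weights[facet] * sim
    # A paper missing from a facet's hit list contributes 0 for that facet.
    ranked = [(pid, s / total_weight) for pid, s in scores.items()]
    return sorted(ranked, key=lambda t: t[1], reverse=True)


# Made-up similarities, as if returned by two per-facet ANN scans.
hits = {
    "method_summary": {"p1": 0.9, "p2": 0.4, "p3": 0.8},
    "datasets": {"p1": 0.2, "p2": 0.8, "p3": 0.7},
}
years = {"p1": 2024, "p2": 2023, "p3": 2021}
allowed = {pid for pid, year in years.items() if year > 2022}  # WHERE year > 2022

ranked = combine(hits, {"method_summary": 2.0, "datasets": 1.0}, allowed)
print([pid for pid, _ in ranked])  # ['p1', 'p2']
```

In the database the filter and the similarity ranking run in one SQL statement; the sketch separates them only to show which part is crisp and which part is fuzzy.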

Quickstart

# 1. Bring up Postgres + pgvector
docker compose up -d

# 2. Install
pip install -e ".[embeddings,dev]"

# 3. Configure
cp .env.example .env  # then edit

# 4. Apply schema
paperdb init-db

# 5. End-to-end on a small batch
paperdb run --n 20

# 6. Query
paperdb search "method_summary=mixture of experts" "datasets=ImageNet" --k 10
paperdb search "abstract=retrieval augmented generation" --archetype empirical_ml --domain nlp

# 7. Inspect
paperdb stats
paperdb facets

The Anthropic API key in .env is required for the LLM extraction. Embeddings default to local BGE-M3 via sentence-transformers; switch to OpenAI by setting PAPERDB_EMBEDDING_BACKEND=openai.

Run the API

uvicorn paperdb.api:app --reload
# http://localhost:8000/docs

Evaluation

# Per-facet F1 against a hand-labeled validation set
paperdb eval --validation-set data/eval/validation.jsonl

# Faceted retrieval vs single-embedding baseline
paperdb eval-retrieval --benchmark data/eval/retrieval.jsonl --k 10

# Method x domain sparsity (gap detection)
paperdb gaps --min-count 2
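For readers unfamiliar with the retrieval metric, here is a minimal sketch of precision@k, the measure the evaluate/ package is described as comparing against the baseline. The function name, the paper IDs, and the rankings are made up; the format of data/eval/retrieval.jsonl is not shown in this README, so nothing here depends on it.

```python
def precision_at_k(ranked_ids, relevant_ids, k):
    """Fraction of the top-k results that are relevant."""
    if k <= 0:
        raise ValueError("k must be positive")
    top_k = ranked_ids[:k]
    return sum(1 for pid in top_k if pid in relevant_ids) / k


# Hypothetical judgments and rankings for one benchmark query.
relevant = {"a", "c", "z"}
faceted = ["a", "b", "c", "d"]   # ranking from per-facet search
baseline = ["b", "a", "d", "e"]  # ranking from a single-vector baseline

p_faceted = precision_at_k(faceted, relevant, k=4)
p_baseline = precision_at_k(baseline, relevant, k=4)
print(p_faceted, p_baseline, p_faceted - p_baseline)  # 0.5 0.25 0.25
```

The difference between the two numbers, averaged over benchmark queries, is one plausible way to report the lift over the baseline.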

Project layout

src/paperdb/
  config.py             # pydantic-settings
  db.py                 # connection pool, schema init
  models.py             # pydantic models for facets
  ingest/               # arXiv, S2, PDF
  extract/              # archetype classifier + 2-pass facet extraction + LLM client
  embed/                # BGE / OpenAI backends + per-facet embedding
  load/                 # writes into papers + extension tables
  flows/                # Prefect DAGs
  query/                # composable search engine
  evaluate/             # F1 per facet, precision@k vs baseline
  gaps.py               # method x domain sparsity
  api.py                # FastAPI
  cli.py                # typer CLI
  prompts/              # markdown prompts (versioned)
  sql/                  # DDL files applied by init-db
docs/
  archetypes.md         # taxonomy + per-archetype facet lists
  architecture.md       # data flow and design decisions
  data_dictionary.md    # every column, every table
notebooks/
  gaps_and_distributions.ipynb
tests/

Milestones

See the proposal for the full 9-week plan. The repo currently implements Weeks 1-7: schema, ingestion, extraction, embedding, loading, composable query, and evaluation. Week 8 (gap detection) lives in gaps.py and Week 9 (documentation polish) in docs/.
