sunnydigital/paperdb

PaperDB

Faceted decomposition and embedding pipeline for structured research-paper retrieval. Ingests CS/ML papers from arXiv + Semantic Scholar, decomposes them into a fixed set of orthogonal facets via LLM extraction, and stores the result in PostgreSQL + pgvector with HNSW indexes on per-facet embeddings. Supports composable multi-axis queries that single-vector embedding search cannot express.

Status

Course project for NYU CDS Data Engineering, Spring 2026. See paperdb_proposal.pdf for the full design.

Architecture at a glance

arXiv + S2 + PDFs ──▶ staging_papers ──▶ classify(archetype)
                                    │
                                    ▼
                              extract facets (universal + archetype-specific)
                                    │
                                    ▼
                              embed per-facet text
                                    │
                                    ▼
                              load into papers + extension table
                                    │
                                    ▼
                              composable_search() / FastAPI / CLI

Universal facets live on the papers core table; archetype-specific facets live on one of empirical_ml_facets, theory_facets, systems_facets, dataset_facets. Each paper has exactly one extension row, determined by its primary archetype. See docs/architecture.md and docs/archetypes.md.
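
The one-extension-row-per-paper rule can be sketched as a lookup from primary archetype to extension table. This is a minimal illustration, not the project's actual models.py: only Archetype.EMPIRICAL_ML and the four table names come from this README, while the other enum member names and the helper extension_table_for are assumptions.

```python
from enum import Enum


class Archetype(str, Enum):
    # EMPIRICAL_ML appears in this README; the other member names are assumed.
    EMPIRICAL_ML = "empirical_ml"
    THEORY = "theory"
    SYSTEMS = "systems"
    DATASET = "dataset"


# Extension-table names as listed in the README.
EXTENSION_TABLE = {
    Archetype.EMPIRICAL_ML: "empirical_ml_facets",
    Archetype.THEORY: "theory_facets",
    Archetype.SYSTEMS: "systems_facets",
    Archetype.DATASET: "dataset_facets",
}


def extension_table_for(primary: Archetype) -> str:
    """Return the single extension table holding a paper's archetype-specific facets."""
    return EXTENSION_TABLE[primary]


print(extension_table_for(Archetype.THEORY))
```

Because the mapping is keyed on the primary archetype alone, a paper can never end up with rows in two extension tables.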

Why per-facet embeddings (the load-bearing design choice)

Existing paper-search systems take one of two approaches:

  • Keyword search (Google Scholar, PubMed) is brittle to paraphrasing — "transformer-based machine translation" and "encoder-decoder attention model for MT" miss each other entirely.
  • Single-vector embeddings (SPECTER, OpenAlex) embed the entire abstract into one vector. That vector is a blurry average of method, domain, theory, and compute, so a paper about transformers for vision and a paper about transformers for NLP land near each other because both contain "transformer," even though only one is relevant when the query is about vision.

PaperDB instead embeds each facet separately, into its own vector column with its own HNSW index:

  • method_summary_embedding lives in a space where similarity ≈ "same method"
  • datasets_embedding lives in a space where similarity ≈ "same evaluation data"
  • problem_statement_embedding captures problem similarity
  • proof_technique_embedding captures how a theorem is proved, separately from what is proved (main_result_embedding)
  • ...and so on for every facet listed in docs/archetypes.md

This is what unlocks the proposal's headline query — "same method as paper X, but applied to a different domain":

from paperdb.query import FacetQuery, StructuredFilter, composable_search
from paperdb.models import Archetype

results = composable_search(
    queries=[
        # X is a placeholder for the ID of the reference paper; substitute a real paper ID.
        FacetQuery(facet="method_summary", query_paper_id=X, weight=2.0),
    ],
    filt=StructuredFilter(
        archetype=Archetype.EMPIRICAL_ML,
        domain_any=["bio"],   # explicitly *not* X's domain
    ),
    k=20,
)

A single-vector index cannot separate those two axes: method and domain are folded into the same vector, so there is no way to ask for high similarity on one while constraining the other. The lift that per-facet embeddings buy over the single-vector baseline is exactly what paperdb eval-retrieval measures.
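
A toy calculation may make the blur concrete. The vectors below are hand-made, not real embeddings, and averaging facet vectors is only a rough stand-in for how an abstract-level model mixes signals; the helper names cosine and blend are illustrative.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def blend(*facet_vectors):
    """Stand-in for a single abstract-level embedding: the mean of the facet vectors."""
    n = len(facet_vectors)
    return [sum(dims) / n for dims in zip(*facet_vectors)]


# Toy vectors: dims 0-1 encode method, dims 2-3 encode domain.
method_transformer = [1.0, 0.0, 0.0, 0.0]
domain_vision = [0.0, 0.0, 1.0, 0.0]
domain_nlp = [0.0, 0.0, 0.0, 1.0]

# Paper A: transformers for vision. Paper B: transformers for NLP.
method_sim = cosine(method_transformer, method_transformer)  # per-facet: same method
domain_sim = cosine(domain_vision, domain_nlp)               # per-facet: different domain
blended_sim = cosine(
    blend(method_transformer, domain_vision),
    blend(method_transformer, domain_nlp),
)

print(f"method facet:  {method_sim:.2f}")
print(f"domain facet:  {domain_sim:.2f}")
print(f"single vector: {blended_sim:.2f}")
```

In this toy setup the method facet reports full similarity and the domain facet reports none, while the blended vector lands in between and cannot tell you which axis agreed.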

Capability matrix

Capability | Keyword (BM25) | Single-vector (SPECTER) | PaperDB per-facet
Find papers about transformers | | |
Find papers about transformers in vision | brittle | ✓ (blurry) | ✓ (precise)
Same method as paper X, different domain | | |
Theory papers using PAC-Bayes proofs | | partial | ✓ (proof_technique_embedding + filter)
Combine vector similarity with WHERE year > 2022 | | ✗ (separate index) | ✓ (one SQL query)
Sub-100ms latency at 100k papers | | | ✓ (HNSW per facet)

Where embeddings show up in the code

Embeddings appear at four points in the pipeline:

  1. Compute: embed/backends.py wraps BGE-M3 (local, default) or OpenAI text-embedding-3-*. Both produce L2-normalized 1024-dim vectors so cosine distance behaves well. embed/facet_embeddings.py:embed_extracted runs each non-null facet text through the embedder and returns {column_name: vector}.
  2. Store: every embedding column is vector(1024) in src/paperdb/sql/, wrapped in an HNSW index with cosine distance (USING hnsw (col vector_cosine_ops) WITH (m = 16, ef_construction = 64)).
  3. Query: query/engine.py holds a FACET_REGISTRY mapping facet names → (table, embedding_column, archetype). Each FacetQuery resolves to one ANN scan against one HNSW index. The <=> operator is pgvector's cosine distance; 1 - (col <=> qvec) is the similarity the engine ranks by.
  4. Evaluate: evaluate/facet_quality.py uses cosine similarity of embeddings to score free-text facets like problem_statement or method_summary, since string equality is a poor metric there.
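
The compute step described above can be sketched roughly as follows. This is not the code in embed/facet_embeddings.py: embed_facets, l2_normalize, and toy_embed are hypothetical names, and toy_embed is a 3-dim stand-in for a real 1024-dim model. The `<facet>_embedding` column naming follows the column names quoted earlier in this README.

```python
import math
from typing import Callable, Dict, List, Optional


def l2_normalize(vec: List[float]) -> List[float]:
    """Scale a vector to unit length so cosine similarity reduces to a dot product."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm > 0 else list(vec)


def embed_facets(
    facets: Dict[str, Optional[str]],
    embed_fn: Callable[[str], List[float]],
) -> Dict[str, List[float]]:
    """Embed each non-null facet text and key the result by embedding column name."""
    return {
        f"{name}_embedding": l2_normalize(embed_fn(text))
        for name, text in facets.items()
        if text  # skip None and empty strings
    }


def toy_embed(text: str) -> List[float]:
    """Hypothetical stand-in for a real embedding model: simple character statistics."""
    return [float(len(text)), float(text.count(" ") + 1), float(sum(map(ord, text)) % 97)]


columns = embed_facets(
    {
        "method_summary": "sparse mixture of experts",
        "datasets": "ImageNet",
        "proof_technique": None,  # null facet: no vector is produced
    },
    toy_embed,
)
print(sorted(columns))
```

Skipping null facets at this stage means the corresponding vector columns simply stay NULL.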

Embed vs SQL: not every facet gets a vector

The schema embeds only facets where semantic similarity is meaningful as a query target: method_summary yes, code_url no; datasets yes (you want fuzzy matching across "MS-MARCO" / "MS MARCO"), metrics no (you would query that with SQL: WHERE (metric->>'value')::float > 0.85, where the cast is needed because ->> returns text). That split between embed-and-ANN and store-and-SQL-filter is the point of the relational + vector hybrid: embeddings for fuzzy semantic axes, SQL for crisp structured ones. Most queries use both at once via composable_search(queries=[...], filt=...).
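
As a rough sketch of how one facet query and one structured filter might combine into a single statement, the function below builds parameterized SQL around pgvector's <=> operator. It is not the engine in query/engine.py: the registry entries, the paper_id join key, and build_facet_query are assumptions made for illustration; only the operator, the similarity expression, and the year filter come from this README.

```python
from typing import List, Optional, Tuple

# Illustrative registry: facet name -> (table, embedding column). Entries are assumed.
FACET_REGISTRY = {
    "method_summary": ("papers", "method_summary_embedding"),
    "proof_technique": ("theory_facets", "proof_technique_embedding"),
}


def to_vector_literal(vec: List[float]) -> str:
    """Format a Python list in pgvector's text form, e.g. '[0.1,0.2]'."""
    return "[" + ",".join(str(x) for x in vec) + "]"


def build_facet_query(
    facet: str, qvec: List[float], min_year: Optional[int] = None, k: int = 10
) -> Tuple[str, list]:
    """Return (sql, params) for one ANN scan plus an optional crisp SQL filter."""
    table, col = FACET_REGISTRY[facet]
    literal = to_vector_literal(qvec)
    sql = f"SELECT p.paper_id, 1 - (f.{col} <=> %s::vector) AS similarity FROM {table} AS f"
    if table == "papers":
        sql = sql.replace(f"FROM {table} AS f", "FROM papers AS p").replace("f.", "p.")
        prefix = "p"
    else:
        # Assumed join key between an extension table and the core table.
        sql += " JOIN papers AS p ON p.paper_id = f.paper_id"
        prefix = "f"
    params: list = [literal]
    if min_year is not None:
        sql += " WHERE p.year > %s"
        params.append(min_year)
    # Ordering by raw distance (ascending) is what lets the HNSW index serve the scan.
    sql += f" ORDER BY {prefix}.{col} <=> %s::vector LIMIT %s"
    params += [literal, k]
    return sql, params


sql, params = build_facet_query("method_summary", [0.1, 0.2], min_year=2022, k=5)
print(sql)
print(params)
```

The table and column names are interpolated only from the fixed registry, while the query vector, year, and limit travel as bound parameters.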

Quickstart

# 1. Bring up Postgres + pgvector
docker compose up -d

# 2. Install
pip install -e ".[embeddings,dev]"

# 3. Configure
cp .env.example .env  # then edit

# 4. Apply schema
paperdb init-db

# 5. End-to-end on a small batch
paperdb run --n 20

# 6. Query
paperdb search "method_summary=mixture of experts" "datasets=ImageNet" --k 10
paperdb search "abstract=retrieval augmented generation" --archetype empirical_ml --domain nlp

# 7. Inspect
paperdb stats
paperdb facets

The Anthropic API key in .env is required for the LLM extraction. Embeddings default to local BGE-M3 via sentence-transformers; switch to OpenAI by setting PAPERDB_EMBEDDING_BACKEND=openai.

Run the API

uvicorn paperdb.api:app --reload
# http://localhost:8000/docs

Evaluation

# Per-facet F1 against a hand-labeled validation set
paperdb eval --validation-set data/eval/validation.jsonl

# Faceted retrieval vs single-embedding baseline
paperdb eval-retrieval --benchmark data/eval/retrieval.jsonl --k 10

# Method x domain sparsity (gap detection)
paperdb gaps --min-count 2
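
For readers unfamiliar with the two metrics named in the comments above, here is a minimal sketch of precision@k and a set-based F1. How paperdb eval actually scores each facet is not specified in this README; treating a facet as a set of extracted items is an assumption, and both function names are illustrative.

```python
from typing import Iterable, Sequence, Set


def precision_at_k(ranked_ids: Sequence[str], relevant_ids: Set[str], k: int) -> float:
    """Fraction of the top-k retrieved IDs that are labeled relevant."""
    if k <= 0:
        raise ValueError("k must be positive")
    hits = sum(1 for pid in ranked_ids[:k] if pid in relevant_ids)
    return hits / k


def set_f1(predicted: Iterable[str], gold: Iterable[str]) -> float:
    """F1 between an extracted set of items and a hand-labeled set."""
    predicted, gold = set(predicted), set(gold)
    if not predicted and not gold:
        return 1.0  # both empty: treat as perfect agreement
    true_positives = len(predicted & gold)
    if true_positives == 0:
        return 0.0
    precision = true_positives / len(predicted)
    recall = true_positives / len(gold)
    return 2 * precision * recall / (precision + recall)


# Hypothetical paper IDs and dataset names, for illustration only.
print(precision_at_k(["a", "b", "c", "d"], {"a", "c"}, k=2))
print(round(set_f1({"ImageNet", "CIFAR-10"}, {"ImageNet"}), 3))
```

Comparing precision@k for faceted retrieval against the same figure for a single-embedding baseline on one benchmark gives the lift in one number.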

Project layout

src/paperdb/
  config.py             # pydantic-settings
  db.py                 # connection pool, schema init
  models.py             # pydantic models for facets
  ingest/               # arXiv, S2, PDF
  extract/              # archetype classifier + 2-pass facet extraction + LLM client
  embed/                # BGE / OpenAI backends + per-facet embedding
  load/                 # writes into papers + extension tables
  flows/                # Prefect DAGs
  query/                # composable search engine
  evaluate/             # F1 per facet, precision@k vs baseline
  gaps.py               # method x domain sparsity
  api.py                # FastAPI
  cli.py                # typer CLI
  prompts/              # markdown prompts (versioned)
  sql/                  # DDL files applied by init-db
docs/
  archetypes.md         # taxonomy + per-archetype facet lists
  architecture.md       # data flow and design decisions
  data_dictionary.md    # every column, every table
notebooks/
  gaps_and_distributions.ipynb
tests/

Milestones

See the proposal for the full 9-week plan. The repo currently implements Weeks 1-7: schema + ingestion + extraction + embedding + loading + composable query + evaluation. Week 8 (gap detection) and Week 9 (documentation polish) are in gaps.py and docs/ respectively.
