Faceted decomposition and embedding pipeline for structured research-paper retrieval. Ingests CS/ML papers from arXiv + Semantic Scholar, decomposes them into a fixed set of orthogonal facets via LLM extraction, and stores the result in PostgreSQL + pgvector with HNSW indexes on per-facet embeddings. Supports composable multi-axis queries that single-vector embedding search cannot express.
Course project for NYU CDS Data Engineering, Spring 2026. See
paperdb_proposal.pdf for the full design.
arXiv + S2 + PDFs ──▶ staging_papers ──▶ classify(archetype)
│
▼
extract facets (universal + archetype-specific)
│
▼
embed per-facet text
│
▼
load into papers + extension table
│
▼
composable_search() / FastAPI / CLI
Universal facets live on the papers core table; archetype-specific facets
live on one of empirical_ml_facets, theory_facets, systems_facets,
dataset_facets. Each paper has exactly one extension row, determined by
its primary archetype. See docs/architecture.md and
docs/archetypes.md.
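The one-extension-row rule above can be pictured as a lookup from archetype to table. The following is only an illustrative sketch: the table names come from this README, but the enum member names other than EMPIRICAL_ML and the helper extension_table_for are assumptions, not the project's actual code.

```python
from enum import Enum


class Archetype(str, Enum):
    # Hypothetical mirror of paperdb.models.Archetype. Only EMPIRICAL_ML is
    # confirmed by this README; the other member names are assumed.
    EMPIRICAL_ML = "empirical_ml"
    THEORY = "theory"
    SYSTEMS = "systems"
    DATASET = "dataset"


# Extension table names as listed in this README.
EXTENSION_TABLES = {
    Archetype.EMPIRICAL_ML: "empirical_ml_facets",
    Archetype.THEORY: "theory_facets",
    Archetype.SYSTEMS: "systems_facets",
    Archetype.DATASET: "dataset_facets",
}


def extension_table_for(archetype: Archetype) -> str:
    """Return the single extension table holding a paper's archetype-specific facets."""
    return EXTENSION_TABLES[archetype]


print(extension_table_for(Archetype.THEORY))
```

Keeping the mapping in one dictionary makes the "exactly one extension row" invariant easy to check: every archetype resolves to one table, and no two archetypes share a table.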
Existing paper-search systems take one of two approaches:
- Keyword search (Google Scholar, PubMed) is brittle to paraphrasing: "transformer-based machine translation" and "encoder-decoder attention model for MT" miss each other entirely.
- Single-vector embeddings (SPECTER, OpenAlex) embed the entire abstract into one vector. That vector is a blurry average of method, domain, theory, and compute, so a paper about transformers for vision and a paper about transformers for NLP land near each other because both contain "transformer," even though only one matches if your query is "vision."
PaperDB instead embeds each facet separately, into its own vector column with its own HNSW index:
- method_summary_embedding lives in a space where similarity ≈ "same method"
- datasets_embedding lives in a space where similarity ≈ "same evaluation data"
- problem_statement_embedding captures problem similarity
- proof_technique_embedding captures how a theorem is proved, separately from what is proved (main_result_embedding)
- ...and so on for every facet listed in docs/archetypes.md
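A toy example may make the difference concrete. The sketch below uses made-up 2-D vectors (not real BGE-M3 output) and a concatenated vector as a stand-in for a whole-abstract embedding. All names and values are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm


# Toy facet vectors. Method axes: attention / convolution.
# Domain axes: nlp / bio. All values are invented for illustration.
papers = {
    "X": {"method": [1.0, 0.0], "domain": [1.0, 0.0]},  # attention, nlp
    "A": {"method": [1.0, 0.0], "domain": [0.0, 1.0]},  # attention, bio
    "B": {"method": [0.0, 1.0], "domain": [1.0, 0.0]},  # convolution, nlp
}


def single_vector(paper):
    # Stand-in for one whole-abstract embedding: both facets share one vector.
    return paper["method"] + paper["domain"]


x = papers["X"]
for name in ("A", "B"):
    p = papers[name]
    blended = cosine(single_vector(x), single_vector(p))
    by_method = cosine(x["method"], p["method"])
    print(name, round(blended, 2), round(by_method, 2))
```

In this toy setup the blended score is the same for A and B, so a single vector cannot tell "same method, different domain" apart from "different method, same domain". The method-only score separates them cleanly.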
This is what unlocks the proposal's headline query — "same method as paper X, but applied to a different domain":
from paperdb.query import FacetQuery, StructuredFilter, composable_search
from paperdb.models import Archetype

# X is the ID of the reference paper.
results = composable_search(
    queries=[
        FacetQuery(facet="method_summary", query_paper_id=X, weight=2.0),
    ],
    filt=StructuredFilter(
        archetype=Archetype.EMPIRICAL_ML,
        domain_any=["bio"],  # explicitly *not* X's domain
    ),
    k=20,
)

A single-vector index cannot separate those two axes, because method and
domain are blended into one vector; you would be searching one blurry blob
and hoping. The lift this buys over the single-vector baseline is exactly
what paperdb eval-retrieval measures.
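The repo layout lists precision@k against a baseline as the retrieval metric. As a rough sketch of how such a lift could be computed (the function name, paper IDs, and rankings below are hypothetical, not taken from paperdb's evaluate/ module):

```python
def precision_at_k(ranked_ids, relevant_ids, k):
    """Fraction of the top-k ranked results that are labeled relevant."""
    if k <= 0:
        raise ValueError("k must be positive")
    top_k = ranked_ids[:k]
    hits = sum(1 for pid in top_k if pid in relevant_ids)
    return hits / k


# Hypothetical benchmark entry: labeled relevant papers plus two rankings.
relevant = {"p1", "p4", "p7"}
faceted_ranking = ["p1", "p4", "p9", "p7", "p2"]
baseline_ranking = ["p9", "p1", "p2", "p3", "p4"]

faceted_p = precision_at_k(faceted_ranking, relevant, k=5)
baseline_p = precision_at_k(baseline_ranking, relevant, k=5)
print(f"faceted={faceted_p:.2f} baseline={baseline_p:.2f} lift={faceted_p - baseline_p:.2f}")
```

Averaging this difference over every query in a benchmark file would give one number for the faceted-versus-baseline comparison.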
| Capability | Keyword (BM25) | Single-vector (SPECTER) | PaperDB per-facet |
|---|---|---|---|
| Find papers about transformers | ✓ | ✓ | ✓ |
| Find papers about transformers in vision | brittle | ✓ (blurry) | ✓ (precise) |
| Same method as paper X, different domain | ✗ | ✗ | ✓ |
| Theory papers using PAC-Bayes proofs | partial | ✗ | ✓ (proof_technique_embedding + filter) |
| Combine vector similarity with WHERE year > 2022 | ✗ | ✗ (separate index) | ✓ (one SQL query) |
| Sub-100ms latency at 100k papers | ✓ | ✓ | ✓ (HNSW per facet) |
Embeddings appear at four points in the pipeline:
- Compute: embed/backends.py wraps BGE-M3 (local, default) or OpenAI text-embedding-3-*. Both produce L2-normalized 1024-dim vectors so cosine distance behaves well. embed/facet_embeddings.py:embed_extracted runs each non-null facet text through the embedder and returns {column_name: vector}.
- Store: every embedding column is vector(1024) in src/paperdb/sql/, wrapped in an HNSW index with cosine distance (USING hnsw (col vector_cosine_ops) WITH (m = 16, ef_construction = 64)).
- Query: query/engine.py holds a FACET_REGISTRY mapping facet names → (table, embedding_column, archetype). Each FacetQuery resolves to one ANN scan against one HNSW index. The <=> operator is pgvector's cosine distance; 1 - (col <=> qvec) is the similarity the engine ranks by.
- Evaluate: evaluate/facet_quality.py uses cosine similarity of embeddings to score free-text facets like problem_statement or method_summary, since string equality is a poor metric there.
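The compute step described above can be sketched roughly as follows. This is not the project's embed_extracted: the function names, the toy embedder, and the choice to skip empty strings as well as nulls are all assumptions made for illustration.

```python
import math
from typing import Callable, Dict, List, Optional

Vector = List[float]


def l2_normalize(vec: Vector) -> Vector:
    """Scale a vector to unit length; a zero vector is returned unchanged."""
    norm = math.sqrt(sum(x * x for x in vec))
    if norm == 0.0:
        return list(vec)
    return [x / norm for x in vec]


def embed_facets(
    facets: Dict[str, Optional[str]],
    embed: Callable[[str], Vector],
) -> Dict[str, Vector]:
    """Embed each non-empty facet text, keyed by its <facet>_embedding column."""
    return {
        f"{name}_embedding": l2_normalize(embed(text))
        for name, text in facets.items()
        if text  # skips None and (by assumption) empty strings
    }


def toy_embed(text: str) -> Vector:
    # Placeholder embedder, NOT BGE-M3: deterministic numbers for the demo only.
    return [float(len(text)), float(text.count(" ") + 1), 1.0]


vectors = embed_facets(
    {"method_summary": "sparse mixture of experts", "proof_technique": None},
    toy_embed,
)
print(sorted(vectors))
```

With unit-length vectors, cosine similarity reduces to a dot product, which is why normalizing once at write time keeps the 1 - (col <=> qvec) score well behaved at query time.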
The schema embeds only facets where semantic similarity is meaningful as a
query target: method_summary yes, code_url no; datasets yes (you want
fuzzy matching across "MS-MARCO" / "MS MARCO"), metrics no (you would query
that with SQL: WHERE (metric->>'value')::float > 0.85, where the cast is
needed because ->> returns text). That split, embed-and-ANN versus
store-and-SQL-filter, is the point of the relational + vector hybrid:
embeddings for fuzzy semantic axes, SQL for crisp structured ones. Most
queries use both at once via composable_search(queries=[...], filt=...).
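To illustrate how a fuzzy axis and a crisp axis can share one statement, here is a minimal sketch that builds a parameterized query. It is not the engine's actual SQL: the paper_id column, the allow-list contents, and the helper names are assumptions, and the placeholders follow psycopg's named-parameter style.

```python
from typing import Dict, List, Tuple

# Hypothetical allow-list of embedding columns on the papers core table.
ALLOWED_COLUMNS = {"method_summary_embedding", "problem_statement_embedding"}


def to_vector_literal(vec: List[float]) -> str:
    """Format a Python list in pgvector's text form, e.g. '[0.1,0.2]'."""
    return "[" + ",".join(repr(float(x)) for x in vec) + "]"


def build_hybrid_query(
    column: str, qvec: List[float], min_year: int, k: int
) -> Tuple[str, Dict[str, object]]:
    # Column names cannot be bound as parameters, so validate before interpolating.
    if column not in ALLOWED_COLUMNS:
        raise ValueError(f"unknown embedding column: {column}")
    sql = (
        f"SELECT paper_id, 1 - ({column} <=> %(qvec)s::vector) AS similarity "
        "FROM papers "
        "WHERE year > %(min_year)s "
        # Ordering by the raw distance (ascending) is what lets an HNSW index apply.
        f"ORDER BY {column} <=> %(qvec)s::vector "
        "LIMIT %(k)s"
    )
    params = {"qvec": to_vector_literal(qvec), "min_year": min_year, "k": k}
    return sql, params


sql, params = build_hybrid_query(
    "method_summary_embedding", [0.6, 0.8], min_year=2022, k=20
)
print(sql)
print(params)
```

The similarity in the SELECT list is only for display; the ORDER BY uses the distance expression itself so the planner can serve the ranking from the per-facet index.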
# 1. Bring up Postgres + pgvector
docker compose up -d
# 2. Install
pip install -e ".[embeddings,dev]"
# 3. Configure
cp .env.example .env # then edit
# 4. Apply schema
paperdb init-db
# 5. End-to-end on a small batch
paperdb run --n 20
# 6. Query
paperdb search "method_summary=mixture of experts" "datasets=ImageNet" --k 10
paperdb search "abstract=retrieval augmented generation" --archetype empirical_ml --domain nlp
# 7. Inspect
paperdb stats
paperdb facets

The Anthropic API key in .env is required for the LLM extraction. Embeddings
default to local BGE-M3 via sentence-transformers; switch to OpenAI by setting
PAPERDB_EMBEDDING_BACKEND=openai.
uvicorn paperdb.api:app --reload
# http://localhost:8000/docs

# Per-facet F1 against a hand-labeled validation set
paperdb eval --validation-set data/eval/validation.jsonl
# Faceted retrieval vs single-embedding baseline
paperdb eval-retrieval --benchmark data/eval/retrieval.jsonl --k 10
# Method x domain sparsity (gap detection)
paperdb gaps --min-count 2

src/paperdb/
  config.py     # pydantic-settings
  db.py         # connection pool, schema init
  models.py     # pydantic models for facets
  ingest/       # arXiv, S2, PDF
  extract/      # archetype classifier + 2-pass facet extraction + LLM client
  embed/        # BGE / OpenAI backends + per-facet embedding
  load/         # writes into papers + extension tables
  flows/        # Prefect DAGs
  query/        # composable search engine
  evaluate/     # F1 per facet, precision@k vs baseline
  gaps.py       # method x domain sparsity
  api.py        # FastAPI
  cli.py        # typer CLI
  prompts/      # markdown prompts (versioned)
  sql/          # DDL files applied by init-db
docs/
  archetypes.md        # taxonomy + per-archetype facet lists
  architecture.md      # data flow and design decisions
  data_dictionary.md   # every column, every table
notebooks/
  gaps_and_distributions.ipynb
tests/
See the proposal for the full 9-week plan. The repo currently implements
Weeks 1-7: schema + ingestion + extraction + embedding + loading +
composable query + evaluation. Week 8 (gap detection) and Week 9
(documentation polish) are in gaps.py and docs/ respectively.
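For readers unfamiliar with the gap-detection idea, method x domain sparsity can be sketched as counting papers per (method, domain) cell and flagging thin cells. This is an assumption-heavy illustration: the function name, the sample rows, and the reading of min-count as "cells with fewer than this many papers are gaps" are not confirmed by the repo.

```python
from collections import Counter
from itertools import product
from typing import List, Tuple


def find_gaps(rows: List[Tuple[str, str]], min_count: int = 2) -> List[Tuple[str, str, int]]:
    """Return (method, domain, count) for every cell with fewer than min_count papers."""
    counts = Counter(rows)
    methods = sorted({method for method, _ in rows})
    domains = sorted({domain for _, domain in rows})
    return [
        (method, domain, counts[(method, domain)])
        for method, domain in product(methods, domains)
        if counts[(method, domain)] < min_count
    ]


# Hypothetical (method, domain) pairs, one per paper.
rows = [
    ("moe", "nlp"),
    ("moe", "nlp"),
    ("moe", "vision"),
    ("diffusion", "vision"),
    ("diffusion", "vision"),
]
for method, domain, count in find_gaps(rows, min_count=2):
    print(f"{method} x {domain}: {count} paper(s)")
```

Only methods and domains that already appear somewhere in the corpus are crossed, so the output highlights combinations that are plausible but underexplored rather than every conceivable pairing.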