AI-powered document classification that herds files into organized folders
Drover uses LLMs to analyze documents and suggest consistent, policy-compliant filesystem paths and filenames. Named after herding dogs that drove livestock, Drover herds your scattered files into an organized folder structure.
- Multi-Provider AI — Works with Ollama (local), OpenAI, Anthropic, and OpenRouter
- Intelligent Classification — Categorizes documents by domain, category, and document type
- Smart Sampling — Adaptive page sampling for efficient processing of large documents
- Taxonomy System — Extensible controlled vocabularies with strict or fallback modes
- NARA-Compliant Naming — Generates standardized filenames: {doctype}-{vendor}-{subject}-{entity}-{date}.pdf. The entity slot (pet, patient, performer, brand) is optional and is dropped when empty, when it would duplicate the vendor, or for privacy-sensitive domains. The date slot is a real YYYYMMDD calendar date or the 00000000 no-date sentinel; partial-zero or impossible dates from the model collapse to the sentinel rather than entering the filename.
- macOS Tagging — Apply classification as native filesystem tags
- Batch Processing — Classify multiple documents with JSONL output
- Evaluation Framework — Measure accuracy against ground truth datasets
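The naming rule in the feature list can be illustrated with a minimal Python sketch. This is not Drover's implementation: the function names, the `privacy_sensitive` flag, and the sample values are hypothetical, and only the slot order, the entity-dropping conditions, and the 00000000 sentinel come from the description above.

```python
from datetime import datetime

NO_DATE = "00000000"  # the no-date sentinel described in the feature list


def normalize_date(raw: str) -> str:
    """Return raw if it is a real YYYYMMDD calendar date, else the sentinel."""
    if len(raw) != 8 or not raw.isdigit():
        return NO_DATE
    try:
        # strptime rejects partial-zero dates (20240100) and impossible ones (20230230).
        datetime.strptime(raw, "%Y%m%d")
    except ValueError:
        return NO_DATE
    return raw


def build_filename(doctype, vendor, subject, date, entity="",
                   privacy_sensitive=False, ext=".pdf"):
    """Join the slots with hyphens (hypothetical helper, not Drover's API)."""
    parts = [doctype, vendor, subject]
    # Drop the entity when it is empty, duplicates the vendor, or the
    # domain is privacy-sensitive (modeled here as a plain boolean).
    if entity and entity != vendor and not privacy_sensitive:
        parts.append(entity)
    parts.append(normalize_date(date))
    return "-".join(parts) + ext


if __name__ == "__main__":
    print(build_filename("statement", "chase", "checking", "20240115"))
    # prints "statement-chase-checking-20240115.pdf"
```

Validating the date with a real calendar parse, rather than a digit-count check alone, is what keeps values such as 20240100 out of the filename.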
- Python 3.14.x
- uv (package and environment manager)
- Ollama (for local inference) or API keys for cloud providers
To install drover as a global CLI on your system, see INSTALL.md. The steps below set up a development checkout for working on drover itself.
# Clone and sync the project (creates .venv and installs dependencies)
git clone https://github.com/ckrough/drover.git
cd drover
uv sync --extra docling
# Download Docling models (one-time, ~500 MB to ~/.cache/docling/models)
uv run docling-tools models download
Run the CLI through uv run drover ..., or activate the environment with source .venv/bin/activate to call drover directly.
Drover uses Docling as the sole document loader, with full-page OCR enabled on PDFs so vendor names carried in logos and embedded images reach the classifier. The [docling] install extra and the one-time model download above are required. If you skip the download, Docling's first run fetches models from Hugging Face on demand (a few hundred MB, internet required); subsequent runs are fully offline. Rationale and the format-coverage policy live in ADR-005 and ADR-006.
# Using local Ollama (default)
drover classify document.pdf
# Using OpenAI
export OPENAI_API_KEY="sk-..."
drover classify document.pdf --ai-provider openai --ai-model gpt-4o
Analyze documents and output suggested file paths:
drover classify invoice.pdf
drover classify *.pdf --batch # Multiple files, JSONL output
drover classify doc.pdf --metrics # Include AI metrics
drover classify doc.pdf --log-level verbose # Detailed logging
Classify and apply native filesystem tags:
drover tag document.pdf --dry-run # Preview tags
drover tag document.pdf --tag-fields domain,vendor
drover tag --tag-mode replace document.pdf # Replace existing tags
Classify, optionally tag, and move files into a destination tree. SRC may be a single file or a directory; --dest is required.
# Move every supported file under ~/Inbox into ~/Documents/filed.
drover organize ~/Inbox --dest ~/Documents/filed
# Dry-run preview to stdout (one JSONL record per file).
drover organize ~/Inbox --dest ~/Documents/filed --dry-run --report -
# Single-file invocation suitable for Hazel rules or macOS Folder Actions.
drover organize ~/Inbox/scan.pdf --dest ~/Documents/filed \
--tag-fields category,doctype \
--report ~/Library/Logs/drover/run-$(date +%Y%m%d-%H%M%S).jsonl
# Copy instead of move; source preserved.
drover organize ~/Inbox --dest ~/Documents/filed --copy
Behavior:
- Default conflict policy is skip: if {DEST}/{suggested_path} already exists, the source is left in place and the record is skipped_exists. Re-running on the same source is idempotent.
- --tag-fields is validated against domain, category, doctype, vendor, date, subject at parse time. Tags are written only to the destination Drover produces; the source file is never tagged.
- The JSONL report has the same format in --dry-run and live mode (status values are prefixed with would_ in dry-run). Records carry original_path, suggested_path, final_destination, status, tags_applied, and error.
- Stream discipline: log chatter goes to stderr (gated by DROVER_LOG_LEVEL); the JSONL report goes to the file you pass, or to stdout when --report -. Unsupported extensions surface as skipped_unsupported records and do not raise the exit code.
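Because the report is one JSON record per line, a run can be summarized with a few lines of Python. The sketch below is an illustration, not part of Drover: the `summarize_report` helper and the sample paths are hypothetical, and the value shapes (null `final_destination`, empty `tags_applied`) are assumptions. Only the field names and the two status values shown come from the behavior notes above.

```python
import json
from collections import Counter
from typing import Iterable


def summarize_report(lines: Iterable[str]) -> Counter:
    """Count report records by their status field, skipping blank lines."""
    counts = Counter()
    for line in lines:
        line = line.strip()
        if not line:
            continue
        record = json.loads(line)
        counts[record["status"]] += 1
    return counts


# Made-up records that use only the documented field names and statuses.
SAMPLE = [
    json.dumps({
        "original_path": "/tmp/inbox/scan001.pdf",
        "suggested_path": "financial/banking/statement/statement-chase-checking-20240115.pdf",
        "final_destination": None,
        "status": "skipped_exists",
        "tags_applied": [],
        "error": None,
    }),
    json.dumps({
        "original_path": "/tmp/inbox/notes.xyz",
        "suggested_path": None,
        "final_destination": None,
        "status": "skipped_unsupported",
        "tags_applied": [],
        "error": None,
    }),
]

if __name__ == "__main__":
    for status, count in sorted(summarize_report(SAMPLE).items()):
        print(status, count)
```

To summarize a real run, pass an open report file instead of `SAMPLE`, for example `summarize_report(open(path, encoding="utf-8"))`.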
A Hazel rule that watches ~/Inbox/, runs drover organize on every newly added file, and appends to a daily JSONL log:
drover organize "$1" \
--dest ~/Documents/filed \
--tag-fields category,doctype \
--report ~/Library/Logs/drover/$(date +%Y-%m-%d).jsonl
Equivalent macOS Folder Action shell:
for f in "$@"; do
/usr/local/bin/drover organize "$f" \
--dest ~/Documents/filed \
--tag-fields category,doctype \
--report ~/Library/Logs/drover/$(date +%Y-%m-%d).jsonl
done
Measure classification accuracy against ground truth:
drover evaluate eval/ground_truth/synthetic.jsonl
drover evaluate eval/ground_truth/synthetic.jsonl --output-format json
{
"original": "scan001.pdf",
"suggested_path": "financial/banking/statement/statement-chase-checking-20240115.pdf",
"domain": "financial",
"category": "banking",
"doctype": "statement",
"vendor": "chase",
"date": "20240115",
"subject": "checking"
}
| Variable | Description | Default |
|---|---|---|
| DROVER_AI_PROVIDER | AI provider (ollama, openai, anthropic, openrouter) | ollama |
| DROVER_AI_MODEL | Model name | gemma4:latest |
| DROVER_TAXONOMY | Classification taxonomy | household |
| DROVER_NAMING_STYLE | Filename policy | nara |
| DROVER_SAMPLE_STRATEGY | Page sampling (full, first_n, bookends, adaptive) | adaptive |
| DROVER_LOG_LEVEL | Logging verbosity (quiet, verbose, debug) | quiet |
Drover searches for configuration in order: --config PATH → drover.yaml → ~/.config/drover/config.yaml
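The lookup order can be sketched in Python as a first-match search. This is an assumption-laden illustration rather than Drover's code: `find_config` is a hypothetical name, drover.yaml is assumed to be resolved against the current working directory, and the sketch assumes that a missing --config path falls through to the next candidate, which the real CLI may instead treat as an error.

```python
from pathlib import Path
from typing import Optional


def find_config(explicit: Optional[Path] = None,
                cwd: Optional[Path] = None,
                home: Optional[Path] = None) -> Optional[Path]:
    """Return the first existing config file in the documented order."""
    cwd = cwd or Path.cwd()
    home = home or Path.home()
    candidates = []
    if explicit is not None:
        candidates.append(explicit)  # --config PATH
    candidates.append(cwd / "drover.yaml")  # assumed: relative to the working directory
    candidates.append(home / ".config" / "drover" / "config.yaml")
    for candidate in candidates:
        if candidate.is_file():
            return candidate
    return None  # no file found; environment variables and defaults would apply


if __name__ == "__main__":
    print(find_config())
```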
# drover.yaml
ai:
provider: openai
model: gpt-4o
temperature: 0.0
taxonomy: household
taxonomy_mode: fallback
naming_style: nara
concurrency: 4
| Provider | API Key Variable | Example Model |
|---|---|---|
| Ollama | — (local) | gemma4:latest |
| OpenAI | OPENAI_API_KEY | gpt-4o |
| Anthropic | ANTHROPIC_API_KEY | claude-sonnet-4-20250514 |
| OpenRouter | OPENROUTER_API_KEY | anthropic/claude-sonnet-4 |
The loader is Docling, so the supported set matches Docling's officially-supported formats per docs/usage/supported_formats. See ADR-006 for the audit.
| Category | Extensions |
|---|---|
| PDF | .pdf |
| Office (Open XML) | .docx, .xlsx, .pptx |
| Markup | .txt, .md, .html, .htm |
| Data | .csv |
| Images | .png, .jpg, .jpeg, .tiff, .tif, .bmp |
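When pre-filtering an inbox before handing files to Drover, the table above can be mirrored as a simple extension check. The snippet is a sketch: `is_supported` is a hypothetical helper, and the case-insensitive matching is an assumption about how extensions are compared, not documented Drover behavior.

```python
from pathlib import Path

# Extensions copied from the supported-formats table.
SUPPORTED_EXTENSIONS = frozenset({
    ".pdf",
    ".docx", ".xlsx", ".pptx",
    ".txt", ".md", ".html", ".htm",
    ".csv",
    ".png", ".jpg", ".jpeg", ".tiff", ".tif", ".bmp",
})


def is_supported(path) -> bool:
    """Return True when the file's extension appears in the table above."""
    # Lowercasing the suffix is an assumption made for this sketch.
    return Path(path).suffix.lower() in SUPPORTED_EXTENSIONS


if __name__ == "__main__":
    for name in ("scan001.pdf", "photo.JPG", "archive.zip"):
        print(name, is_supported(name))
```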
Drover follows a pipeline architecture with extensible plugin systems:
[Document] → [Loader] → [Classifier] → [PathBuilder] → [Output]
                ↓            ↓               ↓
            [Sampling]   [Taxonomy]    [NamingPolicy]
Tech Stack:
- CLI: Click
- LLM: LangChain with structured output
- Config: Pydantic
- Logging: structlog
# Install with dev dependencies
uv sync --all-extras
# Run tests
uv run pytest
# Run end-to-end smoke suite (LLM tests need Ollama; emits JSON report under smoke/reports/)
uv run python smoke/run.py # full suite (~2 min)
uv run python smoke/run.py --skip-llm # CLI + error-path tests only (~10s)
# See smoke/README.md for the test catalog and report schema.
# Lint and format
uv run ruff check src/ --fix && uv run ruff format src/
# Security scan
uv run bandit -r src/ -c pyproject.toml
See CONTRIBUTING.md for the full development workflow.
- Contributing Guide — Development setup, architecture, and extension guides
- ADR-001: Chain-of-Thought Prompting — 7-step reasoning for accurate classification
- ADR-002: Privacy-First Design — Local-first, zero telemetry approach
- ADR-003: NLI Classifier Roadmap — Zero-shot NLI exploration (superseded by ADR-004)
- ADR-004: Local LLM as Primary Local Path — Ollama gemma4 as the default local classifier
- ADR-005: Docling with Full-Page OCR as the Default PDF Loader — Structure-aware loading with OCR over logos and embedded images for accurate folder placement
Apache License 2.0 (Apache-2.0). See LICENSE for the full text.
Copyright (C) 2025 Backchain LLC
I'm Chris Krough. I build local-first AI tooling like Drover and write about putting AI to work in real systems. More of my work lives at dev.krough.org, and I'm on LinkedIn.
Drover is published by Backchain, my AI transformation consulting practice. Backchain helps teams discover where AI works. If document classification is one piece of a larger automation problem you're solving, that's the conversation Backchain is built for.
If Drover saves you time, a star on GitHub helps other people find it.
