An AI-powered solution for automatic document classification and extracting structured data from scanned documents
Ivan Yang Rodriguez Carranza
- Problem Definition
- Methodology
- Requirements
- Architecture
- Implementation
- Testing
- Setup & Usage
- Sample Outputs
Organizations handle vast amounts of paper-based documents every day (invoices, forms, contracts, receipts, and reports), and these documents contain valuable structured data trapped in image format. Manual data entry from these documents is time-consuming, error-prone, and expensive, while existing OCR solutions often cannot classify document types automatically or extract data in a structured, usable format. SmartDoc addresses this challenge with an AI-powered solution that extracts text from scanned documents, classifies them, and organizes the extracted data into structured formats, so organizations can digitize their document workflows efficiently and reliably.
The project is built step-by-step through four main stages, as shown in the diagram below:
graph LR
A["Requirements<br/>+<br/>Test Plan"] --> B[Architecture Design]
B --> C[Implementation]
C --> D[Testing]
style A fill:#e1f5fe
style B fill:#f3e5f5
style C fill:#e8f5e8
style D fill:#fff3e0
The following table outlines all project requirements organized by category, with unique identifiers and priority levels to guide testing:
| Category | ID | Description | Testing Priority |
|---|---|---|---|
| Functional | FR-001 | Accept image uploads (.jpg) of scanned documents | Low |
| Functional | FR-002 | Perform OCR text extraction from document images | High |
| Functional | FR-003 | Automatically identify document types (invoice, form, contract, etc.) | High |
| Functional | FR-004 | Extract key entities from documents | High |
| Functional | FR-005 | Provide API endpoint for document type identification and entity extraction | High |
| Functional | FR-006 | Pipeline processing of multiple document images | Medium |
| Functional | FR-007 | Save document, document type and extracted entities in database | Medium |
| Performance | NFR-001 | Average processing time per document in pipeline less than 1 second | Medium |
| Performance | NFR-002 | Achieve minimum 70% accuracy for document classification | High |
| Performance | NFR-003 | Achieve minimum 70% precision and recall for entity extraction | High |
| Performance | NFR-004 | Embedding processing optimized for MPS hardware acceleration | Low |
| Maintainability | NFR-005 | Easy configuration to change OCR, LLM, embedding, or database providers | Medium |
| Scalability | NFR-005 | Pipeline architecture designed for future scalability and horizontal scaling | Low |
| Technical | TR-001 | Django framework implementation | Low |
| Technical | TR-002 | Django APIView for single document type classification and entity extraction endpoint | Low |
| Technical | TR-003 | Django management commands for pipeline processing | Low |
| Technical | TR-004 | ChromaDB integration for storing documents and extracted entities | Low |
The testing strategy includes three types of tests:
- Smoke Tests: Validate the core functionality behind the high-priority requirements and provide rapid failure detection.
- Performance Tests: Validate that the pipeline and the endpoint meet the performance requirements.
- Evaluations: Assess the two main features (document processing pipeline and classification/entity extraction endpoint) in real-world scenarios to measure practical effectiveness.
| Test ID | Test Case | Expected Outcome | Requirement ID |
|---|---|---|---|
| ST-001 | OCR Text Extraction | OCR service extracts readable text from uploaded document images | FR-002 |
| ST-002 | Document Classification | LLM service correctly identifies document types (invoice, form, contract) with >70% accuracy | FR-003 |
| ST-003 | Entity Extraction | Analysis service extracts key entities (dates, amounts, names) with structured output | FR-004 |
| ST-004 | API Endpoint Functionality | API endpoint accepts document image uploads and returns document type and extracted entities in JSON format | FR-005 |
| ST-005 | Document Type Classification | System correctly identifies 4 document types | NFR-002 |
| Test ID | Test Case | Expected Outcome | Requirement ID |
|---|---|---|---|
| PT-001 | Pipeline Processing Performance | System processes 5000 documents through the pipeline in maximum 30 minutes (average <0.36 seconds per document) | NFR-001 |
| PT-002 | API Endpoint Response Time | Single document analysis via /analyze/ endpoint has average response time less than 5 seconds for standard document sizes | NFR-001 |
Note: Performance benchmarks are based on testing with MacBook Pro M4 with 14 cores. Results may vary depending on hardware and provider configurations.
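The per-document figure in PT-001 follows directly from the total budget. A minimal sketch of that arithmetic, where the function name is illustrative and not part of the project:

```python
def seconds_per_document(total_seconds: float, document_count: int) -> float:
    """Average wall-clock time spent on each document."""
    if document_count <= 0:
        raise ValueError("document_count must be positive")
    return total_seconds / document_count


# PT-001 target: 5000 documents in at most 30 minutes.
pt001_budget = seconds_per_document(30 * 60, 5000)
print(f"PT-001 budget: {pt001_budget:.2f} s/document")
```

The same helper can be applied to the totals of an actual run to compare measured throughput against the target.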
| Test ID | Test Case | Expected Outcome | Requirement ID |
|---|---|---|---|
| EV-001 | Endpoint Classification Accuracy | Using dataset split (70% indexed, 15% for testing (~700 documents), 15% for validation), endpoint achieves ≥70% accuracy in document classification | NFR-002, FR-003, FR-006 |
| EV-002 | Entity Extraction Precision & Recall | Using a test set of 100 documents with ground truth annotations, achieves Precision ≥ 0.70 (correctly extracted entities / total extracted entities) and Recall ≥ 0.70 (correctly extracted entities / total ground truth entities) | FR-004, NFR-003 |
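The precision and recall definitions in EV-002 can be sketched for a single document as follows. This is an illustration only: it assumes entities are flat key/value pairs and that an entity counts as correct only on an exact match, which the test plan does not specify. The function name and sample values are hypothetical.

```python
def precision_recall(extracted: dict[str, str], truth: dict[str, str]) -> tuple[float, float]:
    """Return (precision, recall) for one document's extracted entities."""
    # Assumption: an entity is correct only if key and value match exactly.
    correct = sum(1 for key, value in extracted.items() if truth.get(key) == value)
    precision = correct / len(extracted) if extracted else 0.0
    recall = correct / len(truth) if truth else 0.0
    return precision, recall


# Hypothetical example: 3 of 4 extracted entities are right, 5 were expected.
extracted = {"vendor": "ABC Company", "date": "2024-01-15",
             "total_amount": "150.00", "currency": "EUR"}
truth = {"vendor": "ABC Company", "date": "2024-01-15",
         "total_amount": "150.00", "currency": "USD",
         "invoice_number": "INV-2024-001"}

precision, recall = precision_recall(extracted, truth)
print(f"precision={precision:.2f} recall={recall:.2f}")
```

For a full evaluation run, the correct, extracted, and ground truth counts would be summed over all test documents before dividing.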
The architecture is organized in layers with six main components:
| Component | Description | Design Rationale | Requirements |
|---|---|---|---|
| Management Layer | Contains Django management command for pipeline processing of multiple document images | Command class with separate configuration class design enables flexible argument handling | TR-003, FR-006, TR-001 |
| API Layer | Exposes an /analyze/ endpoint for document classification and entity extraction | View class with request handler design separates HTTP processing from business logic | FR-001, FR-005, TR-002 |
| Pipeline Orchestration | Orchestrates a configurable document processing workflow through parallel tasks | Modular design enabling flexible workflow configuration and parallel processing with Prefect for future scalability | NFR-001, NFR-005 |
| Service Layer | Provides modular services for OCR text extraction, LLM inference, embeddings generation, and document analysis | Modular architecture for testability, reused across the API layer and Django commands to avoid duplicated responsibilities; makes it easy to add more providers for any service (OCR, LLM, embedding, vector DB) | FR-002, FR-003, FR-004, NFR-004, NFR-003, NFR-002 |
| Providers | Implements specific service providers (OCR, LLM, Embedding, Vector DB) and allows for easy addition of new providers | Interface-based design enables plug-and-play replacement of providers without changing service layer implementation | NFR-005 |
| Data Layer | Stores documents, document types and extracted entities using ChromaDB vector database | ChromaDB interface enables simple indexing and similarity search of data with support for implementing other vector databases in this layer | FR-007, TR-004 |
graph TB
subgraph "Management Layer"
CMD[Django Commands]
end
subgraph "API Layer"
API[DocumentAnalysisView]
ENDPOINT["/analyze/"]
end
subgraph "Pipeline Orchestration"
TASKS[Tasks]
CONFIG[Configuration]
FLOWS[Flows]
end
subgraph "Service Layer"
ANALYSIS[Analysis Service]
subgraph CORE_SERVICES ["Core Services"]
OCR[OCR Service]
LLM[LLM Service]
EMB[Embedding Service]
VDB[Vector DB Service]
end
end
subgraph PROVIDERS ["Providers"]
OCR_PROV[OCR Provider]
LLM_PROV[LLM Provider]
EMB_PROV[Embedding Provider]
VDB_PROV[Vector DB Provider]
OTHER[Other Providers]
end
subgraph "Data Layer"
CHROMA[ChromaDB]
end
ENDPOINT --> API
CMD --> FLOWS
FLOWS --> TASKS
CONFIG --> TASKS
API --> ANALYSIS
ANALYSIS --> CORE_SERVICES
TASKS --> CORE_SERVICES
OCR --> OCR_PROV
LLM --> LLM_PROV
EMB --> EMB_PROV
VDB --> VDB_PROV
VDB_PROV --> CHROMA
style CMD fill:#e3f2fd
style API fill:#fce4ec
style ENDPOINT fill:#fce4ec
style FLOWS fill:#f3e5f5
style TASKS fill:#f3e5f5
style CONFIG fill:#f3e5f5
style OCR fill:#e8f5e8
style LLM fill:#e8f5e8
style EMB fill:#e8f5e8
style ANALYSIS fill:#e8f5e8
style VDB fill:#e8f5e8
style OCR_PROV fill:#fff3e0
style LLM_PROV fill:#fff3e0
style EMB_PROV fill:#fff3e0
style VDB_PROV fill:#fff3e0
style OTHER fill:#fff3e0
style CHROMA fill:#fff3e0
- Backend: Python 3.11+
- Web Framework: Django + Django REST Framework
- OCR: Tesseract
- Workflow Orchestration: Prefect
- Vector Database: ChromaDB
- AI: OpenAI Responses API
- Embedding Models: SentenceTransformer (all-MiniLM-L6-v2), OpenCLIP
smartdoc/
├── api/                   # Django app with core functionality
│   ├── management/        # Django management commands
│   │   └── commands/      # Custom commands (process_documents)
│   ├── pipelines/         # Workflow orchestration
│   │   ├── tasks/         # Individual processing tasks
│   │   ├── flows/         # Prefect workflow definitions
│   │   └── config/        # Pipeline configuration
│   ├── services/          # Service module
│   │   ├── embedding/     # Embedding generation services
│   │   ├── llm/           # LLM inference services
│   │   ├── ocr/           # OCR text extraction services
│   │   ├── analysis/      # Classification and entity extraction for endpoint
│   │   └── vectordb/      # Vector database operations
│   ├── data/              # Data module
│   └── views.py           # API endpoints
├── smartdoc/              # Django project settings
├── tests/                 # Test suite for the application
├── notebooks/             # Development and testing notebooks
├── logs/                  # Output log examples from processing and testing
└── chromadb/              # ChromaDB vector database storage
Note: Only the key directories and files are shown; additional files are omitted for clarity.
Collection Purposes:
- smartdoc_documents: Main storage for processed documents with full text and entity data
- smartdoc_classifier_images: Image embeddings for visual similarity search during classification
- smartdoc_classifier_text: Text embeddings for textual similarity search during classification
- smartdoc_document_types: Reference data storing document type definitions and expected entities
The following diagram illustrates the database schema and structure of these collections:
graph LR
MAIN_DOCS["smartdoc_documents<br/><br/>id<br/>type<br/>ocr_text<br/>base64<br/>entities<br/>indexed_at"]
CLASSIFIER_IMG["smartdoc_classifier_images<br/><br/>id<br/>uuid<br/>type"]
CLASSIFIER_TEXT["smartdoc_classifier_text<br/><br/>id<br/>uuid<br/>type"]
DOC_TYPES["smartdoc_document_types<br/><br/>id<br/>type<br/>entities<br/>saved_at"]
MAIN_DOCS ~~~ CLASSIFIER_IMG ~~~ CLASSIFIER_TEXT ~~~ DOC_TYPES
style MAIN_DOCS fill:#f5f5f5
style CLASSIFIER_IMG fill:#f5f5f5
style CLASSIFIER_TEXT fill:#f5f5f5
style DOC_TYPES fill:#f5f5f5
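The classifier collections store embeddings together with a document type so that similar, already-indexed documents can inform classification. The sketch below illustrates that idea in plain Python with cosine similarity and a majority vote over the nearest neighbors. The voting rule, the value of k, and every function name are assumptions made for illustration; the project performs the similarity search through ChromaDB and may combine results differently.

```python
import math
from collections import Counter


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def classify_by_neighbors(query: list[float],
                          indexed: list[tuple[list[float], str]],
                          k: int = 3) -> str:
    """Return the most common type among the k most similar indexed items."""
    if not indexed:
        raise ValueError("no indexed documents to compare against")
    ranked = sorted(indexed,
                    key=lambda item: cosine_similarity(query, item[0]),
                    reverse=True)
    votes = Counter(doc_type for _, doc_type in ranked[:k])
    return votes.most_common(1)[0][0]


# Toy 2-dimensional "embeddings" standing in for real model output.
indexed = [
    ([1.0, 0.0], "invoice"),
    ([0.9, 0.1], "invoice"),
    ([0.0, 1.0], "letter"),
    ([0.1, 0.9], "letter"),
]
predicted = classify_by_neighbors([0.8, 0.2], indexed, k=3)
print(predicted)
```

Because the vote depends on the types stored at indexing time, wrong labels in the index propagate to later predictions, which is consistent with the accuracy gain reported when indexing with the true document types.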
The following diagram illustrates the complete document processing pipeline flow from start to finish:
graph LR
START[Start Pipeline] --> INIT_STAGE[Initialization Stage]
subgraph INIT_STAGE [Initialization - Parallel]
direction TB
SCAN[Scan Directory<br/>for Images]
INIT_VDB[Initialize<br/>Vector Database]
INIT_TEXT_EMB[Initialize Text<br/>Embedding Provider]
INIT_IMG_EMB[Initialize Image<br/>Embedding Provider]
SCAN ~~~ INIT_VDB
INIT_VDB ~~~ INIT_TEXT_EMB
INIT_TEXT_EMB ~~~ INIT_IMG_EMB
end
INIT_STAGE --> BATCH_PROC[Batch Processing]
subgraph BATCH_PROC [Document Processing]
direction TB
DOC_PIPELINE[Process Document Pipeline]
DOC_PIPELINE_N[Document ...]
DOC_PIPELINE_N_PLUS_1[Document N]
DOC_PIPELINE ~~~ DOC_PIPELINE_N
DOC_PIPELINE_N ~~~ DOC_PIPELINE_N_PLUS_1
end
subgraph DOC_PIPELINE [Document 1]
direction LR
OCR[OCR Text<br/>Extraction]
CLASSIFY[Document<br/>Classification]
ENTITIES[Entity<br/>Extraction]
TEXT_EMB[Generate Text<br/>Embedding]
IMG_EMB[Generate Image<br/>Embedding]
OCR --> CLASSIFY
OCR --> TEXT_EMB
OCR -.-> IMG_EMB
CLASSIFY --> ENTITIES
end
BATCH_PROC --> INDEX_STAGE[Indexing Stage]
subgraph INDEX_STAGE [Database Indexing]
INDEX_MAIN[Index to Main<br/>Collection]
INDEX_CLASSIFIER[Index to Classifier<br/>Collections]
INDEX_DOC_TYPES[Index to Document<br/>Types]
end
INDEX_STAGE --> COMPLETE[Pipeline Complete]
style START fill:#e8f5e8
style INIT_STAGE fill:#e1f5fe
style BATCH_PROC fill:#f3e5f5
style DOC_PIPELINE fill:#fff3e0
style DOC_PIPELINE_N fill:#fff3e0
style DOC_PIPELINE_N_PLUS_1 fill:#fff3e0
style INDEX_STAGE fill:#e8f5e8
style COMPLETE fill:#c8e6c9
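The per-document stage of the flow above can be sketched with the standard library. This is not the project's Prefect implementation: `ThreadPoolExecutor` stands in for Prefect's task runner, every function below is a hypothetical stub, and treating the image embedding as independent of the OCR text is an assumption about the dashed arrow in the diagram.

```python
from concurrent.futures import ThreadPoolExecutor


# Hypothetical stubs standing in for the real OCR, LLM, and embedding services.
def extract_text(path: str) -> str:
    return f"text of {path}"


def classify(text: str) -> str:
    return "invoice"


def extract_entities(text: str, doc_type: str) -> dict:
    return {"source_length": len(text)}


def embed_text(text: str) -> list[float]:
    return [float(len(text))]


def embed_image(path: str) -> list[float]:
    return [float(len(path))]


def process_document(path: str) -> dict:
    """OCR first, then classification, entities, and embeddings."""
    text = extract_text(path)
    doc_type = classify(text)  # entity extraction depends on the type
    return {
        "path": path,
        "type": doc_type,
        "ocr_text": text,
        "entities": extract_entities(text, doc_type),
        "text_embedding": embed_text(text),
        "image_embedding": embed_image(path),
    }


def run_pipeline(paths: list[str], max_workers: int = 4) -> list[dict]:
    """Process documents concurrently; results keep the input order."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(process_document, paths))


results = run_pipeline(["a.jpg", "b.jpg"])
print([r["type"] for r in results])
```

Indexing into the collections would follow as a separate step once all documents in a batch have been processed.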
Based on the smoke-test plan, the tests were run successfully with no errors. See logs/smoke_tests_run.log for details.
The process_documents command was run on docs-sm_samples containing 3494 images (70% of the dataset) and evaluated using 790 test images (15% of the dataset).
The processing of 4.6M completion tokens cost about $3.1 with GPT-4.1-mini.
The run logs (partial) are available in /logs/process_documents_run.log.
Results:
The total time for running the process_documents command was 6373 seconds (1.8 seconds per document). A higher batch size and number of workers would reduce the total time.
The random baseline accuracy for 16 document classes is 6.25% (1/16).
The accuracy achieved when evaluating the /analyze endpoint with the test images was 50.54%, which is higher than the random baseline. Important: Preliminary tests in the notebooks folder show that running the same test while indexing with the true document types increases the overall accuracy to 70.7% without any other changes.
Accuracy by Document Type:
| Document Type | Accuracy |
|---|---|
| advertisement | 32.7% |
| budget | 22.6% |
| email | 95.3% |
| file_folder | 23.9% |
| form | 10.2% |
| handwritten | 77.1% |
| invoice | 71.1% |
| letter | 46.8% |
| memo | 43.5% |
| news_article | 52.5% |
| presentation | 2.1% |
| questionnaire | 65.2% |
| resume | 47.8% |
| scientific_publication | 93.5% |
| scientific_report | 50.0% |
| specification | 86.7% |
As the table shows, the classifier achieves high accuracy for some categories but underperforms in others. The lower scores could result from limitations of the current OCR provider; extracting text in Markdown format might improve the detection of forms, budgets, and scientific reports. For advertisements, a multimodal LLM service and higher-quality image embeddings could increase accuracy, although at the expense of processing time and cost.
As mentioned above, when the true document type is used for indexing, accuracy increases to 70.7%. Better embeddings and OCR could increase it further.
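The per-type accuracy figures and the random baseline can be computed as in the following sketch. The function names and sample pairs are illustrative and are not taken from the project's evaluation code.

```python
from collections import defaultdict


def accuracy_by_type(pairs: list[tuple[str, str]]) -> dict[str, float]:
    """Map each true document type to the share of correct predictions."""
    totals: dict[str, int] = defaultdict(int)
    hits: dict[str, int] = defaultdict(int)
    for true_type, predicted_type in pairs:
        totals[true_type] += 1
        if predicted_type == true_type:
            hits[true_type] += 1
    return {doc_type: hits[doc_type] / totals[doc_type] for doc_type in totals}


def random_baseline(num_classes: int) -> float:
    """Expected accuracy of a uniform random guess."""
    return 1 / num_classes


# Hypothetical (true type, predicted type) pairs.
pairs = [("invoice", "invoice"), ("invoice", "form"),
         ("form", "letter"), ("memo", "memo")]
per_type = accuracy_by_type(pairs)
baseline = random_baseline(16)
print(per_type, f"baseline={baseline:.2%}")
```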
Note: Additional testing for entity extraction is still required.
- Python 3.11+ (3.8 or newer works, but the project is tested on 3.11)
- Git (to clone the repository)
- Tesseract OCR 5+ (command-line tool must be available on your PATH; e.g. brew install tesseract on macOS or sudo apt-get install tesseract-ocr on Ubuntu)
- OpenAI API key (set the OPENAI_API_KEY environment variable; required for LLM-powered classification and entity extraction)
- PyTorch (installed automatically with open-clip-torch from requirements.txt; having a GPU or Apple Silicon chip is optional but highly recommended for faster embeddings)
# 1. Clone and setup
git clone https://github.com/rodcar/smartdoc.git
cd smartdoc
python -m venv .venv
source .venv/bin/activate # Windows: .venv\Scripts\activate
# 2. Install dependencies
pip install -r requirements.txt
# 3. Install Tesseract OCR
# macOS: brew install tesseract
# Ubuntu: sudo apt-get install tesseract-ocr
# 4. Setup database
python manage.py migrate
# 5. (Optional) Disable embedding preloading for development
# export SMARTDOC_PRELOAD_EMBEDDINGS=false
# Start server
python manage.py runserver
# Test API
curl -X POST http://localhost:8000/api/analyze/ \
-F "file=@document.jpg"
Response:
{
"document_type": "invoice",
"entities": {
"total_amount": "150.00",
"date": "2024-01-15",
"vendor": "ABC Company"
},
"confidence": 0.85
}
# Process multiple documents
python manage.py process_documents /path/to/documents/
# Set dataset path for tests
export SMARTDOC_DATASET_PATH=/path/to/docs-sm
# Run all tests
python manage.py test tests
Input: invoice_sample.jpg
{
"document_type": "invoice",
"confidence": 0.92,
"entities": {
"vendor_name": "ABC Company",
"invoice_number": "INV-2024-001",
"invoice_date": "2024-01-15",
"total_amount": 132.00,
"currency": "USD"
}
}
Input: application_form.jpg
{
"document_type": "application_form",
"confidence": 0.88,
"entities": {
"applicant_name": "Sarah Johnson",
"phone_number": "(555) 123-4567",
"email": "sarah.johnson@email.com",
"address": "456 Oak Avenue, Springfield, IL 62701"
}
}
SmartDoc's modular architecture makes it easy to extend and customize:
# In api/services/analysis/document_classifier.py
def classify_document(self, text: str) -> str:
# Add new document types to the classification logic
new_types = ["receipt", "tax_document", "legal_contract"]
# Update classification prompt or model
SmartDoc's modular design allows you to easily swap out different service providers without changing the core logic. For example, you can replace Tesseract OCR with Google Vision API, or switch from OpenAI to local LLM models. Simply create a new service class that follows the same interface, then update the configuration settings to use your preferred provider.
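One way to picture the provider swap is an interface plus a registry keyed by a configuration value. Everything in this sketch is hypothetical: the class names, the `OCR_PROVIDER` setting, and the placeholder return values are illustrations and do not reflect the project's actual code or any real OCR client.

```python
from typing import Protocol


class OCRProvider(Protocol):
    """Interface every OCR provider is assumed to follow."""

    def extract_text(self, image_path: str) -> str: ...


class TesseractProvider:
    def extract_text(self, image_path: str) -> str:
        # Placeholder: a real provider would call the Tesseract engine here.
        return f"[tesseract] {image_path}"


class CloudVisionProvider:
    def extract_text(self, image_path: str) -> str:
        # Placeholder: a real provider would call a cloud OCR API here.
        return f"[cloud_vision] {image_path}"


# Registry mapping a configuration value to a provider class.
OCR_PROVIDERS: dict[str, type] = {
    "tesseract": TesseractProvider,
    "cloud_vision": CloudVisionProvider,
}


class OCRService:
    """Service layer code that depends only on the interface."""

    def __init__(self, provider: OCRProvider) -> None:
        self._provider = provider

    def extract_text(self, image_path: str) -> str:
        return self._provider.extract_text(image_path)


def build_ocr_service(settings: dict) -> OCRService:
    """Pick the provider named in settings, defaulting to Tesseract."""
    name = settings.get("OCR_PROVIDER", "tesseract")
    if name not in OCR_PROVIDERS:
        raise ValueError(f"unknown OCR provider: {name}")
    return OCRService(OCR_PROVIDERS[name]())


service = build_ocr_service({"OCR_PROVIDER": "cloud_vision"})
print(service.extract_text("document.jpg"))
```

With this shape, adding a provider means writing one class and one registry entry, while the service and its callers stay unchanged.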
The Prefect-based pipeline system lets you create custom document processing workflows for different use cases. You can build specialized workflows by adding new flow definitions in the api/pipelines/flows/ folder, create custom processing tasks in the api/pipelines/tasks/ folder, and configure workflow settings in api/pipelines/config/.
- Create a new service or unify the analysis service for both the endpoint and the pipeline.
- Implement a vector database provider that allows parallel indexing to reduce indexing time.
- Deploy the pipeline to the cloud to increase the number of workers and reduce processing time.
- Make model selection configurable.
- Experiment with different model sizes, which might help reduce processing time.
- Train smaller models to improve document classification accuracy and reduce processing time and costs.
- Use the stored entities for specific document types to enable entity-based search. Also, this could be used to improve document classification.
SmartDoc is licensed under the Apache License 2.0.
© 2025 Ivan Yang Rodriguez Carranza.