A complete Python-based Invoice Processing System that accepts invoices (PDF, DOCX, PNG, JPG, JPEG), converts documents to layout-preserved image frames, runs OCR (PaddleOCR / EasyOCR / PyTesseract) with adaptive fallbacks, and morphologically extracts tables & cells using OpenCV.
New in this release: The system now features a Dual-Engine Extraction pipeline that combines rule-based heuristics with AI gap-filling, an intelligent categorizer, and a context-aware conversational AI chatbot with memory. Results are displayed in a premium, responsive Streamlit web interface that enables interactive field conflict resolution, chat functionality, and data exports.
- Multi-Format Ingestion:
- PDF support with Poppler and PyMuPDF fallback (pure Python, zero binary dependency).
- DOCX support with dynamic text and table renderer to PIL canvas.
- Standard Image support (PNG, JPG, JPEG).
- Unified OCR Wrapper:
- Select between PaddleOCR, EasyOCR, or PyTesseract.
- Automatic cascade fallback (PaddleOCR ➔ EasyOCR ➔ PyTesseract) to ensure the system runs immediately on any environment.
- Dual-Engine Extraction & Conflict Resolution:
- Rule-Based: Heuristic, Regex, & Proximity Field Matchers detect fields (Invoice Number, Dates, GSTIN, Amounts) and map tables using OpenCV morph kernels.
- AI-Based: Uses Large Language Models (LLMs) to independently extract fields, automatically filling gaps missed by the rule engine.
- Resolution Panel: Side-by-side UI to compare rule-based vs. AI-extracted values and manually resolve discrepancies.
- Context-Aware AI Chatbot:
- Dedicated WhatsApp-style chat panel to interrogate your invoices.
- Automatically injects document metadata, extracted fields, table grid text, and raw OCR text into the LLM system prompt for granular awareness.
- Maintains conversation history (up to 20 messages).
- Intelligent Categorization & Risk Management:
- Classifies invoices into 9 business categories and assigns expense tags using keyword strategies or LLM wrappers.
- Risk Indicators: Warns about missing critical fields (e.g., due dates) and flags potential duplicate invoices.
- Multi-Provider LLM Wrapper & Database Backing:
- Seamlessly plug in OpenAI (
gpt-4o-mini), Google Gemini (gemini-1.5-flash), or Anthropic (claude-3-haiku). - Persistent storage using SQLAlchemy (defaults to SQLite, supports PostgreSQL).
- Seamlessly plug in OpenAI (
- Python 3.11+
- Streamlit (Web Application Interface & Chat UI)
- OpenAI / Google Gemini / Anthropic APIs (LLM Integrations)
- SQLAlchemy (Database ORM)
- OpenCV (Morphological analysis and line extraction)
- NumPy & Pandas (Data structures and tables)
- Pillow (Image operations and alpha compositing)
- python-docx & pdf2image & PyMuPDF (fitz) (Document parsing/rendering)
- pytesseract, easyocr, paddleocr (OCR execution options)
Invoice/
├── app.py # Core Streamlit Web Application Dashboard
├── requirements.txt # Python Package Dependencies
├── config.py # Centralized Configuration (colors, regex, keywords)
├── .env # API Keys and Environment Variables
├── README.md # Installation & Setup documentation
├── modules/
│ ├── converter.py # PDF/DOCX to Image Conversion Pipeline
│ ├── ocr_engine.py # Unified OCR Engine Wrapper with auto-fallbacks
│ ├── field_detector.py # Heuristic, Regex & Keyword Field Detection
│ ├── table_detector.py # OpenCV Table Grid & Cell Mapping
│ └── bbox_drawer.py # Bounding Box Overlays & Labels Canvas Drawer
├── services/
│ ├── invoice_service.py # Core CRUD and Ingestion pipeline logic
│ ├── llm_service.py # Multi-provider LLM API router (OpenAI/Gemini/Anthropic)
│ ├── categorization_service.py # Strategy pattern for auto-tagging
│ └── chat_service.py # Context injection and conversation history manager
├── ui/
│ ├── sidebar.py # File uploads, filters, and history lists
│ ├── invoice_viewer.py # Document preview, conflict resolution, and data tabs
│ └── chat_panel.py # Conversational AI interface
├── database/ # SQLAlchemy models and schemas
├── tests/ # Comprehensive unittest test suite
├── storage/ # Persistent storage (invoices, annotations)
├── uploads/ & outputs/ # Temporary processing directories
└── data/ # SQLite database location
For PDF parsing and Tesseract OCR to run correctly, install the system binaries:
# Install Poppler (for pdf2image)
brew install poppler
# Install Tesseract (for pytesseract fallback)
brew install tesseract# Install Poppler & Tesseract
sudo apt-get update
sudo apt-get install -y poppler-utils tesseract-ocr libtesseract-dev- Poppler: Download poppler for Windows (e.g. from github/poppler-windows), extract it, and add the
bin/folder to your system PATH. - Tesseract: Download the installer from UB Mannheim Tesseract, run installation, and add
C:\Program Files\Tesseract-OCRto your system PATH.
# Clone or navigate to the project directory
cd Invoice
# Create virtual environment
python3 -m venv venv
# Activate virtual environment
# On macOS/Linux:
source venv/bin/activate
# On Windows (Command Prompt):
# venv\Scripts\activate.bat
# Upgrade pip
pip install --upgrade pip
# Install dependencies
pip install -r requirements.txtNote: If you wish to use PaddleOCR, install paddlepaddle and paddleocr manually if they compile on your platform:
# Install PaddlePaddle CPU (or GPU version)
pip install paddlepaddle
# Install PaddleOCR
pip install paddleocrIf PaddleOCR fails to install due to compilation issues on macOS Apple Silicon, the system will seamlessly run EasyOCR or PyTesseract instead.
To enable the AI capabilities (Chatbot, LLM Extraction, Smart Categorization), configure your API keys.
Create a .env file in the root directory:
cp .env.example .envOpen .env and add your API key for at least one of the supported providers:
OPENAI_API_KEY=YOUR_OPENAI_KEY
GOOGLE_API_KEY=YOUR_GEMINI_KEY
ANTHROPIC_API_KEY=YOUR_CLAUDE_KEY(The system will automatically detect which keys are available and use the corresponding models).
Ensure all modules compile and pass their unit assertions:
python -m unittest tests/test_pipeline.pystreamlit run app.pyOpen http://localhost:8501 in your browser. Upload a document from the sidebar to process it through the dual-engine pipeline, resolve fields, and start chatting with your invoice!