Skip to content

dhadgevedant/Invoice

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

2 Commits
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 

Repository files navigation

Invoice Bounding Box Detection & AI Chat System

A complete Python-based Invoice Processing System that accepts invoices (PDF, DOCX, PNG, JPG, JPEG), converts documents to layout-preserved image frames, runs OCR (PaddleOCR / EasyOCR / PyTesseract) with adaptive fallbacks, and morphologically extracts tables & cells using OpenCV.

New in this release: The system now features a Dual-Engine Extraction pipeline that combines rule-based heuristics with AI gap-filling, an intelligent categorizer, and a context-aware conversational AI chatbot with memory. Results are displayed in a premium, responsive Streamlit web interface that enables interactive field conflict resolution, chat functionality, and data exports.

image image

Key Features

  1. Multi-Format Ingestion:
    • PDF support with Poppler and PyMuPDF fallback (pure Python, zero binary dependency).
    • DOCX support with dynamic text and table renderer to PIL canvas.
    • Standard Image support (PNG, JPG, JPEG).
  2. Unified OCR Wrapper:
    • Select between PaddleOCR, EasyOCR, or PyTesseract.
    • Automatic cascade fallback (PaddleOCR ➔ EasyOCR ➔ PyTesseract) to ensure the system runs immediately on any environment.
  3. Dual-Engine Extraction & Conflict Resolution:
    • Rule-Based: Heuristic, Regex, & Proximity Field Matchers detect fields (Invoice Number, Dates, GSTIN, Amounts) and map tables using OpenCV morph kernels.
    • AI-Based: Uses Large Language Models (LLMs) to independently extract fields, automatically filling gaps missed by the rule engine.
    • Resolution Panel: Side-by-side UI to compare rule-based vs. AI-extracted values and manually resolve discrepancies.
image
  1. Context-Aware AI Chatbot:
    • Dedicated WhatsApp-style chat panel to interrogate your invoices.
    • Automatically injects document metadata, extracted fields, table grid text, and raw OCR text into the LLM system prompt for granular awareness.
    • Maintains conversation history (up to 20 messages).
  2. Intelligent Categorization & Risk Management:
    • Classifies invoices into 9 business categories and assigns expense tags using keyword strategies or LLM wrappers.
    • Risk Indicators: Warns about missing critical fields (e.g., due dates) and flags potential duplicate invoices.
  3. Multi-Provider LLM Wrapper & Database Backing:
    • Seamlessly plug in OpenAI (gpt-4o-mini), Google Gemini (gemini-1.5-flash), or Anthropic (claude-3-haiku).
    • Persistent storage using SQLAlchemy (defaults to SQLite, supports PostgreSQL).

Technology Stack

  • Python 3.11+
  • Streamlit (Web Application Interface & Chat UI)
  • OpenAI / Google Gemini / Anthropic APIs (LLM Integrations)
  • SQLAlchemy (Database ORM)
  • OpenCV (Morphological analysis and line extraction)
  • NumPy & Pandas (Data structures and tables)
  • Pillow (Image operations and alpha compositing)
  • python-docx & pdf2image & PyMuPDF (fitz) (Document parsing/rendering)
  • pytesseract, easyocr, paddleocr (OCR execution options)

Directory Structure

Invoice/
├── app.py                      # Core Streamlit Web Application Dashboard
├── requirements.txt            # Python Package Dependencies
├── config.py                   # Centralized Configuration (colors, regex, keywords)
├── .env                        # API Keys and Environment Variables
├── README.md                   # Installation & Setup documentation
├── modules/
│   ├── converter.py            # PDF/DOCX to Image Conversion Pipeline
│   ├── ocr_engine.py           # Unified OCR Engine Wrapper with auto-fallbacks
│   ├── field_detector.py       # Heuristic, Regex & Keyword Field Detection
│   ├── table_detector.py       # OpenCV Table Grid & Cell Mapping
│   └── bbox_drawer.py          # Bounding Box Overlays & Labels Canvas Drawer
├── services/
│   ├── invoice_service.py      # Core CRUD and Ingestion pipeline logic
│   ├── llm_service.py          # Multi-provider LLM API router (OpenAI/Gemini/Anthropic)
│   ├── categorization_service.py # Strategy pattern for auto-tagging
│   └── chat_service.py         # Context injection and conversation history manager
├── ui/
│   ├── sidebar.py              # File uploads, filters, and history lists
│   ├── invoice_viewer.py       # Document preview, conflict resolution, and data tabs
│   └── chat_panel.py           # Conversational AI interface
├── database/                   # SQLAlchemy models and schemas
├── tests/                      # Comprehensive unittest test suite
├── storage/                    # Persistent storage (invoices, annotations)
├── uploads/ & outputs/         # Temporary processing directories
└── data/                       # SQLite database location

Setup & Installation

1. Prerequisites (External Binaries)

For PDF parsing and Tesseract OCR to run correctly, install the system binaries:

macOS (via Homebrew)

# Install Poppler (for pdf2image)
brew install poppler

# Install Tesseract (for pytesseract fallback)
brew install tesseract

Ubuntu/Debian

# Install Poppler & Tesseract
sudo apt-get update
sudo apt-get install -y poppler-utils tesseract-ocr libtesseract-dev

Windows

  1. Poppler: Download poppler for Windows (e.g. from github/poppler-windows), extract it, and add the bin/ folder to your system PATH.
  2. Tesseract: Download the installer from UB Mannheim Tesseract, run installation, and add C:\Program Files\Tesseract-OCR to your system PATH.

2. Python Virtual Environment & Packages

# Clone or navigate to the project directory
cd Invoice

# Create virtual environment
python3 -m venv venv

# Activate virtual environment
# On macOS/Linux:
source venv/bin/activate
# On Windows (Command Prompt):
# venv\Scripts\activate.bat

# Upgrade pip
pip install --upgrade pip

# Install dependencies
pip install -r requirements.txt

Note: If you wish to use PaddleOCR, install paddlepaddle and paddleocr manually if they compile on your platform:

# Install PaddlePaddle CPU (or GPU version)
pip install paddlepaddle
# Install PaddleOCR
pip install paddleocr

If PaddleOCR fails to install due to compilation issues on macOS Apple Silicon, the system will seamlessly run EasyOCR or PyTesseract instead.


3. API Key Configuration

To enable the AI capabilities (Chatbot, LLM Extraction, Smart Categorization), configure your API keys.

Create a .env file in the root directory:

cp .env.example .env

Open .env and add your API key for at least one of the supported providers:

OPENAI_API_KEY=YOUR_OPENAI_KEY
GOOGLE_API_KEY=YOUR_GEMINI_KEY
ANTHROPIC_API_KEY=YOUR_CLAUDE_KEY

(The system will automatically detect which keys are available and use the corresponding models).


Running the Application

1. Run the Test Suite

Ensure all modules compile and pass their unit assertions:

python -m unittest tests/test_pipeline.py

2. Launch the Streamlit Dashboard

streamlit run app.py

Open http://localhost:8501 in your browser. Upload a document from the sidebar to process it through the dual-engine pipeline, resolve fields, and start chatting with your invoice!

About

InvoiceAI is an advanced invoice processing and data extraction platform built with Python, Streamlit, OpenCV, and SQLAlchemy. It features a robust ingestion pipeline for PDFs, DOCX, and images that uses a dual-engine extraction strategy—combining deterministic OpenCV/OCR layout rules with AI-driven gap-filling. It includes an interactive conflict

Resources

Stars

Watchers

Forks

Releases

No releases published

Packages

 
 
 

Contributors

Languages