AWS Document Intelligence Platform

Upload documents, store them securely in AWS S3, process them asynchronously with a Lambda-compatible worker, and explore extracted text and metadata through a real-time React dashboard.


Overview

DocIntel is a production-style, full-stack cloud application that demonstrates end-to-end software engineering across the backend, frontend, cloud infrastructure, and DevOps layers.

Users register, log in, and upload documents (PDF, DOCX, TXT, or images). Files are stored in AWS S3, a processing job is dispatched asynchronously, and the system extracts text, counts words, detects language, and generates a plain-text summary. The React dashboard polls for status every 3 seconds and renders the results when processing finishes.
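As a rough illustration of the metadata step described above, the sketch below counts words and builds a naive summary from the first sentences of the extracted text. The names `DocumentMetadata` and `analyze_text` are hypothetical, the leading-sentences summary is an assumption about what "plain-text summary" could mean, and language detection is left out because the README does not say which library handles it.

```python
import re
from dataclasses import dataclass


@dataclass
class DocumentMetadata:
    word_count: int
    summary: str


def analyze_text(text: str, max_sentences: int = 2) -> DocumentMetadata:
    """Count words and build a naive summary from the leading sentences."""
    words = re.findall(r"\b\w+\b", text)
    # Split after sentence-ending punctuation that is followed by whitespace.
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s.strip()]
    summary = " ".join(sentences[:max_sentences])
    return DocumentMetadata(word_count=len(words), summary=summary)


meta = analyze_text("Invoices are due in 30 days. Late fees apply. Contact billing for help.")
print(meta.word_count, "|", meta.summary)
```

The actual processor may compute these values differently; the point is only that both fields can be derived from the extracted text without extra services.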

Key engineering decisions:

  • The document processor runs as a background thread locally (USE_LOCAL_WORKER=true) and as an AWS Lambda function in production — the same logic in both places.
  • LocalStack mocks AWS S3 in the local Docker Compose stack, so no AWS account is needed to develop or test.
  • SQLite in-memory is used for the test suite; PostgreSQL is used in development (via Docker Compose) and in production.
  • Infrastructure is fully defined in Terraform, covering S3, IAM roles, Lambda, and CloudWatch.
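The first decision above, one processor with two execution modes, could be wired roughly as follows. This is a minimal sketch, not the repository's code: `dispatch_processing`, `invoke_lambda_async`, and the `{"document_id": ...}` payload shape are assumptions, while `USE_LOCAL_WORKER` and `LAMBDA_FUNCTION_NAME` come from the environment variables listed later in this README.

```python
import os
import threading
from typing import Callable


def dispatch_processing(
    document_id: str,
    process: Callable[[str], None],
    invoke_lambda: Callable[[str], None],
) -> str:
    """Run `process` in a background thread locally, else hand off to Lambda."""
    if os.getenv("USE_LOCAL_WORKER", "false").lower() == "true":
        thread = threading.Thread(target=process, args=(document_id,), daemon=True)
        thread.start()
        return "thread"
    invoke_lambda(document_id)
    return "lambda"


def invoke_lambda_async(document_id: str) -> None:
    """Fire-and-forget Lambda invocation (needs boto3 and AWS credentials)."""
    import json

    import boto3  # imported lazily so local mode does not depend on it

    boto3.client("lambda").invoke(
        FunctionName=os.environ["LAMBDA_FUNCTION_NAME"],
        InvocationType="Event",  # asynchronous: returns without waiting for the result
        Payload=json.dumps({"document_id": document_id}).encode(),  # assumed payload shape
    )
```

Passing the two strategies in as callables keeps the branching testable without AWS access, for example `dispatch_processing("doc-1", process_document, invoke_lambda_async)` with a hypothetical `process_document` function.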

Architecture

graph TD
    Browser["🖥️ Browser\nReact + TypeScript"]
    API["⚙️ FastAPI\nREST API + JWT Auth"]
    DB[("🗄️ PostgreSQL\nUsers · Documents")]
    S3[("☁️ AWS S3\nDocument Storage")]
    Lambda["λ Lambda Worker\nAsync Processor"]

    Browser -->|"HTTPS / JWT"| API
    API -->|"SQLAlchemy ORM"| DB
    API -->|"boto3 upload"| S3
    API -->|"invoke (prod)\nor thread (dev)"| Lambda
    Lambda -->|"boto3 download"| S3
    Lambda -->|"PATCH status + text"| API

    subgraph Local["Local Dev (Docker Compose)"]
        direction TB
        DB
        LocalStack["🧪 LocalStack\nS3 mock on :4566"]
    end

    subgraph AWS["AWS Production"]
        direction TB
        S3
        Lambda
        CW["📋 CloudWatch Logs"]
        Lambda --> CW
    end

Request Flow

| Step | What happens |
| --- | --- |
| 1 | User registers or logs in → receives a JWT |
| 2 | User uploads a file → API validates size/type, streams to S3, creates a DB record (status=pending) |
| 3 | API triggers the processor (background thread in dev, Lambda invoke in production) |
| 4 | Processor downloads the file from S3, extracts text and metadata |
| 5 | DB record updated (status=completed) with word count, page count, language, summary, and extracted text |
| 6 | Frontend polls every 3 s and renders results when processing finishes |
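Step 6 is a plain polling loop. The dashboard implements it in React; the sketch below restates the same idea in Python for a script or test client. `poll_until_done` is a hypothetical helper, and the `failed` status is an assumption (the README only names `pending` and `completed`).

```python
import time
from typing import Callable

TERMINAL_STATUSES = {"completed", "failed"}  # "failed" is an assumed status name


def poll_until_done(
    fetch_status: Callable[[], str],
    interval_s: float = 3.0,
    max_attempts: int = 100,
    sleep: Callable[[float], None] = time.sleep,
) -> str:
    """Call `fetch_status` every `interval_s` seconds until a terminal status."""
    for _ in range(max_attempts):
        status = fetch_status()
        if status in TERMINAL_STATUSES:
            return status
        sleep(interval_s)
    raise TimeoutError("document did not finish processing in time")


# Simulated status sequence; a real client would call the document detail endpoint.
statuses = iter(["pending", "pending", "completed"])
print(poll_until_done(lambda: next(statuses), sleep=lambda _s: None))
```

Capping the number of attempts avoids polling forever if a job is stuck in `pending`.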

Features

| Feature | Detail |
| --- | --- |
| JWT Authentication | Register + login with bcrypt-hashed passwords; stateless Bearer tokens |
| Document Upload | Drag-and-drop modal with upload progress bar; 50 MB limit |
| Supported Formats | PDF, DOCX, TXT, PNG, JPG/JPEG |
| Async Processing | Background thread (dev) or AWS Lambda (prod); status polling from the UI |
| Text Extraction | Full extracted text via PyPDF, python-docx, and Pillow |
| Metadata Analysis | Word count, page count, language detection, plain-text summary |
| Real-Time Dashboard | Stats cards, document table, live status badge updates |
| Document Detail | Full extracted text, summary, metadata; presigned S3 download URL |
| Delete Documents | Removes DB record and S3 object atomically |
| AWS Infrastructure | S3 (encrypted, versioned), IAM least-privilege roles, Lambda, CloudWatch |
| Local AWS Mock | LocalStack emulates S3, so no AWS account is needed for local dev |
| Seed Data | make seed loads four sample documents (various statuses) |
| CI Pipeline | GitHub Actions: ruff lint, mypy type check, 25 pytest tests, frontend build, Trivy security scan |
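The upload rules in the table (50 MB limit, six accepted extensions) amount to a small validation function. The sketch below is one way to express them; `validate_upload` is a hypothetical name, checking by file extension rather than content type is an assumption, and so is treating "50 MB" as 50 × 1024 × 1024 bytes.

```python
from pathlib import Path

MAX_UPLOAD_BYTES = 50 * 1024 * 1024  # 50 MB; binary megabytes are an assumption
ALLOWED_EXTENSIONS = {".pdf", ".docx", ".txt", ".png", ".jpg", ".jpeg"}


def validate_upload(filename: str, size_bytes: int) -> None:
    """Raise ValueError if the file type or size is not acceptable."""
    ext = Path(filename).suffix.lower()
    if ext not in ALLOWED_EXTENSIONS:
        raise ValueError(f"unsupported file type: {ext or '(none)'}")
    if size_bytes <= 0:
        raise ValueError("file is empty")
    if size_bytes > MAX_UPLOAD_BYTES:
        raise ValueError("file exceeds the 50 MB limit")


validate_upload("report.PDF", 1024)  # passes: the extension check is case-insensitive
```

In a FastAPI endpoint, a failure like this would typically be translated into a 4xx response before anything is streamed to S3.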

Tech Stack

| Layer | Technology |
| --- | --- |
| Frontend | React 18, TypeScript 5, Vite, Tailwind CSS, Axios, React Router v6 |
| Backend | FastAPI 0.115, Python 3.11/3.12, SQLAlchemy 2.0, Alembic, Pydantic v2 |
| Auth | JWT via python-jose, bcrypt via passlib |
| Database | PostgreSQL 16 (prod/dev) · SQLite in-memory (tests) |
| Cloud Storage | AWS S3 · LocalStack (local dev) |
| Document Processing | PyPDF, python-docx, Pillow |
| Async Worker | AWS Lambda (Python 3.12) · background thread (local mode) |
| Infrastructure | Terraform ≥ 1.6 |
| Containers | Docker, Docker Compose |
| CI/CD | GitHub Actions |
| Testing | pytest 8, pytest-cov, httpx TestClient, SQLite in-memory fixtures |
| Linting / Types | ruff, mypy, ESLint, TypeScript strict mode |
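To show what a stateless Bearer token involves, here is a standard-library sketch of HS256 signing and verification. The project itself uses python-jose for this; the code below deliberately swaps in `hmac` and `hashlib` so it runs without dependencies, and `create_token` / `verify_token` are hypothetical names rather than functions from the repository.

```python
import base64
import hashlib
import hmac
import json
import time


def _b64url(data: bytes) -> str:
    return base64.urlsafe_b64encode(data).rstrip(b"=").decode("ascii")


def _b64url_decode(text: str) -> bytes:
    return base64.urlsafe_b64decode(text + "=" * (-len(text) % 4))


def create_token(subject: str, secret: str, ttl_s: int = 3600) -> str:
    """Build header.payload.signature, signed with HMAC-SHA256."""
    header = {"alg": "HS256", "typ": "JWT"}
    payload = {"sub": subject, "exp": int(time.time()) + ttl_s}
    signing_input = ".".join(
        _b64url(json.dumps(part, separators=(",", ":")).encode())
        for part in (header, payload)
    )
    signature = hmac.new(secret.encode(), signing_input.encode(), hashlib.sha256).digest()
    return f"{signing_input}.{_b64url(signature)}"


def verify_token(token: str, secret: str) -> dict:
    """Return the payload if the signature matches and the token has not expired."""
    try:
        signing_input, signature = token.rsplit(".", 1)
        payload_b64 = signing_input.split(".")[1]
    except (ValueError, IndexError):
        raise ValueError("malformed token") from None
    expected = hmac.new(secret.encode(), signing_input.encode(), hashlib.sha256).digest()
    if not hmac.compare_digest(expected, _b64url_decode(signature)):
        raise ValueError("bad signature")
    payload = json.loads(_b64url_decode(payload_b64))
    if payload["exp"] < time.time():
        raise ValueError("token expired")
    return payload


token = create_token("alice@example.com", "dev-only-secret")
print(verify_token(token, "dev-only-secret")["sub"])
```

Because the server only needs the signing secret (`JWT_SECRET_KEY`) to verify a token, no session state has to be stored, which is what "stateless" refers to in the feature list.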

Project Structure

aws-document-intelligence-platform/
│
├── .github/
│   └── workflows/ci.yml          # Lint → type check → test → build → security scan
│
├── frontend/                     # React + TypeScript + Vite
│   ├── src/
│   │   ├── components/           # Reusable UI (StatusBadge, Navbar)
│   │   ├── pages/                # Auth pages, Dashboard, Document detail
│   │   ├── services/api.ts       # Axios client with JWT interceptor
│   │   ├── hooks/useAuth.ts      # Auth state and helpers
│   │   ├── types/index.ts        # Shared TypeScript interfaces
│   │   └── utils/format.ts       # formatBytes, formatDate, status helpers
│   ├── Dockerfile                # Multi-stage: dev server + Nginx production image
│   └── package.json
│
├── backend/                      # FastAPI application
│   ├── app/
│   │   ├── api/v1/endpoints/     # auth.py · documents.py
│   │   ├── core/                 # config.py · database.py · security.py
│   │   ├── models/               # SQLAlchemy ORM: User · Document
│   │   ├── schemas/              # Pydantic request/response models
│   │   └── services/             # s3_service · document_processor · auth_service
│   ├── tests/
│   │   ├── unit/                 # Security and processor unit tests
│   │   └── integration/          # Auth and document API tests (25 total)
│   ├── alembic/                  # Database migration scripts
│   ├── Dockerfile
│   └── requirements*.txt
│
├── worker/                       # AWS Lambda document processor
│   ├── lambda_function.py        # Lambda handler (S3 trigger + direct invoke)
│   └── processor.py              # Core logic; no FastAPI/SQLAlchemy dependency
│
├── terraform/                    # AWS infrastructure as code
│   ├── main.tf                   # Provider + backend config
│   ├── variables.tf / outputs.tf
│   ├── s3.tf                     # Bucket: versioning, encryption, lifecycle, S3 trigger
│   ├── iam.tf                    # Lambda execution role + backend policy (least-privilege)
│   └── lambda.tf                 # Function, S3 event trigger, CloudWatch log group
│
├── scripts/
│   ├── seed_data.py              # Load sample documents into local DB
│   └── localstack-init.sh        # Bootstrap S3 bucket inside LocalStack
│
├── docker-compose.yml            # PostgreSQL + Backend + Frontend + LocalStack
├── Makefile                      # Developer commands (make dev, test, lint, seed…)
├── .env.example                  # All environment variables with descriptions
└── README.md
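The tree notes that `worker/lambda_function.py` handles both an S3 trigger and a direct invoke. A handler that accepts both event shapes might look like the sketch below. The `Records[].s3.bucket.name` / `s3.object.key` layout is the standard S3 notification format; the direct-invoke shape `{"bucket": ..., "key": ...}` and the helper name `extract_targets` are assumptions.

```python
from urllib.parse import unquote_plus


def extract_targets(event: dict) -> list[dict]:
    """Normalize an S3 notification or a direct-invoke payload into work items."""
    if "Records" in event:
        targets = []
        for record in event["Records"]:
            s3 = record["s3"]
            targets.append({
                "bucket": s3["bucket"]["name"],
                # Object keys arrive URL-encoded in S3 event notifications.
                "key": unquote_plus(s3["object"]["key"]),
            })
        return targets
    # Assumed direct-invoke shape: {"bucket": "...", "key": "..."}
    return [{"bucket": event["bucket"], "key": event["key"]}]


def lambda_handler(event: dict, context: object) -> dict:
    targets = extract_targets(event)
    # A real handler would download each object and run the processor here.
    return {"processed": len(targets), "targets": targets}
```

Normalizing the event first keeps the processing code independent of how the function was triggered.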

Local Setup

Prerequisites

Install the following:

| Tool | Purpose |
| --- | --- |
| Docker Desktop | Runs the full local stack |
| Node.js 20+ | Frontend development and builds |
| Python 3.11 or 3.12 | Backend scripts and tests |
| Make | Convenience commands |
| Git | Version control |

1. Clone the Repository

git clone https://github.com/Kylefan123/aws-document-intelligence-platform.git
cd aws-document-intelligence-platform

2. Create the Environment File

cp .env.example .env

The default values in .env.example are enough for local Docker development.


3. Start Docker Desktop

Open Docker Desktop and wait until the engine is running.


4. Start the Full Stack

make dev

This starts the full Docker Compose environment:

| Service | Local URL |
| --- | --- |
| Frontend | http://localhost:5173 |
| Backend API | http://localhost:8000 |
| API Docs | http://localhost:8000/docs |
| LocalStack | http://localhost:4566 |
| PostgreSQL | localhost:5432 |

Keep this terminal open because it displays the live Docker logs.


5. Run Database Migrations

Open a second terminal tab from the project root and run:

make migrate

6. Optional: Load Sample Data

You can either create an account manually in the UI or load sample users and documents.

From the project root:

cd backend
python3.11 -m venv .venv
source .venv/bin/activate
pip install -r requirements-dev.txt
cd ..
backend/.venv/bin/python scripts/seed_data.py

Demo account:

Email: alice@example.com
Password: password123

Second demo account:

Email: bob@example.com
Password: password123

7. Open the Application

Frontend:

http://localhost:5173

Backend Swagger docs:

http://localhost:8000/docs

API Documentation

FastAPI automatically generates interactive Swagger documentation at:

http://localhost:8000/docs

The API includes endpoints for:

| Area | Endpoints |
| --- | --- |
| Auth | Register, login, current user |
| Documents | Upload, list, stats, detail, delete, download URL |
| Health | Health check |

Endpoints:

POST   /api/v1/auth/register
POST   /api/v1/auth/login
GET    /api/v1/auth/me
POST   /api/v1/documents/upload
GET    /api/v1/documents/
GET    /api/v1/documents/stats
GET    /api/v1/documents/{document_id}
DELETE /api/v1/documents/{document_id}
GET    /api/v1/documents/{document_id}/download-url
GET    /health
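Authenticated endpoints expect the JWT as a Bearer token. The sketch below builds such a request with the standard library only; `build_request` is a hypothetical helper, the token value is a placeholder, and request and response bodies are not shown because their exact schemas are defined in the Swagger docs rather than in this README.

```python
import json
import urllib.request
from typing import Optional

BASE_URL = "http://localhost:8000"  # local backend URL from the setup section


def build_request(
    method: str,
    path: str,
    token: Optional[str] = None,
    body: Optional[dict] = None,
) -> urllib.request.Request:
    """Build a JSON request, attaching the JWT as a Bearer token if given."""
    headers = {"Accept": "application/json"}
    data = None
    if token:
        headers["Authorization"] = f"Bearer {token}"
    if body is not None:
        data = json.dumps(body).encode()
        headers["Content-Type"] = "application/json"
    return urllib.request.Request(BASE_URL + path, data=data, headers=headers, method=method)


req = build_request("GET", "/api/v1/documents/stats", token="<jwt-from-login>")
print(req.get_method(), req.full_url)
# Once the stack is running, send it with urllib.request.urlopen(req).
```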

Running Tests

Backend tests use SQLite in-memory and mocked services, so they require neither an AWS account nor any running AWS services.

cd backend
python3.11 -m venv .venv
source .venv/bin/activate
pip install -r requirements-dev.txt
pytest tests/ -v

Run tests with coverage:

pytest tests/ -v --cov=app --cov-report=term-missing
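"Mocked services" means the tests replace the S3 client with a stand-in. A minimal sketch of that pattern, using `unittest.mock` from the standard library, is shown below. `store_document` is a hypothetical helper, not a function from this repository; `put_object(Bucket=..., Key=..., Body=...)` is the boto3 S3 client call it is assumed to wrap.

```python
from unittest.mock import MagicMock


def store_document(s3_client, bucket: str, key: str, data: bytes) -> str:
    """Hypothetical helper: upload bytes and return the object key."""
    s3_client.put_object(Bucket=bucket, Key=key, Body=data)
    return key


def test_store_document_calls_s3():
    fake_s3 = MagicMock()  # stands in for boto3.client("s3")
    key = store_document(fake_s3, "docs-bucket", "uploads/a.txt", b"hello")
    assert key == "uploads/a.txt"
    fake_s3.put_object.assert_called_once_with(
        Bucket="docs-bucket", Key="uploads/a.txt", Body=b"hello"
    )
```

pytest collects any function named `test_*`, so a file containing this sketch would run alongside the existing suite without network access.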

Linting and Type Checking

Backend:

cd backend
ruff check app/ tests/
mypy app/ --ignore-missing-imports

Frontend:

cd frontend
npm install
npm run build
npm run lint

Docker Commands

Start the full stack:

make dev

Start services in detached mode:

make up

Stop services:

make down

View running services:

docker compose ps

View backend logs:

docker compose logs backend --tail=100

Clean containers and volumes:

make clean-docker

AWS Deployment Notes

This repository includes Terraform definitions for an AWS deployment architecture.

Terraform covers:

  • S3 bucket
  • Bucket encryption
  • Bucket versioning
  • IAM roles and policies
  • Lambda worker
  • CloudWatch logs
  • S3-to-Lambda event configuration

Important: this repository is AWS-ready, but it should not be assumed to be deployed to AWS unless a live deployment URL is listed here.

Do not run Terraform deployment commands unless you are ready to create real AWS resources and review possible costs.

Package the Lambda worker:

make lambda-package

Preview infrastructure:

make terraform-init
make terraform-plan

Apply infrastructure only after reviewing the plan:

make terraform-apply

Destroy AWS resources when finished:

make terraform-destroy

Local Development vs AWS Deployment

| Area | Local Development | AWS Deployment Architecture |
| --- | --- | --- |
| Frontend | Vite dev server in Docker | Static hosting or container deployment |
| Backend | FastAPI container | API service deployment |
| Database | PostgreSQL Docker container | Managed PostgreSQL or equivalent |
| Object Storage | LocalStack S3 mock | AWS S3 |
| Worker | Local background processor | AWS Lambda-compatible worker |
| Infrastructure | Docker Compose | Terraform |
| Logs | Docker logs | CloudWatch logs |

Environment Variables

The project uses .env.example as a template.

Important variables:

| Variable | Purpose |
| --- | --- |
| APP_ENV | Development or production environment |
| APP_SECRET_KEY | Application secret key |
| DATABASE_URL | SQLAlchemy database URL |
| POSTGRES_USER | PostgreSQL username |
| POSTGRES_PASSWORD | PostgreSQL password |
| POSTGRES_DB | PostgreSQL database name |
| JWT_SECRET_KEY | Secret used for JWT signing |
| AWS_ACCESS_KEY_ID | AWS access key or LocalStack placeholder |
| AWS_SECRET_ACCESS_KEY | AWS secret key or LocalStack placeholder |
| AWS_DEFAULT_REGION | AWS region |
| AWS_S3_BUCKET_NAME | S3 bucket name |
| LAMBDA_FUNCTION_NAME | Lambda worker name |
| USE_LOCAL_WORKER | Enables local background processing |
| VITE_API_BASE_URL | Frontend API base URL |

For local development, the default .env.example values are enough to run the app with Docker Compose.
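A typed settings object is a common way to read variables like these once at startup. The backend's `core/config.py` presumably does this with Pydantic; the sketch below uses only a standard-library dataclass, and the `Settings` fields, the `load_settings` name, and the default values are all assumptions.

```python
import os
from dataclasses import dataclass
from typing import Mapping


@dataclass(frozen=True)
class Settings:
    app_env: str
    database_url: str
    s3_bucket: str
    use_local_worker: bool


def load_settings(env: Mapping[str, str] = os.environ) -> Settings:
    """Read a subset of the variables above; required ones raise KeyError if missing."""
    return Settings(
        app_env=env.get("APP_ENV", "development"),  # assumed default
        database_url=env["DATABASE_URL"],  # required: fail fast if missing
        s3_bucket=env["AWS_S3_BUCKET_NAME"],  # required
        use_local_worker=env.get("USE_LOCAL_WORKER", "false").lower() == "true",
    )
```

Accepting any mapping instead of reading `os.environ` directly makes the loader easy to exercise in tests, for example `load_settings({"DATABASE_URL": "sqlite://", "AWS_S3_BUCKET_NAME": "docs"})`.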


Screenshots

Add screenshots to the screenshots/ folder using the filenames below.

For public screenshots, use demo accounts and fake documents only. Do not include real resumes, private emails, school documents, transcripts, or personal information.

  • Login Page
  • Create Account Page
  • Dashboard
  • Upload Modal
  • Document Detail Page
  • FastAPI Swagger Docs
  • Docker Desktop Containers
  • GitHub Actions CI


Resume Bullet Points

  • Built a full-stack document intelligence platform using React, TypeScript, FastAPI, PostgreSQL, Docker, and S3-compatible object storage to support document upload, asynchronous processing, metadata extraction, and dashboard visualization.
  • Designed a cloud-ready architecture with LocalStack-based S3 emulation, Lambda-compatible document processing, Terraform infrastructure definitions, and GitHub Actions CI/CD.
  • Implemented JWT authentication, document status tracking, file validation, extracted text rendering, and automated backend testing across unit and integration suites.

Future Improvements

  • Amazon Textract — higher-accuracy OCR for scanned PDFs and complex layouts
  • Amazon Bedrock / OpenAI — LLM-powered summarisation and document Q&A
  • Full-text search — PostgreSQL tsvector or pgvector for semantic search
  • SQS queue — decouple API from Lambda for resilience under load
  • Email notifications — SNS or SES alerts when processing completes
  • Batch upload — upload and process multiple files in one request
  • AWS App Runner / ECS Fargate — managed container deployment for the backend
  • Admin panel — user management, system-wide document stats
  • Document sharing — generate shareable links with expiry
  • Webhook support — push processing results to external systems

License

MIT © Kyle Theodore — see LICENSE for details.
