Production-grade fine-tuning platform for large language models from 2B to 70B parameters. Implements QLoRA (4-bit NF4 quantization + LoRA adapters) to make fine-tuning accessible without enterprise-grade hardware, with a guided 5-step pipeline and live training dashboard.
- The Problem
- Key Result
- Engineering Design
- Architecture
- Experimental Setup
- Results
- LoRA vs Full Fine-Tune
- Features
- Installation
- Usage
- The 5-Step Pipeline
- Repository Structure
- Related Work
- Citation
Fine-tuning a 70B LLM with standard full fine-tuning requires updating every one of its parameters on each backward pass. On consumer or mid-range hardware, this is intractable. The VRAM requirements alone rule out most practitioners.
The alternative is not to give up on fine-tuning. It is to ask which parameters actually matter.
QLoRA answers that question. AutoLLM Forge operationalizes the answer into a platform.
LoRA fine-tuning on GPT-2 with 0.24% of parameters (295K / 124M) matched full fine-tuning on generalization - perplexity 16.13 vs 20.41 - while running 3.5x faster and using 421x fewer trainable parameters.
Full fine-tuning overfit on the small dataset (loss dropped to 0.80) while LoRA's low-rank constraint acted as a regularizer, achieving better held-out perplexity despite higher training loss. This is the tradeoff that makes LoRA the default for VRAM-constrained environments.
The core insight: efficient fine-tuning is not about compute. It is about which parameters actually matter. QLoRA freezes base model weights and trains only low-rank adapter matrices, reducing the trainable parameter count by orders of magnitude. On GPU, 4-bit NF4 quantization additionally reduces the base model's memory footprint by ~75% compared to FP16 full fine-tuning.
Base Model (frozen)
|
v
+-----------------+
| Quantization | 4-bit NF4 quantization (bitsandbytes)
| (QLoRA) | ~75% VRAM reduction vs FP16 full fine-tune (GPU)
+--------+--------+
|
v
+-----------------+
| LoRA Adapters | Low-rank matrices injected into attention layers
| (PEFT) | Only adapters are trained, base weights frozen
+--------+--------+
|
v
+-----------------+
| Training Loop | Real-time loss, VRAM, throughput monitoring
| (PyTorch) | WebSocket streaming to Next.js dashboard
+--------+--------+
|
v
+-----------------+
| Artifact Export| Merged model, LoRA weights, config templates
| | Ready for inference deployment
+-----------------+
| Layer | Technology |
|---|---|
| Fine-Tuning | QLoRA, PEFT, bitsandbytes |
| Model Framework | PyTorch, HuggingFace Transformers |
| Backend | FastAPI, WebSockets |
| Frontend | Next.js, TypeScript |
| Streaming | Server-Sent Events |
Comparison: LoRA vs Full Fine-Tune on GPT-2 (124M parameters), 10 prompt-engineering samples. Both runs used identical settings. The only variable was use_lora.
| Parameter | Value |
|---|---|
| Model | GPT-2 (124,440,576 parameters) |
| Dataset | 10 prompt-engineering samples |
| Epochs | 3 |
| Learning Rate | 5e-5 (cosine scheduler, 5 warmup steps) |
| Batch Size | 1 (gradient accumulation: 2) |
| Seed | 42 |
| LoRA Config | r=8, alpha=16, target modules: c_attn |
| Hardware | CPU (torch 2.11.0+cpu) |
| Metric | Value |
|---|---|
| Final Train Loss | 8.5132 |
| Perplexity | 16.127 |
| Training Time | 101.8 seconds |
| Trainable Parameters | 294,912 / 124,440,576 (0.24%) |
| Samples/sec | 0.295 |
| Steps/sec | 0.147 |
| Total Steps | 15 (3 epochs x 5 steps/epoch) |
| Step | Loss | Learning Rate |
|---|---|---|
| 1 | 8.3376 | 0.0 (warmup) |
| 2 | 8.3794 | 1e-5 |
| 5 | 8.5533 | 4e-5 (peak warmup) |
| 6 | 8.4633 | 5e-5 (peak LR) |
| 10 | 8.0822 | 3.27e-5 |
| 15 | 8.5937 | 1.22e-6 (end) |
Each training run produces:
storage/outputs/{job_id}/
+-- final_model/ # LoRA adapter weights
+-- training_metrics.json # Loss, runtime, throughput
+-- model_card.json # Config, dataset stats, evaluation
storage/experiments/{job_id}/
+-- loss.png # Training loss graph
+-- metrics.jsonl # Per-step metrics log
+-- metadata.json # Ablations, environment
Both runs: seed 42, 3 epochs, LR 5e-5, cosine scheduler. Only use_lora differed.
| Metric | LoRA (0.24% params) | Full Fine-Tune (100% params) | Delta |
|---|---|---|---|
| Training Time | 101.8s | 355.0s | 3.5x faster |
| Final Loss | 8.513 | 2.957 | n/a |
| Perplexity | 16.13 | 20.41 | 21% better |
| Samples/sec | 0.295 | 0.085 | 3.5x faster |
| Total FLOPs | 7.87T | 7.84T | ~identical |
| Trainable Params | 295K | 124M | 421x fewer |
| Grad Norm (start) | 2.28 | 180.53 | 79x smaller |
| Grad Norm (end) | 2.17 | 4.44 | 2x smaller |
| Step | LoRA Loss | Full FT Loss |
|---|---|---|
| 1 | 8.338 | 8.366 |
| 3 | 8.633 | 7.493 |
| 5 | 8.553 | 3.609 |
| 7 | 9.156 | 1.178 |
| 10 | 8.082 | 1.104 |
| 15 | 8.594 | 0.795 |
Speed. LoRA ran 3.5x faster because only 0.24% of parameters needed gradients. On GPU this gap narrows but remains significant.
Loss vs generalization. Full fine-tune achieved lower training loss (2.96 vs 8.51) by memorizing the 10-sample dataset - loss dropped to 0.80 by step 15. LoRA's low-rank constraint prevented memorization. Held-out perplexity penalizes overfitting, which is why LoRA wins on the metric that matters.
Gradient stability. LoRA gradients stayed in a healthy range (1.9-2.6) throughout. Full fine-tune started with a gradient norm of 180+ before settling. LoRA is more numerically stable and requires less careful LR tuning in practice.
Memory. 421x fewer trainable parameters means LoRA fits on hardware where full fine-tuning is not an option. The 295K adapter parameters would train on a free-tier T4 GPU.
Bottom line. On small datasets, LoRA is faster, more stable, and generalizes better. On larger datasets (1000+ samples), full fine-tuning closes the perplexity gap, but LoRA remains the default choice for VRAM-constrained environments.
| Consideration | Winner |
|---|---|
| Speed / Efficiency | LoRA (3.5x faster, 421x fewer params) |
| Final training loss | Full FT (2.96 vs 8.51) |
| Generalization | LoRA (better perplexity on small data) |
| Gradient stability | LoRA (2.28 vs 180.53 initial grad norm) |
| Memory footprint | LoRA (295K vs 124M trainable params) |
| Feature | Description |
|---|---|
| QLoRA Fine-Tuning | 4-bit NF4 quantization with LoRA adapters via PEFT |
| Smart Hyperparameters | Automated defaults based on model size and dataset |
| Real-Time Monitoring | Live loss curves, VRAM usage, and throughput via WebSockets |
| Dataset Validation | Automated format checking and preprocessing |
| Artifact Export | Merged model weights, LoRA adapters, and inference configs |
| Model Browser | Search and load 1000+ HuggingFace models directly |
| Multi-Format Support | Instruction tuning, completion, and chat template formats |
git clone https://github.com/royxlead/autollmforge-python.git
cd autollmforge-python
# Backend
python -m venv .venv
source .venv/bin/activate
pip install -r requirements.txt
# Frontend
cd frontend && npm installCore dependencies: PyTorch · HuggingFace Transformers · PEFT · bitsandbytes · FastAPI · Next.js
# Terminal 1 - backend
python run.py
# Terminal 2 - frontend
cd frontend && npm run devOpen http://localhost:3000 and launch the workspace. The platform loads any HuggingFace-compatible model, validates your dataset, and walks through the 5-step pipeline with live monitoring.
01 Inspect - Select and analyze your base model. View parameter count, architecture, and estimated VRAM requirements before committing to a run.
02 Prepare - Upload and validate your training dataset. Automated format detection handles instruction tuning, completion, and chat template formats.
03 Optimize - Configure QLoRA parameters with smart defaults derived from model size and dataset. Override any hyperparameter manually.
04 Train - Real-time dashboard with live loss curves, VRAM profiling, and throughput metrics streamed via WebSockets.
05 Ship - Export merged model weights, standalone LoRA adapters, and inference code templates ready for deployment.
autollmforge-python/
|
+-- run.py # Entry point
+-- backend/ # FastAPI server, training loop, QLoRA logic
| +-- main.py # API endpoints
| +-- services/ # Training, model analysis, hyperparameter optimization
| +-- models/schemas.py # Pydantic request/response types
| +-- utils/ # Compute estimation, HF utilities, logging
+-- frontend/ # Next.js dashboard, WebSocket client
+-- storage/
| +-- outputs/ # Per-job model artifacts and metrics
| +-- experiments/ # Loss plots and per-step logs
+-- requirements.txt
+-- LICENSE
- Auto-Researcher - Multi-agent academic research system
- CURA - RAG-based medical QA
- Self-Diagnosing Neural Models - Uncertainty estimation for model outputs
- DriftWatch - Production drift monitoring for fine-tuned models
AutoLLM Forge sits at the beginning of this pipeline: fine-tune carefully with QLoRA, then quantify uncertainty in deployment via Self-Diagnosing Neural Models and monitor for distribution shift via DriftWatch.
@software{roy2025autollmforge,
author = {Roy, Sourav},
title = {AutoLLM Forge: Efficient LLM Fine-Tuning Platform with QLoRA},
year = {2025},
url = {https://github.com/royxlead/autollmforge-python}
}Built by Sourav Roy · Founding AI/ML Engineer · Yuga AI