QuantForge AI is an interactive platform for benchmarking, comparing, and analyzing LLM quantization techniques including GPTQ, AWQ, QLoRA, and custom quantization pipelines.
QuantForge AI provides a comprehensive suite of tools for evaluating and comparing different quantization methods on large language models. The platform enables researchers and engineers to:
- Benchmark multiple quantization techniques (GPTQ, AWQ, QLoRA)
- Compare model performance across different bit-widths
- Visualize results with interactive dashboards
- Export benchmark results for analysis
- Implement custom quantization research
Key features:
- GPTQ Benchmarking - Evaluate GPTQ quantization performance
- AWQ Benchmarking - Test AWQ quantization methods
- QLoRA Benchmarking - Assess QLoRA (LoRA fine-tuning on a 4-bit quantized base model)
- Quantization Comparison Dashboard - Interactive comparison of methods
- HuggingFace Model Loader - Seamless integration with HF models
- Interactive Streamlit UI - User-friendly web interface
- Benchmark Visualization - Plotly-based charts and graphs
- Result Export - Export results in JSON format
- Performance Analytics - Detailed performance metrics
- Custom Quantization Research - Implement novel quantization methods
Project structure:
QuantForge-AI/
├── app.py # Streamlit dashboard
├── src/
│ ├── model_loader.py # HuggingFace model integration
│ ├── benchmark.py # Benchmark engine
│ ├── comparison.py # Quantization comparison engine
│ ├── visualization.py # Plotly visualization
│ └── quantforge_quant.py # Custom ALAQ quantization
├── results/ # Benchmark results storage
├── assets/ # Static assets
├── docs/ # Documentation
└── requirements.txt # Python dependencies
Installation:
# Clone the repository
git clone https://github.com/yourusername/QuantForge-AI.git
cd QuantForge-AI
# Install dependencies
pip install -r requirements.txt
# Launch the Streamlit dashboard
streamlit run app.py
The dashboard includes the following tabs:
- Dashboard - Overview of selected model and quantization metrics
- Model Loader - Load and configure models from HuggingFace
- Quantization - Apply quantization methods to loaded models
- Benchmark - Run benchmark tests on quantized models
- Comparison - Compare multiple quantization methods side-by-side
- Reports - View and export benchmark results
The Dashboard tab displays the following metrics:
- Selected Model
- Quantization Method
- VRAM Usage
- Inference Speed
- Accuracy
- Latency
The benchmark engine supports evaluation on multiple datasets:
- WikiText2
- C4
- PTB
- Pile
- Custom datasets
Metrics collected:
- Perplexity (PPL)
- Inference latency
- Memory usage
- Throughput
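The perplexity, latency, and throughput metrics listed above can be illustrated with a minimal, framework-agnostic sketch. It assumes the benchmark already has per-token negative log-likelihoods (in nats) from a model's forward pass, and a hypothetical `generate` callable that runs one generation and returns how many tokens it produced. Neither function is taken from `src/benchmark.py`.

```python
import math
import time
from typing import Callable, Dict, List, Sequence


def perplexity(token_nlls: Sequence[float]) -> float:
    """Perplexity = exp(mean per-token negative log-likelihood), with NLL in nats."""
    if not token_nlls:
        raise ValueError("need at least one token")
    return math.exp(sum(token_nlls) / len(token_nlls))


def time_generation(generate: Callable[[], int], runs: int = 5) -> Dict[str, float]:
    """Time `runs` calls of `generate`, which returns the number of tokens it produced.

    Returns mean wall-clock latency per call (seconds) and throughput (tokens/second).
    """
    latencies: List[float] = []
    total_tokens = 0
    for _ in range(runs):
        start = time.perf_counter()
        total_tokens += generate()
        latencies.append(time.perf_counter() - start)
    total_time = sum(latencies)
    return {
        "latency_s": total_time / runs,
        "throughput_tok_s": total_tokens / total_time if total_time > 0 else float("inf"),
    }
```

A model that spreads probability uniformly over V tokens has an NLL of ln V at every position, so its perplexity is exactly V; `perplexity([math.log(50)] * 10)` returns approximately 50. Memory usage is left out to keep the sketch dependency-free; on a CUDA GPU it is commonly read with `torch.cuda.max_memory_allocated()` after calling `torch.cuda.reset_peak_memory_stats()`.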
Supported methods for comparison:
- GPTQ - Post-training weight quantization using approximate second-order information
- AWQ - Activation-aware weight quantization
- QLoRA - LoRA fine-tuning on top of a frozen 4-bit quantized base model
- ALAQ - Adaptive Layer-Aware Quantization (custom)
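GPTQ and AWQ both store weights as low-bit integers with per-group scales; they differ in how they choose those integers and scales. The sketch below shows only the simplest scheme, symmetric round-to-nearest (RTN) with one scale per group, which both methods are usually compared against. It is not an implementation of GPTQ (which additionally compensates rounding error using second-order information) or AWQ (which rescales salient weight channels based on activation statistics), it assumes symmetric quantization without zero-points, and the function names are hypothetical rather than part of the QuantForge API.

```python
from typing import List, Tuple


def quantize_rtn(weights: List[float], bits: int, group_size: int) -> Tuple[List[int], List[float]]:
    """Symmetric round-to-nearest quantization with one scale per group of weights."""
    qmax = 2 ** (bits - 1) - 1  # 7 for 4-bit, 127 for 8-bit
    q: List[int] = []
    scales: List[float] = []
    for start in range(0, len(weights), group_size):
        group = weights[start:start + group_size]
        max_abs = max(abs(w) for w in group)
        scale = max_abs / qmax if max_abs > 0 else 1.0
        scales.append(scale)
        # Round each weight to the nearest integer level and clamp to [-qmax, qmax].
        q.extend(max(-qmax, min(qmax, round(w / scale))) for w in group)
    return q, scales


def dequantize(q: List[int], scales: List[float], group_size: int) -> List[float]:
    """Recover approximate weights: integer level times its group's scale."""
    return [v * scales[i // group_size] for i, v in enumerate(q)]


def relative_error(weights: List[float], bits: int, group_size: int) -> float:
    """||w - dequant(quant(w))|| / ||w||, a simple proxy for quantization damage."""
    q, scales = quantize_rtn(weights, bits, group_size)
    recon = dequantize(q, scales, group_size)
    num = sum((w - r) ** 2 for w, r in zip(weights, recon))
    den = sum(w * w for w in weights)
    return (num / den) ** 0.5 if den > 0 else 0.0
```

The relative error shrinks as `bits` grows, which is one simple way to gauge how much a given weight tensor suffers from aggressive quantization.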
Comparison metrics:
- Model Size
- VRAM Usage
- Inference Speed
- Latency
- Accuracy
- Perplexity
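The Model Size and VRAM comparisons largely follow from bit-width arithmetic. The sketch below estimates weight storage as parameters × bits / 8 plus per-group scale overhead. The `MethodConfig` fields are hypothetical, and storing scales as FP16 while ignoring zero-points, unquantized embeddings, activations, and the KV cache are simplifying assumptions, so measured VRAM will be higher than these figures.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class MethodConfig:
    """Hypothetical description of one quantization setting."""
    name: str
    bits: float           # average bits per weight
    group_size: int = 0   # weights sharing one scale; 0 means no per-group scales
    scale_bits: int = 16  # assumed: each per-group scale stored as FP16


def weight_bytes(num_params: int, cfg: MethodConfig) -> float:
    """Approximate weight storage in bytes: packed weights plus per-group scales."""
    total_bits = num_params * cfg.bits
    if cfg.group_size > 0:
        total_bits += (num_params / cfg.group_size) * cfg.scale_bits
    return total_bits / 8


def compare_sizes(
    num_params: int, methods: List[MethodConfig], baseline_bits: int = 16
) -> Dict[str, Dict[str, float]]:
    """Size in GB (1e9 bytes) and compression ratio relative to a baseline precision."""
    baseline = num_params * baseline_bits / 8
    results = {}
    for cfg in methods:
        size = weight_bytes(num_params, cfg)
        results[cfg.name] = {"size_gb": size / 1e9, "compression": baseline / size}
    return results
```

For a round 8-billion-parameter model, FP16 weights come to 16 GB, while 4-bit weights with one FP16 scale per 128-weight group come to 8e9 × (4 + 16/128) / 8 bytes = 4.125 GB, about 3.9× smaller.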
Results are stored in JSON format in the results/ directory.
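One possible layout for the results/ directory is one JSON file per benchmark run. The sketch below assumes a result dictionary with at least `model` and `method` keys and a `<model>_<method>_<timestamp>.json` naming scheme; the schema, file naming, and function names are all hypothetical and not taken from QuantForge's code.

```python
import json
from datetime import datetime, timezone
from pathlib import Path
from typing import Any, Dict, List


def save_result(result: Dict[str, Any], results_dir: str = "results") -> Path:
    """Write one result to <results_dir>/<model>_<method>_<UTC timestamp>.json."""
    out_dir = Path(results_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    stamp = datetime.now(timezone.utc).strftime("%Y%m%dT%H%M%S%fZ")
    # HuggingFace model IDs contain '/', which is not allowed in a file name.
    safe_model = result["model"].replace("/", "__")
    path = out_dir / f"{safe_model}_{result['method']}_{stamp}.json"
    path.write_text(json.dumps(result, indent=2, sort_keys=True), encoding="utf-8")
    return path


def load_results(results_dir: str = "results") -> List[Dict[str, Any]]:
    """Load every *.json file in results_dir, ordered by file name."""
    return [
        json.loads(p.read_text(encoding="utf-8"))
        for p in sorted(Path(results_dir).glob("*.json"))
    ]
```

Sorting keys and indenting the output keeps the files diff-friendly if results are committed or shared.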
Supported models:
- Meta-Llama-3-8B
- Mistral-7B
- Gemma-7B
- Additional models can be added via configuration
Example usage of the model loader:
from src.model_loader import load_model, load_tokenizer, list_supported_models
# List available models
models = list_supported_models()
# Load a model
model = load_model("meta-llama/Meta-Llama-3-8B")
# Load tokenizer
tokenizer = load_tokenizer("meta-llama/Meta-Llama-3-8B")
QuantForge AI also includes a custom quantization method, Adaptive Layer-Aware Quantization (ALAQ).
ALAQ dynamically assigns bit-widths based on layer importance:
- Important layers → 8-bit quantization
- Moderately important layers → 6-bit quantization
- Less important layers → 4-bit quantization
Layer importance is determined through sensitivity analysis during calibration.
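The 8/6/4-bit mapping above can be sketched as a rank-based assignment. This is not the ALAQ code in `src/quantforge_quant.py`: it assumes each layer already has a scalar sensitivity score from calibration (for example, the increase in calibration loss when only that layer is quantized to 4 bits), and the 25%/25%/50% split between the three tiers is an illustrative assumption, since the thresholds are not specified here.

```python
from typing import Dict


def assign_bit_widths(
    sensitivity: Dict[str, float],
    high_frac: float = 0.25,  # assumed: top 25% most sensitive layers -> 8-bit
    mid_frac: float = 0.25,   # assumed: next 25% -> 6-bit; all remaining layers -> 4-bit
) -> Dict[str, int]:
    """Map per-layer sensitivity scores to 8/6/4-bit widths by rank."""
    ranked = sorted(sensitivity, key=sensitivity.get, reverse=True)
    n_high = round(len(ranked) * high_frac)
    n_mid = round(len(ranked) * mid_frac)
    bits: Dict[str, int] = {}
    for i, name in enumerate(ranked):
        if i < n_high:
            bits[name] = 8
        elif i < n_high + n_mid:
            bits[name] = 6
        else:
            bits[name] = 4
    return bits


def average_bits(bits: Dict[str, int], num_params: Dict[str, int]) -> float:
    """Parameter-weighted average bit-width of the resulting mixed-precision model."""
    total = sum(num_params[name] for name in bits)
    return sum(bits[name] * num_params[name] for name in bits) / total
```

With four equal-sized layers, this yields one 8-bit, one 6-bit, and two 4-bit layers, an average of 5.5 bits per weight. Ranking rather than fixed score thresholds keeps the share of layers in each tier predictable regardless of how the sensitivity scores are scaled.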
Roadmap:
- Add support for more quantization methods (SpQR, SmoothQuant)
- Expand model support (Llama-2, Falcon, Mixtral)
- Implement distributed benchmarking
- Add API endpoints for programmatic access
- Create Docker container for easy deployment
- Add collaborative benchmark sharing
Author: Anany Tripathi