DeepSeek-R1 7B INT4 at 69.3 tok/s on a $300 RTX 3060. Faster than llama.cpp, vLLM, and NVIDIA TensorRT-LLM. Is one developer plus AI really better than the entire industry?
inference-engine cachyos local-llm speculative-decoding deepseek-r1 cuda-optimization rtx-3060 w4a16
Updated May 19, 2026 - Python