Turn a local llama.cpp checkout into something you can actually use:
build it, launch models, inspect GPU pressure, recover orphaned servers, and control everything from a browser or terminal.
- You do not have to manually juggle `llama.cpp` builds, `llama-server` launch commands, VRAM checks, and logs across separate scripts.
- The web UI gives you one place to build, launch, monitor, and recover running instances.
- If the backend restarts or crashes, it can re-adopt repo-launched `llama-server` processes instead of losing control of them.
- The browser UI is layered on top of the same local tools, so terminal users and UI users are working against the same repo, binaries, and models.
- One repo checkout.
- One Python environment.
- One backend command to bring up the control plane.
- One token for the browser UI.
- The backend can serve the built frontend directly, so normal users do not need to run a separate frontend dev server.
The main control-plane page shows backend health, fleet status, host load, GPU pressure, and recent activity without making you dig through logs first.
Expandable GPU detail shows compute load, VRAM use, and which managed processes currently own memory on each device.
The Instances page gives you a proper launcher for llama-server and a way to
recover servers that survived a backend restart.
The Builds page wraps autodevops.py with real options, history, command
preview, and logs.
The Benchmarks page runs llama-bench, keeps the command and logs, and stores
parsed throughput results. The capture below is a live run of
unsloth/Qwen3.5-0.8B-GGUF:Q4_K_XL pinned to the RTX 4060.
The Memory page gives you a quick VRAM and RAM view before you launch a model.
The Library page shows local GGUFs and pulls new ones straight from Hugging Face.
The same control plane also works on narrow screens.
This is the simplest browser-first path for a normal user.

    git clone https://github.com/CesarPetrescu/llama-cpp-autodeploy.git
    cd llama-cpp-autodeploy

    python3 -m venv venv
    source venv/bin/activate
    pip install -U pip
    pip install -r requirements.txt

    cd web/frontend
    npm install
    npm run build
    cd ../..

That gives the backend a production frontend to serve at `/`.

    python web_cli.py --init
    python web_cli.py

What happens here:

- `python web_cli.py --init` creates `.web_config.json` and prints a bearer token.
- `python web_cli.py` starts the backend on port `8787` by default.
- If `web/frontend/dist/` exists, the backend also serves the frontend UI.

Open http://localhost:8787.
On first use:

- go to Settings
- paste the bearer token printed during `--init`
- save it once

After that, the app can build `llama.cpp`, launch models, show logs, inspect
GPU pressure, and recover managed instances from the browser.
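
For scripted use, the token can be read back from the config file instead of being copied by hand. This is a minimal sketch, assuming `.web_config.json` stores the token under a top-level `token` key (the same key the screenshot helper in this README reads). `load_token` is a hypothetical helper name, not something the repo is known to ship.

```python
import json
from pathlib import Path


def load_token(config_path=".web_config.json"):
    """Return the bearer token stored in the backend config file."""
    data = json.loads(Path(config_path).read_text(encoding="utf-8"))
    # Assumption: the token lives under the top-level "token" key.
    return data["token"]
```

Calling `load_token()` from the repo root after `python web_cli.py --init` should return the same string that was printed during init, provided that assumption holds.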
If you are editing the UI instead of just using it:

    cd web/frontend
    npm run dev

That starts the Vite frontend at http://localhost:5173 and proxies API
requests to the backend at http://127.0.0.1:8787.
| Page | What it is for |
|---|---|
| Dashboard | See backend health, host CPU/RAM/load, GPU pressure, builds, and fleet state |
| Instances | Create, recover, start, stop, restart, and delete llama-server processes |
| Instance logs | Watch live stdout with pause/resume |
| Memory | Estimate placement and VRAM needs before launch |
| Library | Scan local GGUFs and download new ones from Hugging Face |
| Builds | Run autodevops.py, inspect supported options, and stream logs |
| Benchmarks | Run llama-bench, pin tests to specific GPUs, and keep structured throughput history |
| Settings | Set backend URL and bearer token |
| Layer | Role |
|---|---|
| `autodevops.py` | Build local `llama.cpp` binaries |
| `loadmodel.py` | Launch `llama-server` and reranker processes |
| `memory_utils.py` | Probe VRAM, RAM, and placement estimates |
| `web/backend/` | Auth, state, logs, recovery, and API surface |
| `web/frontend/` | Browser UI for overview, builds, instances, memory, and library |
- Linux with Python 3.10+.
- Build tools for `llama.cpp`: `git`, `cmake`, `make`, `gcc`, `g++`, `pkg-config`.
- NVIDIA drivers and CUDA toolkit if you want CUDA builds or GPU runtime.
- Optional BLAS libraries:
  - Intel MKL for `--blas mkl`
  - OpenBLAS for `--blas openblas`

Python dependencies are in `requirements.txt`.
Interactive build flow:

    python autodevops_cli.py

Non-interactive build flow:

    python autodevops.py --help
    python autodevops.py --ref latest --now

Supported build flags:
| Flag | Meaning |
|---|---|
| `--ref <tag | branch |
| `--now` | Build immediately instead of waiting for the scheduled path |
| `--fast-math` | Pass fast-math CUDA flags to NVCC |
| `--force-mmq {auto,on,off}` | Control MMQ CUDA kernels |
| `--blas {auto,openblas,mkl,off}` | Choose the CPU BLAS backend |
| `--distributed` | Build GGML RPC support |
| `--cpu-only` | Skip NVIDIA driver prechecks |
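
When builds are driven from another script, it can help to assemble the command from named options rather than concatenating strings. The sketch below is illustrative only: `build_autodevops_cmd` is a hypothetical helper, it uses just the flags listed in the table, and it does not know which flag combinations `autodevops.py` actually accepts together.

```python
import sys

FORCE_MMQ_CHOICES = {"auto", "on", "off"}
BLAS_CHOICES = {"auto", "openblas", "mkl", "off"}


def build_autodevops_cmd(ref="latest", now=True, fast_math=False,
                         force_mmq=None, blas=None,
                         distributed=False, cpu_only=False):
    """Assemble an argv list for autodevops.py from the flags in the table."""
    cmd = [sys.executable, "autodevops.py", "--ref", ref]
    if now:
        cmd.append("--now")
    if fast_math:
        cmd.append("--fast-math")
    if force_mmq is not None:
        if force_mmq not in FORCE_MMQ_CHOICES:
            raise ValueError(f"invalid --force-mmq value: {force_mmq!r}")
        cmd += ["--force-mmq", force_mmq]
    if blas is not None:
        if blas not in BLAS_CHOICES:
            raise ValueError(f"invalid --blas value: {blas!r}")
        cmd += ["--blas", blas]
    if distributed:
        cmd.append("--distributed")
    if cpu_only:
        cmd.append("--cpu-only")
    return cmd
```

The returned list can be passed to `subprocess.run(cmd, check=True)` from the repo root. Keeping it as a list avoids shell quoting problems with unusual ref names.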
Interactive launcher:

    python loadmodel_cli.py

Unified launcher:

    python loadmodel.py --help

`loadmodel.py` supports three mutually exclusive modes:
| Mode | Result |
|---|---|
| `--llm` | Start `./bin/llama-server` for completion/chat |
| `--embed` | Start `./bin/llama-server` for embeddings |
| `--rerank` | Start the Transformers reranker HTTP service |
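
Mutually exclusive modes of this kind are commonly expressed with an `argparse` exclusive group. The following is a sketch of that pattern, not the parser `loadmodel.py` actually uses. The `make_parser` name and the `MODEL` metavar are assumptions for illustration.

```python
import argparse


def make_parser():
    """Build a parser that accepts exactly one of --llm, --embed, --rerank."""
    parser = argparse.ArgumentParser(prog="loadmodel-sketch")
    mode = parser.add_mutually_exclusive_group(required=True)
    mode.add_argument("--llm", metavar="MODEL")
    mode.add_argument("--embed", metavar="MODEL")
    mode.add_argument("--rerank", metavar="MODEL")
    parser.add_argument("--host")
    parser.add_argument("--port", type=int)
    return parser


def selected_mode(args):
    """Return (mode_name, model) for whichever mode flag was given."""
    for name in ("llm", "embed", "rerank"):
        value = getattr(args, name)
        if value is not None:
            return name, value
    raise ValueError("no mode selected")
```

With this layout, passing two mode flags at once makes `argparse` exit with a usage error before any server is started.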
Examples:

    # LLM (local GGUF)
    python loadmodel.py --llm ./models/model.gguf --port 45540

    # Embeddings (download GGUF from HF repo, auto-select quant/file)
    python loadmodel.py --embed Qwen/Qwen3-Embedding-8B-GGUF:Q8_0 --port 45541

    # Reranker HTTP server
    python loadmodel.py --rerank Qwen/Qwen3-Reranker-8B --host 127.0.0.1 --port 45542

For MoE-capable `llama-server` builds, `loadmodel.py` also accepts:

- `--cpu-moe`
- `--n-cpu-moe <N>`

If the local `llama-server` binary does not expose these flags,
`loadmodel.py` exits with a rebuild hint.
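
One way to tell whether a binary exposes a flag is to search its `--help` output for the option name. The sketch below shows that idea under stated assumptions: `help_mentions_flag` and `binary_help` are hypothetical helpers, and this is not necessarily how `loadmodel.py` performs its own check.

```python
import re
import subprocess


def help_mentions_flag(help_text, flag):
    """True if `flag` appears as a whole option name in the help text."""
    # The lookarounds stop "--cpu-moe" from matching inside a longer option name.
    pattern = r"(?<![\w-])" + re.escape(flag) + r"(?![\w-])"
    return re.search(pattern, help_text) is not None


def binary_help(binary="./bin/llama-server"):
    """Capture the binary's --help output (stdout and stderr combined)."""
    result = subprocess.run([binary, "--help"], capture_output=True,
                            text=True, check=False)
    return result.stdout + result.stderr
```

A caller could run `help_mentions_flag(binary_help(), "--n-cpu-moe")` and print a rebuild hint when it returns `False`.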
For recent `llama.cpp` builds with MTP support, both the web Instances page and
the terminal launchers expose structured speculative decoding fields. The
minimal launch form is:

    python loadmodel.py --llm ./models/Qwen3.6-35B-A3B-UD-IQ1_M.gguf \
      --spec-type draft-mtp

When the draft/MTP GGUF is a separate file, provide it explicitly:

    python loadmodel.py --llm ./models/main.gguf \
      --spec-type draft-mtp \
      --spec-draft-model ./models/mtp.gguf

Or let `llama.cpp` resolve the draft model from Hugging Face directly:

    python loadmodel.py --llm ./models/main.gguf \
      --spec-type draft-mtp \
      --spec-draft-hf unsloth/Qwen3.6-35B-A3B-MTP-GGUF:UD-IQ1_M

The app deliberately does not guess or download a sibling MTP file. Use the
Library page, `--spec-draft-model <path-or-hf-spec>`, or
`--spec-draft-hf <repo[:quant]>` to point at the draft GGUF you want. Common
tuning fields such as `--spec-draft-n-max`,
`--spec-draft-p-min`, `--spec-draft-ngl`, draft KV cache types, draft CPU/MoE
placement, and draft thread affinity are exposed directly. Leaving them blank
preserves the upstream `llama.cpp` defaults.
MTP is guarded against unsupported launches: it is only accepted for LLM mode,
cannot be combined with --mmproj, and refuses duplicate speculative flags in
--extra.
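
The three guards above can be pictured as a small pre-launch check. This is a hypothetical sketch, not the repo's implementation: the `check_mtp_launch` name, the mode strings, and the rule that any `--spec-` prefixed token in the extras counts as a duplicate are all assumptions.

```python
def check_mtp_launch(mode, mmproj=None, extra=()):
    """Return the reasons an MTP launch should be refused (empty list if OK)."""
    problems = []
    if mode != "llm":
        problems.append("MTP is only accepted in LLM mode")
    if mmproj:
        problems.append("MTP cannot be combined with --mmproj")
    # Assumption: speculative options all start with "--spec-", so any such
    # token in the free-form extras would duplicate a structured field.
    dupes = [tok for tok in extra if tok.split("=", 1)[0].startswith("--spec-")]
    if dupes:
        problems.append("duplicate speculative flags in --extra: " + ", ".join(dupes))
    return problems
```

Returning a list of problems rather than raising on the first one lets a UI show every conflict at once.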
Validated MTP smoke test, 2026-05-30:

    ./bin/llama-server \
      --model models/Qwen3.6-35B-A3B-UD-IQ1_M.gguf \
      --host 127.0.0.1 --port 45650 \
      --ctx-size 1024 --parallel 1 \
      --n-gpu-layers 999 \
      --spec-type draft-mtp \
      --spec-draft-n-max 2 \
      --flash-attn on

The test used `llama.cpp` commit `764f1e6` and the
`unsloth/Qwen3.6-35B-A3B-MTP-GGUF` file
`Qwen3.6-35B-A3B-UD-IQ1_M.gguf` downloaded to `models/`. The verified file
size was 11366414624 bytes and SHA256 was
`223db8b347ebe3d75f43a0fd998eb86148c9744635066c6782662208ddc14867`.
Successful startup logs included `adding speculative implementation 'draft-mtp'`
and `speculative decoding context initialized`; `/health`
returned `{"status":"ok"}`, and a small `/completion` request reported draft
timings with `draft_n=6` and `draft_n_accepted=3`.
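
As a quick reading of those numbers, the share of drafted tokens that were accepted is `draft_n_accepted / draft_n`. The helper below is only a worked calculation on the two reported values. Its name is hypothetical, and it does not parse the `/completion` response.

```python
def draft_acceptance_rate(draft_n, draft_n_accepted):
    """Fraction of drafted tokens that the main model accepted."""
    if draft_n <= 0:
        raise ValueError("draft_n must be positive")
    return draft_n_accepted / draft_n


# Values reported by the smoke test above.
rate = draft_acceptance_rate(6, 3)
print(f"{rate:.0%}")  # 50%
```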
If a resumed Hugging Face download has the same size but fails GGUF parsing with
GGML_ASSERT(!key.empty()), delete the partial artifact and force a clean
download. A clean file must match the SHA256 above before using it for runtime
tests.
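
The size and checksum comparison can be scripted with the standard library. A minimal sketch follows. The expected values are the ones quoted above, while `sha256_of` and `verify_file` are hypothetical helper names.

```python
import hashlib
from pathlib import Path

EXPECTED_SIZE = 11366414624
EXPECTED_SHA256 = "223db8b347ebe3d75f43a0fd998eb86148c9744635066c6782662208ddc14867"


def sha256_of(path, chunk_size=1 << 20):
    """Hash a file in chunks so multi-gigabyte GGUFs do not need to fit in RAM."""
    digest = hashlib.sha256()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_file(path, expected_size=EXPECTED_SIZE, expected_sha256=EXPECTED_SHA256):
    """True only if both the byte size and the SHA256 digest match."""
    path = Path(path)
    if path.stat().st_size != expected_size:
        return False  # cheap check first; skips hashing an obviously wrong file
    return sha256_of(path) == expected_sha256.lower()
```

For example, `verify_file("models/Qwen3.6-35B-A3B-UD-IQ1_M.gguf")` returning `False` is the signal to delete the file and download it again.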
Interactive distributed launcher:

    python loadmodel_dist_cli.py

This flow can:

- scan private subnets for RPC workers
- manage the worker host list
- optionally start a local `rpc-server`
- launch `llama-cli` with `--rpc` workers

Standalone `rpc-server` helper:

    python rpc_server_cli.py --help
    python rpc_server_cli.py --host 0.0.0.0 --port 5515 --devices 0

`rpc_server_cli.py` requires `./bin/rpc-server` to exist.
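
To give a feel for what scanning a private subnet involves, the sketch below enumerates hosts in a private range and probes a TCP port. It is an illustration under assumptions, not the scanner in `loadmodel_dist_cli.py`: the helper names are hypothetical, port `5515` is simply the one used in the example above, and an open port only shows that something is listening, not that it is an `rpc-server`.

```python
import ipaddress
import socket


def private_hosts(cidr):
    """List host addresses in a subnet, refusing non-private ranges."""
    network = ipaddress.ip_network(cidr, strict=False)
    if not network.is_private:
        raise ValueError(f"{cidr} is not a private network")
    return [str(host) for host in network.hosts()]


def port_open(host, port=5515, timeout=0.3):
    """True if a TCP connection to host:port succeeds within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


def find_candidates(cidr, port=5515):
    """Return hosts in the subnet that accept connections on the given port."""
    return [host for host in private_hosts(cidr) if port_open(host, port)]
```

A sequential scan like this is slow on large subnets, so a real scanner would likely probe hosts concurrently.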
Backend startup:

    python web_cli.py --init
    python web_cli.py

The backend:

- binds to `0.0.0.0` by default
- requires a bearer token on every request except `GET /api/health`
- persists managed instances, builds, and benchmark runs in `.web_state.json`
- tees logs to `web/logs/<id>.log`
- can re-adopt orphaned repo-launched `llama-server` processes on startup
- can force that same recovery flow through `POST /api/instances/recover`
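
Re-adopting a process generally starts with two questions: is the recorded PID still alive, and is it still the program that was launched? The sketch below shows one Linux-only way to answer both. It is not the backend's recovery code. The helper names are hypothetical, and the layout of `.web_state.json` is deliberately not assumed.

```python
import os


def pid_alive(pid):
    """True if a process with this PID currently exists."""
    try:
        os.kill(pid, 0)  # signal 0 performs the existence check without signalling
    except ProcessLookupError:
        return False
    except PermissionError:
        return True  # the process exists but belongs to another user
    return True


def cmdline_of(pid):
    """Read a process's argv from /proc (Linux only); empty list if unreadable."""
    try:
        with open(f"/proc/{pid}/cmdline", "rb") as fh:
            raw = fh.read()
    except OSError:
        return []
    return [part.decode(errors="replace") for part in raw.split(b"\0") if part]


def looks_like_llama_server(pid):
    """True if the process's executable name is llama-server."""
    argv = cmdline_of(pid)
    return bool(argv) and os.path.basename(argv[0]) == "llama-server"
```

Checking the command line as well as the PID guards against PID reuse, where an unrelated process has taken over a number recorded before the restart.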

API surface

- Health: `GET /api/health`
- Memory: `GET /api/memory/gpus`, `POST /api/memory/plan`, `POST /api/memory/auto-split`
- Models: `GET /api/models/local`, `GET /api/models/binary-caps`, `POST /api/models/download`
- Instances: `GET/POST /api/instances`, `GET /api/instances/{id}`, `POST /api/instances/{id}/start|stop|restart`, `DELETE /api/instances/{id}`, `POST /api/instances/recover`, `WS /api/instances/{id}/logs?token=...`
- Builds: `GET/POST /api/builds`, `GET /api/builds/{id}`, `POST /api/builds/{id}/stop`, `WS /api/builds/{id}/logs?token=...`
- Benchmarks: `GET/POST /api/benchmarks`, `GET /api/benchmarks/{id}`, `POST /api/benchmarks/{id}/stop`, `WS /api/benchmarks/{id}/logs?token=...`

Full schema: `GET /docs`
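
These endpoints can be called from a script with only the standard library. The sketch below builds requests for the paths listed above. It assumes the token is sent as a standard `Authorization: Bearer <token>` header and that responses are JSON. `api_request` and `send` are hypothetical helper names.

```python
import json
import urllib.request


def api_request(path, token=None, method="GET",
                base_url="http://127.0.0.1:8787", payload=None):
    """Build (but do not send) a request for a backend endpoint."""
    data = None if payload is None else json.dumps(payload).encode("utf-8")
    req = urllib.request.Request(base_url.rstrip("/") + path,
                                 data=data, method=method)
    if data is not None:
        req.add_header("Content-Type", "application/json")
    if token:
        # Assumption: the backend expects "Authorization: Bearer <token>".
        req.add_header("Authorization", f"Bearer {token}")
    return req


def send(req, timeout=5):
    """Send a prepared request and decode the body as JSON (assumed format)."""
    with urllib.request.urlopen(req, timeout=timeout) as resp:
        return json.loads(resp.read().decode("utf-8"))
```

For example, `send(api_request("/api/health"))` needs no token, while `send(api_request("/api/instances/recover", token=my_token, method="POST"))` would trigger the recovery flow.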
- The bearer token is the only built-in auth layer.
- Keep `.web_config.json` readable only by you.
- Prefer binding to `127.0.0.1` when you do not need remote access.
- WebSocket endpoints use `?token=` because browsers cannot attach `Authorization` headers during the upgrade request.
- If you expose the backend beyond a trusted LAN, put it behind HTTPS.
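
Because the token travels in the query string for WebSocket log streams, it is worth building those URLs carefully so the token is encoded and the scheme follows the backend's. A sketch, with `log_stream_url` as a hypothetical helper and the path shape taken from the endpoint list in this README:

```python
from urllib.parse import quote, urlencode, urlsplit, urlunsplit

LOG_KINDS = {"instances", "builds", "benchmarks"}


def log_stream_url(base_url, kind, item_id, token):
    """Build the WebSocket URL for a log stream, with the token as ?token=."""
    if kind not in LOG_KINDS:
        raise ValueError(f"unknown log kind: {kind!r}")
    parts = urlsplit(base_url)
    # An HTTPS backend implies a secure WebSocket; plain HTTP implies ws://.
    scheme = "wss" if parts.scheme == "https" else "ws"
    path = f"/api/{kind}/{quote(str(item_id), safe='')}/logs"
    return urlunsplit((scheme, parts.netloc, path, urlencode({"token": token}), ""))
```

Since the token ends up in a URL, it can also land in proxy or access logs, which is one more reason to prefer HTTPS when the backend is reachable from other machines.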

Refresh screenshot assets

    cd web/frontend
    npx playwright install chromium
    WEB_BEARER_TOKEN="$(python - <<'PY'
    import json
    print(json.load(open('../../.web_config.json', 'r', encoding='utf-8'))['token'])
    PY
    )" npm run screenshots:readme

`./start` uses `./venv/bin/python` and offers a small menu:

    ./start
    ./start autodevops
    ./start loadmodel
    ./start web [--init]
    ./start --help

Run unit tests:

    python -m unittest discover -s tests

Current tests cover:
- CUDA home resolution behavior in `autodevops.py`
- option and config assembly helpers in `autodevops_cli.py`

`run/` currently includes:

- `run_qwen30b_llm.sh`
- `run_qwen_embed8b.sh`
- `run_qwen_reranker8b.sh`

These are example launchers for fixed ports and model targets.









