llama-cpp-autodeploy

Turn a local llama.cpp checkout into something you can actually use: build it, launch models, inspect GPU pressure, recover orphaned servers, and control everything from a browser or terminal.

Why | Screenshots | Start Here | UI | Technical Reference

Why this is useful

You do not have to manually juggle llama.cpp builds, llama-server launch commands, VRAM checks, and logs across separate scripts.
The web UI gives you one place to build, launch, monitor, and recover running instances.
If the backend restarts or crashes, it can re-adopt repo-launched llama-server processes instead of losing control of them.
The browser UI is layered on top of the same local tools, so terminal users and UI users are working against the same repo, binaries, and models.

Why it is easy to use

One repo checkout.
One Python environment.
One backend command to bring up the control plane.
One token for the browser UI.
The backend can serve the built frontend directly, so normal users do not need to run a separate frontend dev server.

See the app

Overview

The main control-plane page shows backend health, fleet status, host load, GPU pressure, and recent activity without making you dig through logs first.

GPU Runtime Detail

Expandable GPU detail shows compute load, VRAM use, and which managed processes currently own memory on each device.

Instances

The Instances page gives you a proper launcher for llama-server and a way to recover servers that survived a backend restart.

Builds

The Builds page wraps autodevops.py with real options, history, command preview, and logs.

Benchmarks

The Benchmarks page runs llama-bench, keeps the command and logs, and stores parsed throughput results. The capture below is a live run of unsloth/Qwen3.5-0.8B-GGUF:Q4_K_XL pinned to the RTX 4060.

Memory Planning

The Memory page gives you a quick VRAM and RAM view before you launch a model.

Model Library

The Library page shows local GGUFs and pulls new ones straight from Hugging Face.

Mobile

The same control plane also works on narrow screens.

Start here

This is the simplest browser-first path for a normal user.

1. Clone the repo

git clone https://github.com/CesarPetrescu/llama-cpp-autodeploy.git
cd llama-cpp-autodeploy

2. Install the backend requirements

python3 -m venv venv
source venv/bin/activate
pip install -U pip
pip install -r requirements.txt

3. Build the frontend once

cd web/frontend
npm install
npm run build
cd ../..

That gives the backend a production frontend to serve at /.

4. Initialize and start the backend

python web_cli.py --init
python web_cli.py

What happens here:

python web_cli.py --init creates .web_config.json and prints a bearer token.
python web_cli.py starts the backend on port 8787 by default.
If web/frontend/dist/ exists, the backend also serves the frontend UI.

5. Open the app

Open http://localhost:8787.

On first use:

go to Settings
paste the bearer token printed during --init
save it once

After that, the app can build llama.cpp, launch models, show logs, inspect GPU pressure, and recover managed instances from the browser.

6. Frontend development mode (optional)

If you are editing the UI instead of just using it:

cd web/frontend
npm run dev

That starts the Vite frontend at http://localhost:5173 and proxies API requests to the backend at http://127.0.0.1:8787.

What you can do in the UI

Page	What it is for
Dashboard	See backend health, host CPU/RAM/load, GPU pressure, builds, and fleet state
Instances	Create, recover, start, stop, restart, and delete `llama-server` processes
Instance logs	Watch live stdout with pause/resume
Memory	Estimate placement and VRAM needs before launch
Library	Scan local GGUFs and download new ones from Hugging Face
Builds	Run `autodevops.py`, inspect supported options, and stream logs
Benchmarks	Run `llama-bench`, pin tests to specific GPUs, and keep structured throughput history
Settings	Set backend URL and bearer token

How the app fits together

Layer	Role
`autodevops.py`	Build local `llama.cpp` binaries
`loadmodel.py`	Launch `llama-server` and reranker processes
`memory_utils.py`	Probe VRAM, RAM, and placement estimates
`web/backend/`	Auth, state, logs, recovery, and API surface
`web/frontend/`	Browser UI for overview, builds, instances, memory, and library

Technical reference

Requirements

Linux with Python 3.10+.
Build tools for llama.cpp: git, cmake, make, gcc, g++, pkg-config.
NVIDIA drivers and CUDA toolkit if you want CUDA builds or GPU runtime.
Optional BLAS libraries:
- Intel MKL for --blas mkl
- OpenBLAS for --blas openblas

Python dependencies are in requirements.txt.

Build llama.cpp

Interactive build flow:

python autodevops_cli.py

Non-interactive build flow:

python autodevops.py --help
python autodevops.py --ref latest --now

Supported build flags:

Flag	Meaning
`--ref <tag	branch
`--now`	Build immediately instead of waiting for the scheduled path
`--fast-math`	Pass fast-math CUDA flags to NVCC
`--force-mmq {auto,on,off}`	Control MMQ CUDA kernels
`--blas {auto,openblas,mkl,off}`	Choose the CPU BLAS backend
`--distributed`	Build GGML RPC support
`--cpu-only`	Skip NVIDIA driver prechecks

Launch services

Interactive launcher:

python loadmodel_cli.py

Unified launcher:

python loadmodel.py --help

loadmodel.py supports three mutually exclusive modes:

Mode	Result
`--llm`	Start `./bin/llama-server` for completion/chat
`--embed`	Start `./bin/llama-server` for embeddings
`--rerank`	Start the Transformers reranker HTTP service

Examples:

# LLM (local GGUF)
python loadmodel.py --llm ./models/model.gguf --port 45540

# Embeddings (download GGUF from HF repo, auto-select quant/file)
python loadmodel.py --embed Qwen/Qwen3-Embedding-8B-GGUF:Q8_0 --port 45541

# Reranker HTTP server
python loadmodel.py --rerank Qwen/Qwen3-Reranker-8B --host 127.0.0.1 --port 45542

For MoE-capable llama-server builds, loadmodel.py also accepts:

--cpu-moe
--n-cpu-moe <N>

If the local llama-server binary does not expose these flags, loadmodel.py exits with a rebuild hint.

For recent llama.cpp builds with MTP support, both the web Instances page and the terminal launchers expose structured speculative decoding fields. The minimal launch form is:

python loadmodel.py --llm ./models/Qwen3.6-35B-A3B-UD-IQ1_M.gguf \
  --spec-type draft-mtp

When the draft/MTP GGUF is a separate file, provide it explicitly:

python loadmodel.py --llm ./models/main.gguf \
  --spec-type draft-mtp \
  --spec-draft-model ./models/mtp.gguf

Or let llama.cpp resolve the draft model from Hugging Face directly:

python loadmodel.py --llm ./models/main.gguf \
  --spec-type draft-mtp \
  --spec-draft-hf unsloth/Qwen3.6-35B-A3B-MTP-GGUF:UD-IQ1_M

The app deliberately does not guess or download a sibling MTP file. Use the Library page, --spec-draft-model <path-or-hf-spec>, or --spec-draft-hf <repo[:quant]> to point at the draft GGUF you want. Common tuning fields such as --spec-draft-n-max, --spec-draft-p-min, --spec-draft-ngl, draft KV cache types, draft CPU/MoE placement, and draft thread affinity are exposed directly. Leaving them blank preserves the upstream llama.cpp defaults.

MTP is guarded against unsupported launches: it is only accepted for LLM mode, cannot be combined with --mmproj, and refuses duplicate speculative flags in --extra.

Validated MTP smoke test, 2026-05-30:

./bin/llama-server \
  --model models/Qwen3.6-35B-A3B-UD-IQ1_M.gguf \
  --host 127.0.0.1 --port 45650 \
  --ctx-size 1024 --parallel 1 \
  --n-gpu-layers 999 \
  --spec-type draft-mtp \
  --spec-draft-n-max 2 \
  --flash-attn on

The test used llama.cpp commit 764f1e6 and unsloth/Qwen3.6-35B-A3B-MTP-GGUF file Qwen3.6-35B-A3B-UD-IQ1_M.gguf downloaded to models/. The verified file size was 11366414624 bytes and SHA256 was 223db8b347ebe3d75f43a0fd998eb86148c9744635066c6782662208ddc14867. Successful startup logs included adding speculative implementation 'draft-mtp' and speculative decoding context initialized; /health returned {"status":"ok"}, and a small /completion request reported draft timings with draft_n=6 and draft_n_accepted=3.

If a resumed Hugging Face download has the same size but fails GGUF parsing with GGML_ASSERT(!key.empty()), delete the partial artifact and force a clean download. A clean file must match the SHA256 above before using it for runtime tests.

Distributed inference (RPC)

Interactive distributed launcher:

python loadmodel_dist_cli.py

This flow can:

scan private subnets for RPC workers
manage the worker host list
optionally start a local rpc-server
launch llama-cli with --rpc workers

Standalone rpc-server helper:

python rpc_server_cli.py --help
python rpc_server_cli.py --host 0.0.0.0 --port 5515 --devices 0

rpc_server_cli.py requires ./bin/rpc-server to exist.

Web backend

Backend startup:

python web_cli.py --init
python web_cli.py

The backend:

binds to 0.0.0.0 by default
requires a bearer token on every request except GET /api/health
persists managed instances, builds, and benchmark runs in .web_state.json
tees logs to web/logs/<id>.log
can re-adopt orphaned repo-launched llama-server processes on startup
can force that same recovery flow through POST /api/instances/recover

API surface

Health: GET /api/health
Memory: GET /api/memory/gpus, POST /api/memory/plan, POST /api/memory/auto-split
Models: GET /api/models/local, GET /api/models/binary-caps, POST /api/models/download
Instances: GET /POST /api/instances, GET /api/instances/{id}, POST /api/instances/{id}/start|stop|restart, DELETE /api/instances/{id}, POST /api/instances/recover, WS /api/instances/{id}/logs?token=...
Builds: GET /POST /api/builds, GET /api/builds/{id}, POST /api/builds/{id}/stop, WS /api/builds/{id}/logs?token=...
Benchmarks: GET /POST /api/benchmarks, GET /api/benchmarks/{id}, POST /api/benchmarks/{id}/stop, WS /api/benchmarks/{id}/logs?token=...

Full schema: GET /docs

Security notes

The bearer token is the only built-in auth layer.
Keep .web_config.json readable only by you.
Prefer binding to 127.0.0.1 when you do not need remote access.
WebSocket endpoints use ?token= because browsers cannot attach Authorization headers during the upgrade request.
If you expose the backend beyond a trusted LAN, put it behind HTTPS.

Refresh screenshot assets

cd web/frontend
npx playwright install chromium
WEB_BEARER_TOKEN="$(python - <<'PY'
import json
print(json.load(open('../../.web_config.json', 'r', encoding='utf-8'))['token'])
PY
)" npm run screenshots:readme

Convenience launcher

./start uses ./venv/bin/python and offers a small menu:

./start
./start autodevops
./start loadmodel
./start web [--init]
./start --help

Tests

Run unit tests:

python -m unittest discover -s tests

Current tests cover:

CUDA home resolution behavior in autodevops.py
option and config assembly helpers in autodevops_cli.py

Sample scripts

run/ currently includes:

run_qwen30b_llm.sh
run_qwen_embed8b.sh
run_qwen_reranker8b.sh

These are example launchers for fixed ports and model targets.

Name		Name	Last commit message	Last commit date
Latest commit History 105 Commits
docs/screenshots		docs/screenshots
tests		tests
web		web
.codex		.codex
.env.local		.env.local
.gitignore		.gitignore
README.md		README.md
autodevops.py		autodevops.py
autodevops_cli.py		autodevops_cli.py
constants.py		constants.py
keybindings.py		keybindings.py
loadmodel.py		loadmodel.py
loadmodel_cli.py		loadmodel_cli.py
loadmodel_dist_cli.py		loadmodel_dist_cli.py
memory_utils.py		memory_utils.py
process_utils.py		process_utils.py
requirements.txt		requirements.txt
reranker.py		reranker.py
rpc_server_cli.py		rpc_server_cli.py
start		start
state_utils.py		state_utils.py
tui_base.py		tui_base.py
tui_utils.py		tui_utils.py
validators.py		validators.py
web_cli.py		web_cli.py

Folders and files

Latest commit

History

Repository files navigation

llama-cpp-autodeploy

Why this is useful

Why it is easy to use

See the app

Overview

GPU Runtime Detail

Instances

Builds

Benchmarks

Memory Planning

Model Library

Mobile

Start here

1. Clone the repo

2. Install the backend requirements

3. Build the frontend once

4. Initialize and start the backend

5. Open the app

6. Frontend development mode (optional)

What you can do in the UI

How the app fits together

Technical reference

Requirements

Build llama.cpp

Launch services

Distributed inference (RPC)

Web backend

Security notes

Convenience launcher

Tests

Sample scripts

About

Resources

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Uh oh!

Contributors

Uh oh!

Languages

Packages