Bootstrap scripts for rebuilding the local inference stack from this repository.
The scripts are numbered, idempotent, and keep generated source, binaries, and
state under build/ wherever possible.

- `llama.cpp` with the Vulkan backend, linked into `build/bin/`.
- GGUF model files listed in `models.list`.
- Lemonade Server as the front-door OpenAI-compatible API, serving NPU models via FastFlowLM, `qwen36-35b` through Lemonade's llama.cpp Vulkan backend, and `SD-Turbo` through Lemonade's sd-cpp Vulkan backend.


Requests are routed as follows:

    clients
       |
       | OpenAI-compatible requests
       v
    Lemonade :13305/api/v1
       |
       +-- qwen3-4b-FLM         -> FastFlowLM / XDNA2 NPU
       +-- whisper-v3-turbo-FLM -> FastFlowLM / XDNA2 NPU
       +-- qwen36-35b           -> llama.cpp Vulkan / Radeon iGPU
       +-- SD-Turbo             -> sd-cpp Vulkan / Radeon iGPU

The GPU text model is registered in Lemonade as qwen36-35b. Lemonade is the
router; there is no separate llama-router, LiteLLM, Postgres gateway, or
default standalone image server.
The text stack expects a Linux host with:
- AMD Ryzen AI MAX+ 395 class hardware, or another Vulkan-capable host with enough shared memory for the configured models.
- Vulkan runtime and readable render device.
- Mesa 26 or newer for the Vulkan path used by these models.
- `cmake`, `ninja`, `git`, `curl`, a C compiler, `mise`, and Node through `mise`.
- Lemonade Server (`lemond.service`) and FastFlowLM.

00-prereq-check.sh verifies these assumptions without changing the host.
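
As a rough illustration of what a read-only check can look like, the sketch below reports missing tools and looks for a readable render node. It is not the contents of `00-prereq-check.sh`: the function names `missing_tools` and `has_readable_render_node` are hypothetical, and the `/dev/dri/renderD*` location is assumed from the usual Linux DRM layout.

```shell
# Print each named command that is not on PATH, one per line.
# Only `command -v` is used, so nothing on the host is modified.
missing_tools() {
  local tool
  for tool in "$@"; do
    command -v "$tool" >/dev/null 2>&1 || printf '%s\n' "$tool"
  done
}

# Succeed when at least one render node in the given directory is readable
# by the current user. Defaults to /dev/dri (assumed location).
has_readable_render_node() {
  local dir="${1:-/dev/dri}" node
  for node in "$dir"/renderD*; do
    if [ -r "$node" ]; then
      return 0
    fi
  done
  return 1
}

# Report findings on stderr without failing and without changing anything.
missing="$(missing_tools cmake ninja git curl mise)"
if [ -n "$missing" ]; then
  # Unquoted on purpose so the names print on a single line.
  echo "missing tools:" $missing >&2
fi
has_readable_render_node || echo "no readable render node under /dev/dri" >&2
```

Both helpers only inspect the host, which matches the read-only intent of step 00. Checks for the Vulkan runtime and the Mesa version are left out because they depend on tooling this sketch does not assume.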

| Step | Script | What it does |
|---|---|---|
| 00 | `00-prereq-check.sh` | Read-only host checks for toolchain, Vulkan, render device, and Mesa. |
| 10 | `10-llama-cpp.sh` | Builds pinned llama.cpp with Vulkan and links binaries into `build/bin/`. |
| 20 | `20-models.sh` | Downloads GGUF files from `models.list` into `$LOCALLLM_MODELS_DIR`. |
| 40 | `40-npu-lemonade.sh` | Installs/configures Lemonade, FastFlowLM NPU models, Qwen GPU GGUF, and SD-Turbo through sd-cpp. |
| 50 | `50-stable-diffusion.sh` | Optional standalone image stack outside Lemonade: builds stable-diffusion.cpp, fetches SD-Turbo, and smoke-tests image generation. |
| 99 | `99-verify.sh` | Read-only verification for binaries, models, Lemonade service, retired services, and a completion round trip. |

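
Because the steps are numbered, a runner can discover them with a glob and execute them in ascending order. The sketch below shows one way to do that, assuming step scripts match `NN-*.sh`. The names `list_steps` and `run_steps` are hypothetical and this is not the contents of `run-all.sh`. Passing `50-stable-diffusion.sh` as a skip argument leaves the optional standalone image step out.

```shell
# List numbered step scripts in DIR in ascending order, omitting any
# basenames given as further arguments.
list_steps() {
  local dir="$1" script name skip skipped
  shift
  for script in "$dir"/[0-9][0-9]-*.sh; do
    [ -e "$script" ] || continue   # the glob matched nothing
    name="${script##*/}"
    skipped=0
    for skip in "$@"; do
      if [ "$name" = "$skip" ]; then
        skipped=1
      fi
    done
    if [ "$skipped" -eq 0 ]; then
      printf '%s\n' "$name"
    fi
  done
}

# Run the listed steps in order and stop at the first failure.
run_steps() {
  local dir="$1" step
  shift
  # Word splitting is acceptable here: step names are assumed to have no spaces.
  for step in $(list_steps "$dir" "$@"); do
    echo "==> $step"
    "$dir/$step" || return 1
  done
}

# Example (not executed here):
#   run_steps . 50-stable-diffusion.sh
```

Stopping at the first failing step keeps later steps from running against a half-built tree, and idempotent steps make a rerun after a fix cheap.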
`run-all.sh` runs the full Lemonade stack and verification. It does not run the
standalone `50-stable-diffusion.sh` path because Lemonade's SD-Turbo backend is
the default image-generation route.

    ./run-all.sh

To build the optional standalone image stack:

    ./50-stable-diffusion.sh

To skip image-model downloads and the smoke test while checking the build path:

    SD_FETCH_MODELS=0 SD_SMOKE_TEST=0 ./50-stable-diffusion.sh

The docs and helper scripts use 10.0.0.30 as the expected LAN address for the
localLLM host. If this host moves to a different address, set
`LOCALLLM_BIND_HOST` before running the service, helpers, or verification.
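
For ad-hoc client commands, the same override can be honored by deriving the API base URL from the environment. This is a minimal sketch: `lemonade_base_url` is a hypothetical helper name, and the fallbacks are the documented defaults for `LOCALLLM_BIND_HOST` and `LEMONADE_PORT`.

```shell
# Build the Lemonade API base URL from the environment, falling back to the
# documented defaults when a variable is unset or empty.
lemonade_base_url() {
  printf 'http://%s:%s/api/v1\n' \
    "${LOCALLLM_BIND_HOST:-10.0.0.30}" \
    "${LEMONADE_PORT:-13305}"
}

# Example (not executed here):
#   BASE="$(lemonade_base_url)"
#   curl "$BASE/models"
```

Using `${VAR:-default}` rather than `${VAR-default}` means an empty value also falls back to the default, which avoids producing a URL with a blank host.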

Model helper commands:

    # List available Lemonade models.
    ./use-model.sh list

    # Warm a configured model.
    ./use-model.sh load qwen36-35b

    # Unload one model, or all models from Lemonade.
    ./use-model.sh unload qwen36-35b
    ./use-model.sh unload

OpenAI-compatible request:

    BASE="http://${LOCALLLM_BIND_HOST:-10.0.0.30}:${LEMONADE_PORT:-13305}/api/v1"
    curl "$BASE/chat/completions" \
      -H 'Content-Type: application/json' \
      -d '{"model":"qwen36-35b","messages":[{"role":"user","content":"Reply: ok /no_think"}],"max_tokens":8}'

Useful operations:

    curl "$BASE/models"
    journalctl -u lemond -f
    lemonade list --downloaded

Lemonade exposes the image backend as model `SD-Turbo`:

    BASE="http://${LOCALLLM_BIND_HOST:-10.0.0.30}:${LEMONADE_PORT:-13305}/api/v1"
    curl "$BASE/images/generations" \
      -H 'Content-Type: application/json' \
      -d '{"model":"SD-Turbo","prompt":"a clean product render of a small brass desk lamp","size":"512x512","n":1}'

The standalone stable-diffusion.cpp lane is still available if you want SD-Turbo
outside Lemonade. After `50-stable-diffusion.sh` completes, `build/bin/sd-cli` can
be called directly. The example below references FLUX.1-schnell component files by
name, so adjust the paths to match the files actually present in `$SD_MODELS_DIR`:

    SD_MODELS_DIR="${SD_MODELS_DIR:-$HOME/sdmodels}"
    build/bin/sd-cli \
      --diffusion-model "$SD_MODELS_DIR/flux1-schnell-q8_0.gguf" \
      --vae "$SD_MODELS_DIR/ae.safetensors" \
      --clip_l "$SD_MODELS_DIR/clip_l.safetensors" \
      --t5xxl "$SD_MODELS_DIR/t5xxl_fp16.safetensors" \
      --cfg-scale 1.0 \
      --sampling-method euler \
      --steps 4 \
      -W 1024 -H 1024 \
      -p "your prompt, plain white background" \
      -o asset.png

The smoke-test output is `build/sd-smoke.png`.


| Variable | Default | Used by |
|---|---|---|
| `LOCALLLM_MODELS_DIR` | `~/models` | `20-models.sh`, `40-npu-lemonade.sh`, `99-verify.sh` |
| `LOCALLLM_BIND_HOST` | `10.0.0.30` | `40-npu-lemonade.sh`, `use-model.sh`, `99-verify.sh` |
| `LEMONADE_PORT` | `13305` | `40-npu-lemonade.sh`, `use-model.sh`, `99-verify.sh` |
| `LEMONADE_GPU_MODEL_ID` | `qwen36-35b` | `40-npu-lemonade.sh` |
| `LEMONADE_GPU_CHECKPOINT` | Qwen3.6 GGUF Hugging Face checkpoint | `40-npu-lemonade.sh` |
| `LEMONADE_IMAGE_MODEL_ID` | `SD-Turbo` | `40-npu-lemonade.sh` |
| `LEMONADE_IMAGE_SIZE` | `512` | `40-npu-lemonade.sh` |
| `LEMONADE_IMAGE_STEPS` | `4` | `40-npu-lemonade.sh` |
| `LEMONADE_IMAGE_CFG` | `1.0` | `40-npu-lemonade.sh` |
| `LEMONADE_GPU_CTX_SIZE` | `262144` | `40-npu-lemonade.sh` |
| `LEMONADE_LLAMACPP_ARGS` | Qwen-tuned Vulkan args | `40-npu-lemonade.sh` |
| `LLAMA_CPP_REPO` | repository URL configured in the script | `10-llama-cpp.sh` |
| `LLAMA_CPP_REF` | pinned commit in `10-llama-cpp.sh` | `10-llama-cpp.sh` |
| `SD_CPP_REPO` | repository URL configured in the script | `50-stable-diffusion.sh` |
| `SD_CPP_REF` | pinned commit in `50-stable-diffusion.sh` | `50-stable-diffusion.sh` |
| `SD_MODELS_DIR` | `~/sdmodels` | `50-stable-diffusion.sh` |
| `SD_FETCH_MODELS` | `1` | `50-stable-diffusion.sh` |
| `SD_SMOKE_TEST` | `1` | `50-stable-diffusion.sh` |


To add a GGUF model:

- Add the GGUF file to `models.list`.
- Run `./20-models.sh`.
- Register it in `40-npu-lemonade.sh` with `lemonade pull <name> --recipe llamacpp`.
- Run `./40-npu-lemonade.sh`.
- Run `LOCALLLM_VERIFY_MODEL=<name> ./99-verify.sh`.

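
After the steps above, a quick manual check is to see whether the new name shows up in the model list. The sketch below assumes the `/models` response follows the OpenAI list shape, with one `"id"` field per model. `model_listed` is a hypothetical helper that does a crude text match, not real JSON parsing.

```shell
# Read a model list (JSON) on stdin and succeed if it contains an entry whose
# "id" equals the first argument. All whitespace is stripped first so compact
# and pretty-printed JSON both match. The match is a fixed string, so dots
# and dashes in model names are taken literally.
model_listed() {
  tr -d '[:space:]' | grep -qF "\"id\":\"$1\""
}

# Example (not executed here), with BASE set as in the request examples:
#   curl -fsS "$BASE/models" | model_listed qwen36-35b && echo "listed"
```

A text match like this is only a convenience for interactive use; `99-verify.sh` remains the check that exercises a full completion round trip.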
build/ contains generated source checkouts, compiled binaries, linked binaries,
downloaded helper binaries, and marker files. Lemonade model registrations and
backend state are managed by Lemonade Server.
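
Marker files are a common way to make a step idempotent: the step records that it finished, and a later run skips the work. The sketch below shows the general pattern only. `run_once` and the marker path in the example are hypothetical, and the actual marker layout under `build/` is defined by the scripts themselves.

```shell
# Run a command only if its marker file does not exist yet, and create the
# marker after the command succeeds. A failed command leaves no marker, so
# the next invocation retries it.
run_once() {
  local marker="$1" status
  shift
  if [ -e "$marker" ]; then
    return 0
  fi
  "$@"
  status=$?
  if [ "$status" -ne 0 ]; then
    return "$status"
  fi
  mkdir -p "$(dirname "$marker")"
  : > "$marker"
}

# Example (not executed here); the marker name is made up for illustration:
#   run_once build/.example-step.done ./some-expensive-step.sh
```

Deleting a marker forces that one step to run again without touching the rest of `build/`.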