localLLM

Bootstrap scripts for rebuilding the local inference stack from this repository. The scripts are numbered, idempotent, and keep generated source, binaries, and state under build/ wherever possible.

What This Builds

- llama.cpp with the Vulkan backend, with binaries linked into build/bin/.
- GGUF model files listed in models.list.
- Lemonade Server as the front-door OpenAI-compatible API. It serves NPU models via FastFlowLM, qwen36-35b through Lemonade's llama.cpp Vulkan backend, and SD-Turbo through Lemonade's sd-cpp Vulkan backend.

Architecture

clients
  |
  | OpenAI-compatible requests
  v
Lemonade :13305/api/v1
  |
  +-- qwen3-4b-FLM          -> FastFlowLM / XDNA2 NPU
  +-- whisper-v3-turbo-FLM  -> FastFlowLM / XDNA2 NPU
  +-- qwen36-35b            -> llama.cpp Vulkan / Radeon iGPU
  +-- SD-Turbo              -> sd-cpp Vulkan / Radeon iGPU

The GPU text model is registered in Lemonade as qwen36-35b. Lemonade is the router; there is no separate llama-router, LiteLLM, Postgres gateway, or default standalone image server.

Host Assumptions

The text stack expects a Linux host with:

- AMD Ryzen AI MAX+ 395 class hardware, or another Vulkan-capable host with enough shared memory for the configured models.
- A Vulkan runtime and a readable render device.
- Mesa 26 or newer for the Vulkan path used by these models.
- cmake, ninja, git, curl, a C/C++ compiler, mise, and Node installed through mise.
- Lemonade Server (lemond.service) and FastFlowLM.

00-prereq-check.sh verifies these assumptions without changing the host.
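
The kind of read-only check described above can be sketched as follows. This is a minimal illustration, not the contents of 00-prereq-check.sh: the function names `missing_tools` and `has_readable_render_node` are hypothetical, and the `/dev/dri/renderD*` path is the usual Linux location for render nodes, assumed here.

```shell
#!/usr/bin/env bash
# Hypothetical helpers; they only inspect the host and change nothing.

# Print the name of each listed command that is not on PATH.
missing_tools() {
  local tool
  for tool in "$@"; do
    command -v "$tool" >/dev/null 2>&1 || printf '%s\n' "$tool"
  done
}

# Succeed if at least one DRM render node is readable by the current user.
# Assumes the conventional /dev/dri/renderD* device path.
has_readable_render_node() {
  local node
  for node in /dev/dri/renderD*; do
    [ -r "$node" ] && return 0
  done
  return 1
}
```

For example, `missing_tools cmake ninja git curl cc mise` prints one line per missing tool and prints nothing when all of them are present.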

Bootstrap Scripts

| Step | Script | What it does |
| --- | --- | --- |
| 00 | `00-prereq-check.sh` | Read-only host checks for toolchain, Vulkan, render device, and Mesa. |
| 10 | `10-llama-cpp.sh` | Builds pinned `llama.cpp` with Vulkan and links binaries into `build/bin/`. |
| 20 | `20-models.sh` | Downloads GGUF files from `models.list` into `$LOCALLLM_MODELS_DIR`. |
| 40 | `40-npu-lemonade.sh` | Installs/configures Lemonade, FastFlowLM NPU models, Qwen GPU GGUF, and SD-Turbo through sd-cpp. |
| 50 | `50-stable-diffusion.sh` | Optional standalone image stack outside Lemonade: builds `stable-diffusion.cpp`, fetches SD-Turbo, and smoke-tests image generation. |
| 99 | `99-verify.sh` | Read-only verification for binaries, models, Lemonade service, retired services, and a completion round trip. |

run-all.sh runs the full Lemonade stack and verification. It does not run the standalone 50-stable-diffusion.sh path because Lemonade's SD-Turbo backend is the default image-generation route.
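
The ordering idea can be sketched like this. It is an illustration only: the exact list of scripts that run-all.sh invokes is not spelled out here, so the sketch assumes "every numbered script except step 50", and `select_steps` is a hypothetical name.

```shell
#!/usr/bin/env bash
# Hypothetical helper: print the given script names in numeric order,
# omitting the optional standalone step 50.
select_steps() {
  printf '%s\n' "$@" | sort -n | grep -v '^50-'
}
```

A runner built on it could loop over `select_steps [0-9][0-9]-*.sh` and stop at the first script that exits non-zero.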

Quick Start

./run-all.sh

To build the optional standalone image stack:

./50-stable-diffusion.sh

To skip image-model downloads while checking the build path:

SD_FETCH_MODELS=0 SD_SMOKE_TEST=0 ./50-stable-diffusion.sh

Text Gateway Usage

Network Default

The docs and helper scripts use 10.0.0.30 as the expected LAN address for the localLLM host. If this host moves to a different address, set LOCALLLM_BIND_HOST before running the service, helpers, or verification.
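
As a small sketch of how that override can feed the request URL, the function below builds the base URL from LOCALLLM_BIND_HOST and LEMONADE_PORT with the defaults documented in this README. The function name `lemonade_base_url` is hypothetical and is not part of the helper scripts.

```shell
#!/usr/bin/env bash
# Hypothetical helper: build the Lemonade base URL from the documented
# variables, falling back to the documented defaults when they are unset.
lemonade_base_url() {
  local host="${LOCALLLM_BIND_HOST:-10.0.0.30}"
  local port="${LEMONADE_PORT:-13305}"
  printf 'http://%s:%s/api/v1\n' "$host" "$port"
}
```

With this in scope, `BASE="$(lemonade_base_url)"` can stand in for a hard-coded address in the curl examples.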

# List available Lemonade models.
./use-model.sh list

# Warm a configured model.
./use-model.sh load qwen36-35b

# Unload one model, or all models from Lemonade.
./use-model.sh unload qwen36-35b
./use-model.sh unload

OpenAI-compatible request:

BASE="http://10.0.0.30:${LEMONADE_PORT:-13305}/api/v1"
curl "$BASE/chat/completions" \
  -H 'Content-Type: application/json' \
  -d '{"model":"qwen36-35b","messages":[{"role":"user","content":"Reply: ok /no_think"}],"max_tokens":8}'

Useful operations:

curl "$BASE/models"
journalctl -u lemond -f
lemonade list --downloaded
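
If the endpoint is queried right after a service restart or a model load, a short retry loop can help. The sketch below is an assumption about usage, not something the repository scripts are known to do, and `retry` is a hypothetical helper name.

```shell
#!/usr/bin/env bash
# Hypothetical helper: run a command up to ATTEMPTS times, sleeping DELAY
# seconds between failed attempts. Returns 0 on the first success, else 1.
retry() {
  local attempts="$1" delay="$2" i
  shift 2
  for ((i = 1; i <= attempts; i++)); do
    "$@" && return 0
    ((i < attempts)) && sleep "$delay"
  done
  return 1
}
```

For example, `retry 30 2 curl -fsS -o /dev/null "$BASE/models"` polls the model list for up to about a minute.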

Image Generation Usage

Lemonade exposes the image backend as model SD-Turbo:

BASE="http://10.0.0.30:${LEMONADE_PORT:-13305}/api/v1"
curl "$BASE/images/generations" \
  -H 'Content-Type: application/json' \
  -d '{"model":"SD-Turbo","prompt":"a clean product render of a small brass desk lamp","size":"512x512","n":1}'

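The standalone stable-diffusion.cpp lane is still available if you want to generate images outside Lemonade. Note that the example below references FLUX.1-schnell files rather than SD-Turbo, so those files must already be present in $SD_MODELS_DIR. After 50-stable-diffusion.sh completes: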

SD_MODELS_DIR="${SD_MODELS_DIR:-$HOME/sdmodels}"
build/bin/sd-cli \
  --diffusion-model "$SD_MODELS_DIR/flux1-schnell-q8_0.gguf" \
  --vae "$SD_MODELS_DIR/ae.safetensors" \
  --clip_l "$SD_MODELS_DIR/clip_l.safetensors" \
  --t5xxl "$SD_MODELS_DIR/t5xxl_fp16.safetensors" \
  --cfg-scale 1.0 \
  --sampling-method euler \
  --steps 4 \
  -W 1024 -H 1024 \
  -p "your prompt, plain white background" \
  -o asset.png

The smoke-test output is build/sd-smoke.png.
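
One way to sanity-check that output is to confirm the file starts with the 8-byte PNG signature. This is a minimal sketch; `is_png` is a hypothetical helper and is not part of 50-stable-diffusion.sh.

```shell
#!/usr/bin/env bash
# Hypothetical helper: succeed if FILE begins with the 8-byte PNG signature
# (89 50 4e 47 0d 0a 1a 0a).
is_png() {
  local file="$1" magic
  [ -r "$file" ] || return 1
  magic="$(head -c 8 "$file" | od -An -tx1 | tr -d ' \n')"
  [ "$magic" = "89504e470d0a1a0a" ]
}
```

For example, `is_png build/sd-smoke.png && echo "smoke image looks valid"`.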

Environment Variables

| Variable | Default | Used by |
| --- | --- | --- |
| `LOCALLLM_MODELS_DIR` | `~/models` | `20-models.sh`, `40-npu-lemonade.sh`, `99-verify.sh` |
| `LOCALLLM_BIND_HOST` | `10.0.0.30` | `40-npu-lemonade.sh`, `use-model.sh`, `99-verify.sh` |
| `LEMONADE_PORT` | `13305` | `40-npu-lemonade.sh`, `use-model.sh`, `99-verify.sh` |
| `LEMONADE_GPU_MODEL_ID` | `qwen36-35b` | `40-npu-lemonade.sh` |
| `LEMONADE_GPU_CHECKPOINT` | Qwen3.6 GGUF Hugging Face checkpoint | `40-npu-lemonade.sh` |
| `LEMONADE_IMAGE_MODEL_ID` | `SD-Turbo` | `40-npu-lemonade.sh` |
| `LEMONADE_IMAGE_SIZE` | `512` | `40-npu-lemonade.sh` |
| `LEMONADE_IMAGE_STEPS` | `4` | `40-npu-lemonade.sh` |
| `LEMONADE_IMAGE_CFG` | `1.0` | `40-npu-lemonade.sh` |
| `LEMONADE_GPU_CTX_SIZE` | `262144` | `40-npu-lemonade.sh` |
| `LEMONADE_LLAMACPP_ARGS` | Qwen-tuned Vulkan args | `40-npu-lemonade.sh` |
| `LLAMA_CPP_REPO` | repository URL configured in the script | `10-llama-cpp.sh` |
| `LLAMA_CPP_REF` | pinned commit in `10-llama-cpp.sh` | `10-llama-cpp.sh` |
| `SD_CPP_REPO` | repository URL configured in the script | `50-stable-diffusion.sh` |
| `SD_CPP_REF` | pinned commit in `50-stable-diffusion.sh` | `50-stable-diffusion.sh` |
| `SD_MODELS_DIR` | `~/sdmodels` | `50-stable-diffusion.sh` |
| `SD_FETCH_MODELS` | `1` | `50-stable-diffusion.sh` |
| `SD_SMOKE_TEST` | `1` | `50-stable-diffusion.sh` |

Adding a Text Model

1. Add the GGUF file to models.list.
2. Run ./20-models.sh.
3. Register it in 40-npu-lemonade.sh with lemonade pull <name> --recipe llamacpp.
4. Run ./40-npu-lemonade.sh.
5. Run LOCALLLM_VERIFY_MODEL=<name> ./99-verify.sh.
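
The script-driven part of the procedure above can be sketched as a dry run. Editing models.list and registering the model in 40-npu-lemonade.sh remain manual, since their exact formats are not described here. `add_text_model_plan` is a hypothetical name, and it only prints commands rather than running them.

```shell
#!/usr/bin/env bash
# Hypothetical helper: print the commands to run, in order, once models.list
# and 40-npu-lemonade.sh have been edited by hand for the new model.
add_text_model_plan() {
  local name="${1:?usage: add_text_model_plan <name>}"
  printf '%s\n' \
    "./20-models.sh" \
    "./40-npu-lemonade.sh" \
    "LOCALLLM_VERIFY_MODEL=${name} ./99-verify.sh"
}
```

Printing the plan first makes it easy to confirm the model name before any download or service change happens.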

Generated State

build/ contains generated source checkouts, compiled binaries, linked binaries, downloaded helper binaries, and marker files. Lemonade model registrations and backend state are managed by Lemonade Server.
