A standalone Python worker that produces first-last-frame walkthrough videos on a local GPU (RTX 5090, 32 GB) using LTX-2.3 22B Distilled FP8 via ComfyUI, then ships the result to a remote queue.
This is the local-inference half of a hybrid pipeline: a hosted Next.js API queues jobs in Postgres, this worker polls for them, generates the video, and uploads the MP4 back. Replacing a hosted inference API with a local-GPU worker drops per-clip marginal cost to electricity.
- Registers with the API and polls /api/videogen/pending every 5 s (auth via x-videogen-secret header).
- Claims a job, downloads its source frames (2-6 images), and spins up ComfyUI on demand.
- For each adjacent pair of frames, builds a ComfyUI API workflow targeting LTX-2.3 (8 steps, CFG = 1, NAG) and submits it. Output is trimmed to exactly 5.0 s with ffmpeg.
- After each clip, ComfyUI is restarted (image-cache flush) and node IDs are offset for the next clip (workflow-hash bust).
- Once every clip is done they are concatenated, the watermark PNG is overlaid for free-tier jobs, the result is PUT to Vercel Blob, and PATCH /api/videogen/{id}/complete is called (with 3-retry exponential backoff).
- A background heartbeat thread pings /api/videogen/heartbeat every 15 s so the API never marks the worker dead during a long job.
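The heartbeat thread and the retried completion call in the list above can be sketched as follows. This is a minimal illustration, not the code in ltx-worker.py: the helper names `start_heartbeat` and `with_backoff`, the 1 s base delay, and the injected `ping`/`call` callables (which would wrap the actual HTTP requests) are all assumptions.

```python
import threading
import time


def start_heartbeat(ping, interval_s=15.0):
    """Call ping() every interval_s seconds on a daemon thread.

    Returns a threading.Event; set it to stop the heartbeat.
    """
    stop = threading.Event()

    def loop():
        # Event.wait() returns False on timeout and True once stop is set,
        # so the loop sleeps between pings and exits promptly on shutdown.
        while not stop.wait(interval_s):
            try:
                ping()
            except Exception as exc:
                # A failed ping must not kill the thread during a long job.
                print(f"heartbeat failed: {exc}")

    threading.Thread(target=loop, daemon=True).start()
    return stop


def with_backoff(call, retries=3, base_delay_s=1.0, sleep=time.sleep):
    """Run call(); on failure retry up to `retries` times, doubling the delay."""
    for attempt in range(retries + 1):
        try:
            return call()
        except Exception:
            if attempt == retries:
                raise  # out of retries: surface the last error
            sleep(base_delay_s * 2 ** attempt)
```

Passing `sleep` as a parameter keeps the retry helper testable without real delays; the worker would simply use the default.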
The Dev FP8 checkpoint (ltx-2.3-22b-dev-fp8.safetensors) renders fine on
4090-class hardware but ships visible "screen-door" artifacts on Blackwell
(5090). The bug is tracked upstream as ComfyUI-LTXVideo#379. Distilled
FP8 does not have this regression and is also faster (8 steps vs 30).
Distilled checkpoints are trained without classifier-free guidance, so any CFG above 1.0 visibly degrades them — colors flatten and motion stiffens. The 8-step sigma schedule
1.0, 0.99375, 0.9875, 0.98125, 0.975, 0.909375, 0.725, 0.421875, 0.0
comes from Lightricks' reference code and is hard-coded into ManualSigmas;
deviating from these exact values is the most reliable way to wreck output
quality. euler_ancestral_cfg_pp is the matched sampler — euler alone
works, but the ancestral variant gives slightly cleaner temporal motion at
the same step count.
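Because a single mistyped value is enough to hurt output, it can help to validate the schedule before it goes into the workflow. The sketch below checks the values quoted above and renders them as a comma-separated string; that ManualSigmas accepts the schedule in that string form is an assumption here, and the helper names are hypothetical.

```python
# Sigma schedule quoted above: 9 boundary values define 8 denoising steps.
LTX_DISTILLED_SIGMAS = [
    1.0, 0.99375, 0.9875, 0.98125, 0.975,
    0.909375, 0.725, 0.421875, 0.0,
]


def validate_sigmas(sigmas, steps=8):
    """Raise ValueError unless sigmas runs strictly downward from 1.0 to 0.0."""
    if len(sigmas) != steps + 1:
        raise ValueError(f"{steps} steps need {steps + 1} sigmas, got {len(sigmas)}")
    if sigmas[0] != 1.0 or sigmas[-1] != 0.0:
        raise ValueError("schedule must start at 1.0 and end at 0.0")
    if any(a <= b for a, b in zip(sigmas, sigmas[1:])):
        raise ValueError("sigmas must be strictly decreasing")


def sigmas_to_string(sigmas):
    """Render the schedule as a comma-separated string (assumed node input format)."""
    return ", ".join(str(s) for s in sigmas)


validate_sigmas(LTX_DISTILLED_SIGMAS)
```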
The text encoder ships in two formats: gemma_3_12B_it_fp8_scaled.safetensors
and gemma-3-12b-it-qat-Q4_K_M.gguf. The FP8 variant uses a kernel path
that segfaults on Blackwell SMs (cu130 / driver 590). GGUF Q4 is loaded
through DualCLIPLoaderGGUF and bypasses the broken kernel entirely while
giving up no meaningful prompt fidelity at the resolutions we render.
NAG (Normalized Attention Guidance) modifies cross-attention in place,
costing one forward pass. STG would also work but needs three passes
(positive + negative + perturbed), which roughly triples per-step latency.
With the distilled model at 8 steps already producing acceptable output,
the additional cost is not worth it. Tuned values (scale 3.0, tau 2.0,
alpha 0.2) are well below the KJNodes defaults; at 1080p the higher
defaults over-sharpen without adding perceptible detail.
LTXVAddGuide is the documented FLF2V primitive: pin frame 0 to one image,
frame -1 to another, generate the middle. In practice it produces a
"bounceback" — the last ~2 s of the clip drifts back toward the first frame
because the negative conditioning leaks across the time axis. The KJNodes
"Inplace" variant writes the encoded frames directly into the latent at
fixed time positions, which is rigid enough to prevent the artifact. The
two-image input is wired through dotted names (image.1, image.2,
strength.1, …), a KJNodes-specific calling convention used when num_images > 1.
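The dotted-name convention is easiest to see in code. The sketch below builds only the image.N / strength.N / num_images portion of a node's inputs, using the standard ComfyUI API link form `[source_node_id, output_index]`. The helper name is hypothetical, and the other inputs the node needs (latent, VAE, frame positions) are left out because they are not described here.

```python
def dotted_image_inputs(image_links, strengths):
    """Build the dotted-name inputs for a multi-image node.

    image_links: ComfyUI API links, each [source_node_id, output_index].
    strengths:   one guide strength per image.
    """
    if len(image_links) != len(strengths):
        raise ValueError("need exactly one strength per image")
    inputs = {"num_images": len(image_links)}
    # Names are 1-based: image.1, strength.1, image.2, strength.2, ...
    for i, (link, strength) in enumerate(zip(image_links, strengths), start=1):
        inputs[f"image.{i}"] = link
        inputs[f"strength.{i}"] = strength
    return inputs


# First and last frame, each coming from a separate (hypothetical) LoadImage node.
first_last = dotted_image_inputs([["10", 0], ["11", 0]], [1.0, 1.0])
```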
For the free tier the watermark PNG is part of the product contract. The worker checks for it before running inference and refuses the job if it's missing, rather than discovering 4 minutes in that it can't legally publish the output. A previous version silently passed through unwatermarked video when the file was absent on a fresh machine.
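A preflight check along these lines captures the idea. It is a sketch under assumptions: the function and exception names are hypothetical, and the path would come from WATERMARK_PATH.

```python
from pathlib import Path


class MissingWatermarkError(RuntimeError):
    """Raised when a job needs a watermark but the PNG is not on disk."""


def preflight_watermark(job_needs_watermark, watermark_path):
    """Return the watermark Path, or None when the job does not need one.

    Raises MissingWatermarkError before any inference is started, so the
    job is refused up front instead of failing after the clips are rendered.
    """
    if not job_needs_watermark:
        return None
    path = Path(watermark_path) if watermark_path else None
    if path is None or not path.is_file():
        raise MissingWatermarkError(
            f"watermark required but not found at {watermark_path!r}"
        )
    return path
```

Raising an exception, rather than returning a flag, makes it hard for a caller to fall through to the unwatermarked path by accident.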
Two independent caches will return stale results if you don't fight them:
- LoadImage cache: ComfyUI hashes images by filename. We upload each pair with a unique uuid-suffixed name and restart ComfyUI between clips.
- Workflow-result cache: the prompt endpoint dedupes identical workflows. Adding node_offset = clip_index * 1000 to every node ID produces a fresh workflow hash even when the only difference is which two images get loaded.
Running both defences is overkill for either failure mode on its own, but it is appropriate as a belt-and-braces measure when a single bad clip ruins the whole stitched output.
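Both cache busts are small enough to sketch. The version below assumes an API-format workflow (a dict mapping numeric string node IDs to `{"class_type", "inputs"}`), where a link between nodes is a two-element list `[source_node_id, output_index]`. The function names are hypothetical and the worker's own implementation may differ.

```python
import uuid
from pathlib import Path


def unique_upload_name(original_name):
    """Append a random uuid so a filename-keyed image cache never hits."""
    p = Path(original_name)
    return f"{p.stem}-{uuid.uuid4().hex}{p.suffix}"


def offset_node_ids(workflow, clip_index, stride=1000):
    """Return a copy of the workflow with every node ID shifted by clip_index * stride."""
    offset = clip_index * stride

    def shift(node_id):
        return str(int(node_id) + offset)

    def remap(value):
        # Rewrite links so they keep pointing at the renamed source node;
        # plain values (strings, numbers) pass through untouched.
        is_link = (
            isinstance(value, list)
            and len(value) == 2
            and isinstance(value[0], str)
            and value[0] in workflow
            and isinstance(value[1], int)
        )
        return [shift(value[0]), value[1]] if is_link else value

    return {
        shift(node_id): {
            **node,
            "inputs": {k: remap(v) for k, v in node.get("inputs", {}).items()},
        }
        for node_id, node in workflow.items()
    }
```

Shifting the IDs without also rewriting the links would leave inputs pointing at node IDs that no longer exist, which is why `remap` is applied to every input value.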
ltx-worker.py # the worker
workflows/ltx23_flf2v_api.json # reference ComfyUI API JSON (3060-era GGUF baseline)
systemd/ltx-worker.service # generic systemd user service template
.env.example # required environment variables
requirements.txt # pip deps
python3 -m venv venv && source venv/bin/activate
pip install -r requirements.txt
cp .env.example .env
# fill in API_URL, VIDEOGEN_WORKER_SECRET, BLOB_READ_WRITE_TOKEN,
# COMFYUI_DIR, COMFYUI_PYTHON, WATERMARK_PATH
set -a; source .env; set +a   # export the variables so the worker process inherits them
python -u ltx-worker.py

Or install as a user-level systemd service; see systemd/ltx-worker.service.
ComfyUI must be reachable on $COMFYUI_URL (default http://localhost:8188)
or installable at $COMFYUI_DIR; the worker boots it on demand. Required
ComfyUI custom-node packs:
- ComfyUI-GGUF: DualCLIPLoaderGGUF
- ComfyUI-LTXVideo: LTXV* nodes, ManualSigmas
- ComfyUI-KJNodes: LTXVImgToVideoInplaceKJ, LTX2_NAG
- ComfyUI-VideoHelperSuite: VHS_VideoCombine
- Polling target shape: { job: { id, stageData: { frames: string[], clipPrompts: string[], watermark: bool } } | null }. A back-compat fallback handles the older {imageUrl, endImageUrl, klingPrompt} single-clip stageData shape.
- Output Blob path: spatial-story/output/{jobId}.mp4.
- Per-clip ComfyUI budget is 10 min; per-job wall-clock budget is 12 min. Both are checked between clips so a stuck job fails fast rather than hanging the queue.
- The included workflows/ltx23_flf2v_api.json is the original GGUF workflow from the 3060 era; the worker builds its own FP8 workflow at runtime, but the JSON is useful as a node-graph reference when debugging in the ComfyUI UI.
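The two stageData shapes can be folded into one before any clip is built. The sketch below is illustrative only: the function names are hypothetical, and two behaviours are assumptions rather than documented contract, namely padding missing prompts with empty strings and treating an absent watermark flag as true (the fail-safe choice).

```python
def normalize_stage_data(stage_data):
    """Convert either stageData shape into {frames, clipPrompts, watermark}."""
    if "frames" in stage_data:
        frames = list(stage_data["frames"])
        prompts = list(stage_data.get("clipPrompts") or [])
    else:
        # Older single-clip shape: exactly two images and one prompt.
        frames = [stage_data["imageUrl"], stage_data["endImageUrl"]]
        prompts = [stage_data.get("klingPrompt", "")]

    if not 2 <= len(frames) <= 6:
        raise ValueError(f"expected 2-6 frames, got {len(frames)}")

    clip_count = len(frames) - 1  # one clip per adjacent pair of frames
    # Assumption: pad missing prompts with "" and ignore any extras.
    prompts = (prompts + [""] * clip_count)[:clip_count]
    # Assumption: a missing flag means "watermark", never "skip it".
    watermark = bool(stage_data.get("watermark", True))
    return {"frames": frames, "clipPrompts": prompts, "watermark": watermark}


def clip_jobs(stage):
    """Pair each adjacent (start, end) frame with its prompt."""
    frames = stage["frames"]
    return list(zip(frames, frames[1:], stage["clipPrompts"]))
```

For a three-frame job this yields two clips, (frame 1, frame 2) and (frame 2, frame 3), each carrying its own prompt.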
MIT — see LICENSE.
Author: Gal Cohen — github.com/outlast85