LTX-2.3 FLF2V Worker

A standalone Python worker that produces first-last-frame walkthrough videos on a local GPU (RTX 5090, 32 GB) using LTX-2.3 22B Distilled FP8 via ComfyUI, then ships the result to a remote queue.

This is the local-inference half of a hybrid pipeline: a hosted Next.js API queues jobs in Postgres, this worker polls for them, generates the video, and uploads the MP4 back. Replacing a hosted inference API with a local-GPU worker drops per-clip marginal cost to electricity.

What it does

Registers with the API and polls /api/videogen/pending every 5 s (auth via x-videogen-secret header).
Claims a job, downloads its source frames (2 to 6 images), and spins up ComfyUI on demand.
For each adjacent pair of frames, builds a ComfyUI API workflow targeting LTX-2.3 (8 steps, CFG = 1, NAG) and submits it. Output is trimmed to exactly 5.0 s with ffmpeg.
After each clip ComfyUI is restarted (image-cache flush) and node IDs are offset for the next clip (workflow-hash bust).
Once every clip is done they are concatenated, the watermark PNG is overlaid for free-tier jobs, the result is PUT to Vercel Blob, and PATCH /api/videogen/{id}/complete is called (with 3-retry exponential backoff).
A background heartbeat thread pings /api/videogen/heartbeat every 15 s so the API never marks the worker dead during a long job.
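The retry behaviour on the completion call can be sketched roughly as below. This is an illustration, not the worker's actual code: the name `call_with_backoff`, the 1 s base delay, and the injectable `sleep` parameter are assumptions chosen so the sketch runs without a network.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def call_with_backoff(
    fn: Callable[[], T],
    retries: int = 3,
    base_delay: float = 1.0,  # assumed value, not taken from the worker
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call fn; on failure retry up to `retries` times, doubling the wait each time."""
    attempt = 0
    while True:
        try:
            return fn()
        except Exception:
            if attempt >= retries:
                raise  # out of retries: surface the last error
            sleep(base_delay * (2 ** attempt))  # 1 s, 2 s, 4 s with the defaults
            attempt += 1
```

In the worker, `fn` would wrap the PATCH request and raise on a non-2xx response; passing a fake `sleep` keeps the helper easy to test.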

Why these particular settings

LTX-2.3 22B Distilled FP8, not Dev FP8

The Dev FP8 checkpoint (ltx-2.3-22b-dev-fp8.safetensors) renders fine on 4090-class hardware but produces visible "screen-door" artifacts on Blackwell (5090). The bug is tracked upstream as ComfyUI-LTXVideo#379. Distilled FP8 does not have this regression and is also faster (8 steps vs 30).

CFG = 1.0, 8 steps, `euler_ancestral_cfg_pp`

Distilled checkpoints are trained without classifier-free guidance, so any CFG above 1.0 visibly degrades them — colors flatten and motion stiffens. The 8-step sigma schedule

1.0, 0.99375, 0.9875, 0.98125, 0.975, 0.909375, 0.725, 0.421875, 0.0

comes from Lightricks' reference code and is hard-coded into ManualSigmas; deviating from these exact values is the most reliable way to wreck output quality. euler_ancestral_cfg_pp is the matched sampler — euler alone works, but the ancestral variant gives slightly cleaner temporal motion at the same step count.
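Because the schedule must not drift, a small guard like the one below can catch accidental edits. It is a sketch only: the names `SIGMAS`, `check_sigmas`, and `sigmas_to_string` are hypothetical, and passing the schedule to ManualSigmas as a comma-separated string is an assumption about that node's input.

```python
# The 8-step schedule quoted above: 8 steps need 9 sigma values.
SIGMAS = [1.0, 0.99375, 0.9875, 0.98125, 0.975, 0.909375, 0.725, 0.421875, 0.0]


def check_sigmas(sigmas, steps=8):
    """Raise ValueError if the schedule is not a valid `steps`-step descent from 1 to 0."""
    if len(sigmas) != steps + 1:
        raise ValueError(f"expected {steps + 1} sigmas, got {len(sigmas)}")
    if sigmas[0] != 1.0 or sigmas[-1] != 0.0:
        raise ValueError("schedule must start at 1.0 and end at 0.0")
    if any(a <= b for a, b in zip(sigmas, sigmas[1:])):
        raise ValueError("schedule must be strictly decreasing")


def sigmas_to_string(sigmas):
    """Format the schedule as a comma-separated string (assumed node input format)."""
    return ", ".join(str(s) for s in sigmas)


check_sigmas(SIGMAS)
```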

GGUF Gemma 3, not FP8 Gemma

The text encoder ships in two formats: gemma_3_12B_it_fp8_scaled.safetensors and gemma-3-12b-it-qat-Q4_K_M.gguf. The FP8 variant uses a kernel path that segfaults on Blackwell SMs (cu130 / driver 590). GGUF Q4 is loaded through DualCLIPLoaderGGUF and bypasses the broken kernel entirely, with no meaningful loss of prompt fidelity at the resolutions we render.

NAG over STG/APG

NAG (Normalized Attention Guidance) modifies cross-attention in place, costing one forward pass. STG would also work but needs three passes (positive + negative + perturbed), which roughly triples per-step latency relative to NAG's single pass. With the distilled model at 8 steps already producing acceptable output, the additional cost is not worth it. Tuned values (scale 3.0, tau 2.0, alpha 0.2) are well below the KJNodes defaults; at 1080p the higher defaults over-sharpen without any perceptible gain in detail.

`LTXVImgToVideoInplaceKJ`, not `LTXVAddGuide`

LTXVAddGuide is the documented FLF2V primitive: pin frame 0 to one image, frame -1 to another, generate the middle. In practice it produces a "bounceback" — the last ~2 s of the clip drifts back toward the first frame because the negative conditioning leaks across the time axis. The KJNodes "Inplace" variant writes the encoded frames directly into the latent at fixed time positions, which is rigid enough to prevent the artifact. The two-image input is wired through dotted names (image.1, image.2, strength.1, …) — KJNodes-specific calling convention when num_images > 1.
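The dotted naming convention described above can be generated rather than typed by hand. The helper below is a hypothetical sketch: `dotted_image_inputs` is not part of KJNodes, `strength.2` is extrapolated from the `image.1, image.2, strength.1, …` pattern, and the node's other required inputs are deliberately left out because this README does not list them.

```python
from typing import Any, Dict, Sequence


def dotted_image_inputs(images: Sequence[Any], strengths: Sequence[float]) -> Dict[str, Any]:
    """Build the image.N / strength.N input entries for a multi-image node.

    `images` holds ComfyUI API links of the form [node_id, output_index].
    """
    if len(images) != len(strengths):
        raise ValueError("need exactly one strength per image")
    if len(images) < 2:
        raise ValueError("dotted names only apply when num_images > 1")
    inputs: Dict[str, Any] = {"num_images": len(images)}
    for i, (image, strength) in enumerate(zip(images, strengths), start=1):
        inputs[f"image.{i}"] = image        # 1-based, matching image.1, image.2
        inputs[f"strength.{i}"] = strength
    return inputs
```

For a first/last-frame pair this would be called with two links, for example `dotted_image_inputs([["10", 0], ["11", 0]], [1.0, 1.0])`, where the node IDs and strength values are placeholders.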

Watermark as a hard error

For the free tier the watermark PNG is part of the product contract. The worker checks for it before running inference and refuses the job if it's missing, rather than discovering 4 minutes in that it can't legally publish the output. A previous version silently passed through unwatermarked video when the file was absent on a fresh machine.
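A pre-flight check of this kind might look like the sketch below. The exception class and function name are hypothetical; only the behaviour (fail before inference when a required watermark file is absent) comes from the description above.

```python
from pathlib import Path
from typing import Optional


class MissingWatermarkError(RuntimeError):
    """Raised when a job needs a watermark but the PNG is not available."""


def require_watermark(needs_watermark: bool, watermark_path: Optional[str]) -> Optional[Path]:
    """Return the watermark path for jobs that need one, or None for jobs that do not.

    Raises before any GPU time is spent if the file is missing.
    """
    if not needs_watermark:
        return None
    if not watermark_path:
        raise MissingWatermarkError("WATERMARK_PATH is not set")
    path = Path(watermark_path)
    if not path.is_file():
        raise MissingWatermarkError(f"watermark file not found: {path}")
    return path
```

Calling this immediately after claiming a job, and before ComfyUI is started, means a misconfigured machine rejects the job in milliseconds instead of after a full render.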

Cache-bust between clips

Two independent caches will return stale results if you don't fight them:

LoadImage cache: ComfyUI hashes images by filename. We upload each pair with a unique uuid-suffixed name and restart ComfyUI between clips.
Workflow-result cache: the prompt endpoint dedupes identical workflows. Adding a node_offset = clip_index * 1000 to every node ID produces a fresh workflow hash even when the only difference is which two images get loaded.

Using both defences is overkill for either failure mode on its own, but it is an appropriate belt-and-braces measure when a single bad clip ruins the whole stitched output.
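Both cache-busting tricks can be sketched in a few lines. The code assumes the ComfyUI API workflow format (a dict keyed by node ID, with links written as `[node_id, output_index]`) and numeric node IDs; the function names are hypothetical and the worker's real implementation may differ.

```python
import os
import uuid


def unique_name(filename: str) -> str:
    """Append a random uuid suffix so the upload never reuses a cached filename."""
    stem, ext = os.path.splitext(filename)
    return f"{stem}-{uuid.uuid4().hex}{ext}"


def offset_node_ids(workflow: dict, clip_index: int, stride: int = 1000) -> dict:
    """Return a copy of the workflow with every node ID shifted by clip_index * stride."""
    offset = clip_index * stride

    def shift(node_id: str) -> str:
        return str(int(node_id) + offset)

    def remap(value):
        # A link is a two-item list whose first item names an existing node.
        if (
            isinstance(value, list)
            and len(value) == 2
            and isinstance(value[0], str)
            and value[0] in workflow
            and isinstance(value[1], int)
        ):
            return [shift(value[0]), value[1]]
        return value  # plain widget value: leave untouched

    return {
        shift(node_id): {
            **node,
            "inputs": {k: remap(v) for k, v in node.get("inputs", {}).items()},
        }
        for node_id, node in workflow.items()
    }
```

Links have to be rewritten together with the keys; shifting only the keys would leave every connection pointing at a node that no longer exists.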

Files

ltx-worker.py                       # the worker
workflows/ltx23_flf2v_api.json      # reference ComfyUI API JSON (3060-era GGUF baseline)
systemd/ltx-worker.service          # generic systemd user service template
.env.example                        # required environment variables
requirements.txt                    # pip deps

Run

python3 -m venv venv && source venv/bin/activate
pip install -r requirements.txt
cp .env.example .env
# fill in API_URL, VIDEOGEN_WORKER_SECRET, BLOB_READ_WRITE_TOKEN,
# COMFYUI_DIR, COMFYUI_PYTHON, WATERMARK_PATH
set -a && source .env && set +a   # export every variable so the Python process sees it
python -u ltx-worker.py

Or install as a user-level systemd service — see systemd/ltx-worker.service.

ComfyUI must either be reachable at $COMFYUI_URL (default http://localhost:8188) or be installed at $COMFYUI_DIR, in which case the worker boots it on demand. Required ComfyUI custom-node packs:

ComfyUI-GGUF — DualCLIPLoaderGGUF
ComfyUI-LTXVideo — LTXV* nodes, ManualSigmas
ComfyUI-KJNodes — LTXVImgToVideoInplaceKJ, LTX2_NAG
ComfyUI-VideoHelperSuite — VHS_VideoCombine

Notes

Polling target shape: { job: { id, stageData: { frames: string[], clipPrompts: string[], watermark: bool } } | null }. A back-compat fallback handles the older single-clip stageData shape, {imageUrl, endImageUrl, klingPrompt}.
Output Blob path: spatial-story/output/{jobId}.mp4.
Per-clip ComfyUI budget is 10 min; per-job wall-clock budget is 12 min. Both are checked between clips so a stuck job fails fast rather than hanging the queue.
The included workflows/ltx23_flf2v_api.json is the original GGUF workflow from the 3060 era — the worker builds its own FP8 workflow at runtime, but the JSON is useful as a node-graph reference when debugging in the ComfyUI UI.
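The back-compat handling of the two stageData shapes could be normalized along these lines. This is a sketch under assumptions: the function name is hypothetical, one prompt per adjacent frame pair is inferred from the per-pair clip generation, and defaulting `watermark` to false when the key is absent is a guess that the real worker may not share.

```python
from typing import Any, Dict, List, Tuple


def normalize_stage_data(stage_data: Dict[str, Any]) -> Dict[str, Any]:
    """Convert either stageData shape into frames, adjacent-pair clips, and prompts."""
    if "frames" in stage_data:
        frames: List[str] = list(stage_data["frames"])
        prompts: List[str] = list(stage_data.get("clipPrompts", []))
    else:
        # Older single-clip shape: exactly one start image and one end image.
        frames = [stage_data["imageUrl"], stage_data["endImageUrl"]]
        prompts = [stage_data.get("klingPrompt", "")]

    if not 2 <= len(frames) <= 6:
        raise ValueError(f"expected 2 to 6 frames, got {len(frames)}")

    # One clip per adjacent pair: (f0, f1), (f1, f2), ...
    clips: List[Tuple[str, str]] = list(zip(frames, frames[1:]))
    if len(prompts) != len(clips):
        raise ValueError("expected one prompt per adjacent frame pair (assumption)")

    return {
        "frames": frames,
        "clips": clips,
        "prompts": prompts,
        "watermark": bool(stage_data.get("watermark", False)),  # assumed default
    }
```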

License

MIT — see LICENSE.

Author: Gal Cohen — github.com/outlast85
