outlast85/ltx-worker

LTX-2.3 FLF2V Worker

A standalone Python worker that produces first-last-frame walkthrough videos on a local GPU (RTX 5090, 32 GB) using LTX-2.3 22B Distilled FP8 via ComfyUI, then ships the result to a remote queue.

This is the local-inference half of a hybrid pipeline: a hosted Next.js API queues jobs in Postgres, this worker polls for them, generates the video, and uploads the MP4 back. Replacing a hosted inference API with a local-GPU worker drops per-clip marginal cost to electricity.

What it does

  1. Registers with the API and polls /api/videogen/pending every 5 s (auth via x-videogen-secret header).
  2. Claims a job, downloads its source frames (2 to 6 images), and spins up ComfyUI on demand.
  3. For each adjacent pair of frames, builds a ComfyUI API workflow targeting LTX-2.3 (8 steps, CFG = 1, NAG) and submits it. Output is trimmed to exactly 5.0 s with ffmpeg.
  4. After each clip ComfyUI is restarted (image-cache flush) and node IDs are offset for the next clip (workflow-hash bust).
  5. Once every clip is done they are concatenated, the watermark PNG is overlaid for free-tier jobs, the result is PUT to Vercel Blob, and PATCH /api/videogen/{id}/complete is called (with 3-retry exponential backoff).
  6. A background heartbeat thread pings /api/videogen/heartbeat every 15 s so the API never marks the worker dead during a long job.
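Steps 3 and 5 can be sketched as two small helpers: one that turns N frames into N - 1 adjacent (first, last) pairs, and one that retries a call with exponential backoff. This is a minimal sketch, not the worker's actual code. The names adjacent_pairs and call_with_backoff, the 1 s base delay, and the reading of "3-retry" as one initial attempt plus three retries are all assumptions.

```python
import time
from typing import Callable, List, Sequence, Tuple, TypeVar

T = TypeVar("T")


def adjacent_pairs(frames: Sequence[str]) -> List[Tuple[str, str]]:
    """Return (first, last) frame pairs; N frames yield N - 1 clips."""
    if len(frames) < 2:
        raise ValueError("need at least two frames to build a clip")
    return list(zip(frames, frames[1:]))


def call_with_backoff(
    fn: Callable[[], T],
    retries: int = 3,            # assumption: retries after the first attempt
    base_delay: float = 1.0,     # assumption: the README does not state the delay
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call fn; on failure wait base_delay * 2**attempt and try again."""
    for attempt in range(retries + 1):
        try:
            return fn()
        except Exception:
            if attempt == retries:
                raise  # out of retries: surface the last error
            sleep(base_delay * 2 ** attempt)
    raise AssertionError("unreachable")
```

Passing sleep as a parameter keeps the helper testable without real waiting; a hypothetical completion call would be wrapped as call_with_backoff(lambda: patch_complete(job_id)).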

Why these particular settings

LTX-2.3 22B Distilled FP8, not Dev FP8

The Dev FP8 checkpoint (ltx-2.3-22b-dev-fp8.safetensors) renders fine on 4090-class hardware but ships visible "screen-door" artifacts on Blackwell (5090). The bug is tracked upstream as ComfyUI-LTXVideo#379. Distilled FP8 does not have this regression and is also faster (8 steps vs 30).

CFG = 1.0, 8 steps, euler_ancestral_cfg_pp

Distilled checkpoints are trained without classifier-free guidance, so any CFG above 1.0 visibly degrades them — colors flatten and motion stiffens. The 8-step sigma schedule

1.0, 0.99375, 0.9875, 0.98125, 0.975, 0.909375, 0.725, 0.421875, 0.0

comes from Lightricks' reference code and is hard-coded into ManualSigmas; deviating from these exact values is the most reliable way to wreck output quality. euler_ancestral_cfg_pp is the matched sampler — euler alone works, but the ancestral variant gives slightly cleaner temporal motion at the same step count.
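Because the schedule must match exactly, it is worth guarding against accidental edits. The sketch below stores the values quoted above and checks the properties an 8-step schedule should have: 9 entries, strictly decreasing, ending at 0. The helper names are hypothetical, and formatting the values as comma-separated text for ManualSigmas is an assumption about how the node takes its input.

```python
from typing import List, Sequence

STEPS = 8
# Values copied verbatim from the schedule quoted above.
SIGMAS: List[float] = [
    1.0, 0.99375, 0.9875, 0.98125, 0.975, 0.909375, 0.725, 0.421875, 0.0,
]


def check_sigmas(sigmas: Sequence[float], steps: int) -> None:
    """Raise ValueError if the schedule is malformed for `steps` steps."""
    if len(sigmas) != steps + 1:
        raise ValueError(f"expected {steps + 1} sigmas, got {len(sigmas)}")
    if sigmas[-1] != 0.0:
        raise ValueError("schedule must end at 0.0")
    if any(b >= a for a, b in zip(sigmas, sigmas[1:])):
        raise ValueError("sigmas must be strictly decreasing")


def sigmas_as_text(sigmas: Sequence[float]) -> str:
    """Comma-separated text form (assumed input format for ManualSigmas)."""
    return ", ".join(repr(s) for s in sigmas)


check_sigmas(SIGMAS, STEPS)
```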

GGUF Gemma 3, not FP8 Gemma

The text encoder ships in two formats: gemma_3_12B_it_fp8_scaled.safetensors and gemma-3-12b-it-qat-Q4_K_M.gguf. The FP8 variant uses a kernel path that segfaults on Blackwell SMs (cu130 / driver 590). GGUF Q4 is loaded through DualCLIPLoaderGGUF and bypasses the broken kernel entirely while giving up no meaningful prompt-fidelity at the resolutions we render.

NAG over STG/APG

NAG (Normalized Attention Guidance) modifies cross-attention in place, costing one forward pass. STG would also work but needs three passes (positive + negative + perturbed), which roughly triples per-step latency. With the distilled model at 8 steps already producing acceptable output, the additional cost is not worth it. Tuned values (scale 3.0, tau 2.0, alpha 0.2) are well below the KJNodes defaults; at 1080p the higher defaults over-sharpen without any perceptual gain in detail.
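As a rough cost model that counts forward passes only (ignoring batching and fixed overhead), the per-step pass counts quoted above work out as follows. The function name is hypothetical.

```python
def forward_passes(steps: int, passes_per_step: int) -> int:
    """Total model forward passes for one clip under a simple cost model."""
    return steps * passes_per_step


STEPS = 8
nag_total = forward_passes(STEPS, passes_per_step=1)  # NAG: one pass per step
stg_total = forward_passes(STEPS, passes_per_step=3)  # STG: three passes per step
ratio = stg_total / nag_total
```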

LTXVImgToVideoInplaceKJ, not LTXVAddGuide

LTXVAddGuide is the documented FLF2V primitive: pin frame 0 to one image, frame -1 to another, generate the middle. In practice it produces a "bounceback" — the last ~2 s of the clip drifts back toward the first frame because the negative conditioning leaks across the time axis. The KJNodes "Inplace" variant writes the encoded frames directly into the latent at fixed time positions, which is rigid enough to prevent the artifact. The two-image input is wired through dotted names (image.1, image.2, strength.1, …) — KJNodes-specific calling convention when num_images > 1.
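The dotted-name convention can be generated rather than typed by hand. The sketch below builds 1-based keys of the form prefix.N for the two prefixes the README names (image and strength). Which other dotted fields the node expects is not stated here, so none are added. The helper names are hypothetical, and the ["node_id", output_index] link form is the ComfyUI API workflow convention for wiring one node's output into another's input.

```python
from typing import Any, Dict, List, Sequence


def dotted_inputs(prefix: str, values: Sequence[Any]) -> Dict[str, Any]:
    """Map values to 1-based dotted keys: prefix.1, prefix.2, ..."""
    return {f"{prefix}.{i}": v for i, v in enumerate(values, start=1)}


def inplace_inputs(
    image_links: Sequence[List[Any]],   # e.g. [["10", 0], ["11", 0]]
    strengths: Sequence[float],
) -> Dict[str, Any]:
    """Inputs fragment for the Inplace node (illustrative subset only)."""
    if len(image_links) != len(strengths):
        raise ValueError("one strength is needed per image")
    inputs: Dict[str, Any] = {"num_images": len(image_links)}
    inputs.update(dotted_inputs("image", image_links))
    inputs.update(dotted_inputs("strength", strengths))
    return inputs
```

For a first/last pair loaded by hypothetical nodes "10" and "11", inplace_inputs([["10", 0], ["11", 0]], [1.0, 1.0]) yields the keys num_images, image.1, image.2, strength.1 and strength.2.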

Watermark as a hard error

For the free tier the watermark PNG is part of the product contract. The worker checks for it before running inference and refuses the job if it's missing, rather than discovering 4 minutes in that it can't legally publish the output. A previous version silently passed through unwatermarked video when the file was absent on a fresh machine.
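A pre-flight check of this kind might look like the sketch below. The exception class and function name are hypothetical. The point is only that the check runs before any GPU work and raises instead of falling through.

```python
from pathlib import Path


class MissingWatermarkError(RuntimeError):
    """Raised when a job needs a watermark but the PNG is not on disk."""


def require_watermark(needs_watermark: bool, watermark_path: str) -> None:
    """Fail before inference if a required watermark file is absent."""
    if not needs_watermark:
        return  # paid-tier job: nothing to verify
    if not Path(watermark_path).is_file():
        raise MissingWatermarkError(
            f"watermark required but not found at {watermark_path!r}"
        )
```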

Cache-bust between clips

Two independent caches will return stale results if you don't fight them:

  • LoadImage cache: ComfyUI hashes images by filename. We upload each pair with a unique uuid-suffixed name and restart ComfyUI between clips.
  • Workflow-result cache: the prompt endpoint dedupes identical workflows. Adding a node_offset = clip_index * 1000 to every node ID produces a fresh workflow hash even when the only difference is which two images get loaded.

Running both defences is overkill for either failure mode on its own, but it is an appropriate belt-and-braces measure when a single bad clip ruins the whole stitched output.
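Both cache-busts are small transformations. The sketch below assumes an API-format workflow shaped as {node_id: {"class_type": ..., "inputs": {...}}} with numeric string IDs, where a link to another node is a two-element list [node_id, output_index]. The function names are hypothetical. Note that the links must be shifted together with the keys, otherwise the renumbered graph points at nodes that no longer exist.

```python
import uuid
from pathlib import PurePosixPath
from typing import Any, Dict, Set


def unique_name(filename: str) -> str:
    """Append a uuid suffix so the image-by-filename cache never hits."""
    p = PurePosixPath(filename)
    return f"{p.stem}_{uuid.uuid4().hex}{p.suffix}"


def _is_link(value: Any, node_ids: Set[str]) -> bool:
    """True for [node_id, output_index] references to a node in the graph."""
    return (
        isinstance(value, list)
        and len(value) == 2
        and isinstance(value[0], str)
        and value[0] in node_ids
        and isinstance(value[1], int)
    )


def offset_workflow(
    workflow: Dict[str, Dict[str, Any]], clip_index: int, stride: int = 1000
) -> Dict[str, Dict[str, Any]]:
    """Return a copy with every node ID and link shifted by clip_index * stride."""
    offset = clip_index * stride
    ids = set(workflow)

    def shift(node_id: str) -> str:
        return str(int(node_id) + offset)

    shifted: Dict[str, Dict[str, Any]] = {}
    for node_id, node in workflow.items():
        inputs = {
            key: [shift(val[0]), val[1]] if _is_link(val, ids) else val
            for key, val in node["inputs"].items()
        }
        shifted[shift(node_id)] = {**node, "inputs": inputs}
    return shifted
```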

Files

ltx-worker.py                       # the worker
workflows/ltx23_flf2v_api.json      # reference ComfyUI API JSON (3060-era GGUF baseline)
systemd/ltx-worker.service          # generic systemd user service template
.env.example                        # required environment variables
requirements.txt                    # pip deps

Run

python3 -m venv venv && source venv/bin/activate
pip install -r requirements.txt
cp .env.example .env
# fill in API_URL, VIDEOGEN_WORKER_SECRET, BLOB_READ_WRITE_TOKEN,
# COMFYUI_DIR, COMFYUI_PYTHON, WATERMARK_PATH
set -a; source .env; set +a   # export the variables so the Python process inherits them
python -u ltx-worker.py

Or install as a user-level systemd service — see systemd/ltx-worker.service.
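A startup check for the variables listed above can catch a half-filled .env before the worker starts polling. This is a sketch only: the function name is hypothetical, and treating an empty string as missing is an assumption.

```python
import os
from typing import List, Mapping, Optional

# Names taken from the setup comment above.
REQUIRED_ENV = (
    "API_URL",
    "VIDEOGEN_WORKER_SECRET",
    "BLOB_READ_WRITE_TOKEN",
    "COMFYUI_DIR",
    "COMFYUI_PYTHON",
    "WATERMARK_PATH",
)


def missing_env(environ: Optional[Mapping[str, str]] = None) -> List[str]:
    """Return the required variable names that are unset or empty."""
    env = os.environ if environ is None else environ
    return [name for name in REQUIRED_ENV if not env.get(name)]
```

Calling missing_env() at startup and exiting with the returned names gives a clearer failure than an error in the middle of the first job.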

ComfyUI must be reachable on $COMFYUI_URL (default http://localhost:8188) or installable at $COMFYUI_DIR; the worker boots it on demand. Required ComfyUI custom-node packs:

  • ComfyUI-GGUF: DualCLIPLoaderGGUF
  • ComfyUI-LTXVideo: LTXV* nodes, ManualSigmas
  • ComfyUI-KJNodes: LTXVImgToVideoInplaceKJ, LTX2_NAG
  • ComfyUI-VideoHelperSuite: VHS_VideoCombine

Notes

  • Polling target shape: { job: { id, stageData: { frames: string[], clipPrompts: string[], watermark: bool } } | null }. A back-compat fallback handles the older {imageUrl, endImageUrl, klingPrompt} single-clip stageData shape.
  • Output Blob path: spatial-story/output/{jobId}.mp4.
  • Per-clip ComfyUI budget is 10 min; per-job wall-clock budget is 12 min. Both are checked between clips so a stuck job fails fast rather than hanging the queue.
  • The included workflows/ltx23_flf2v_api.json is the original GGUF workflow from the 3060 era — the worker builds its own FP8 workflow at runtime, but the JSON is useful as a node-graph reference when debugging in the ComfyUI UI.
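The two stageData shapes from the first note can be normalised into one structure before any clip is built. The sketch below is illustrative only. The function name is hypothetical, and two points are assumptions the README does not confirm: that clipPrompts carries one prompt per clip (frames minus one), and that a legacy payload without a watermark field means no watermark.

```python
from typing import Any, Dict, List, Mapping


def normalize_stage_data(stage: Mapping[str, Any]) -> Dict[str, Any]:
    """Return {'frames', 'prompts', 'watermark'} for either payload shape."""
    if "frames" in stage:
        # Current shape: frames[], clipPrompts[], watermark.
        frames: List[str] = list(stage["frames"])
        prompts: List[str] = list(stage.get("clipPrompts", []))
    else:
        # Legacy single-clip shape: imageUrl, endImageUrl, klingPrompt.
        frames = [stage["imageUrl"], stage["endImageUrl"]]
        prompts = [stage["klingPrompt"]]
    watermark = bool(stage.get("watermark", False))  # assumption: default off

    if not 2 <= len(frames) <= 6:
        raise ValueError(f"expected 2 to 6 frames, got {len(frames)}")
    if len(prompts) != len(frames) - 1:  # assumption: one prompt per clip
        raise ValueError("expected one prompt per adjacent frame pair")
    return {"frames": frames, "prompts": prompts, "watermark": watermark}
```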

License

MIT — see LICENSE.

Author: Gal Cohen — github.com/outlast85

About

LTX-2.3 22B distilled FP8 first-last-frame video worker — ComfyUI backend, FLF2V rigid frame injection, Blackwell-tuned
