A2v sound encoder inference by lfengad · Pull Request #39 · NVIDIA/cosmos-framework

lfengad · 2026-06-12T10:37:45Z

No description provided.

Conditions video generation on a real input audio clip + input image using the nano_diffusers_sound_encoder checkpoint. Reuses the existing ts2v sound condition plan plus preserved image first-frame conditioning. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

…ner) Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

Use ResolvedFilePathOrUrl (matching vision_path) for the SoundDataOverrides sound_path field; fix unit tests to use a URL / real tmp_path files. Add missing defaults/audio_image2video/sample_args.json (copied from image2video with enable_sound=true) required by build_sample(). Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

Reads an arbitrary audio file via soundfile, resamples with scipy.signal.resample_poly (avoiding torchaudio, which is absent from the inference container), conforms the channel count, and trims or pads to an exact sample count. Covered by two new pytest tests. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
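The loading steps named in this commit message can be sketched roughly as below. This is a minimal illustration, not the PR's `load_conditioning_audio`: the function names, the `(samples, channels)` array layout, and the channel policy (average to mono, then repeat to the target count) are assumptions made for the example.

```python
from math import gcd

import numpy as np
from scipy.signal import resample_poly


def conform_audio(audio, src_rate, dst_rate, num_channels, num_samples):
    """Resample, conform channels, and trim/pad a (samples, channels) array.

    Hypothetical helper: the name, argument order, and channel policy are
    assumptions for illustration, not the PR's actual implementation.
    """
    audio = np.asarray(audio, dtype=np.float32)
    if audio.ndim == 1:  # treat a 1-D signal as mono
        audio = audio[:, None]

    # Polyphase resampling with the rate ratio reduced to lowest terms.
    if src_rate != dst_rate:
        g = gcd(src_rate, dst_rate)
        audio = resample_poly(audio, dst_rate // g, src_rate // g, axis=0)

    # Conform channel count: average down to mono, then repeat as needed.
    if audio.shape[1] != num_channels:
        mono = audio.mean(axis=1, keepdims=True)
        audio = np.repeat(mono, num_channels, axis=1)

    # Trim or zero-pad to the exact sample count.
    if audio.shape[0] >= num_samples:
        audio = audio[:num_samples]
    else:
        pad = num_samples - audio.shape[0]
        audio = np.pad(audio, ((0, pad), (0, 0)))
    return audio.astype(np.float32)


def load_audio_file(path, dst_rate, num_channels, num_samples):
    """Read any file soundfile can decode, then conform it."""
    import soundfile as sf  # imported lazily; only needed for file input

    audio, src_rate = sf.read(path, dtype="float32", always_2d=True)
    return conform_audio(audio, src_rate, dst_rate, num_channels, num_samples)
```

Keeping the array logic separate from the file read makes the resample and trim/pad behavior testable without audio fixtures on disk.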

Add keyword-only parameter `condition_sound: bool = False` that selects mode "ts2v" (all sound conditioned, video generated) when True, preserving the existing "t2vs" default (joint generation). Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
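The mode switch described here amounts to something like the sketch below. Only the keyword-only `condition_sound` parameter and the two mode strings come from the commit message; the function name `select_sound_mode` is a hypothetical stand-in for wherever the PR actually adds the parameter.

```python
def select_sound_mode(*, condition_sound: bool = False) -> str:
    """Return the sound mode name for a sampling call.

    Hypothetical stand-in for the PR's function; only the keyword-only
    parameter and the two mode strings come from the commit message.
    """
    # "ts2v": sound is fully conditioned and video is generated.
    # "t2vs": video and sound are generated jointly (the existing default).
    return "ts2v" if condition_sound else "t2vs"
```

Because the parameter is keyword-only and defaults to `False`, existing callers keep the "t2vs" behavior without any change at the call site.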

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

The "AVAE" checkpoint registration pointed at nvidia/Cosmos3-Nano/sound_tokenizer, which is decoder-only (182 decoder keys, 0 encoder keys). With strict=False the 67 encoder params silently stayed at random init, so encode(input_audio) produced noise-dominated latents (encode-twice absdiff 0.79; decode(encode(sine)) corr 0.002) and audio_image2video output bore no relation to the conditioning clip. Sound generation (t2vs) was unaffected because it only decodes diffusion latents.

Point the AVAE source at nvidia/Cosmos3-Experimental/nano_diffusers_sound_encoder/sound_tokenizer, which ships the full AVAE: diffusers OobleckDecoder (decoder.block.*) plus the native SpecConvNeXt encoder (encoder.layers.*). _materialize_avae_ckpt already remaps decoder keys and passes the native encoder keys through unchanged.

Verified: AVAE round-trip sine corr 1.000, real-audio envelope corr 0.998 (25/25 taps), encode determinism absdiff 0.009; end-to-end A2V output audio envelope corr 0.940 vs the conditioning clip.

Also update the i2vs a2v example (hammer image + metallic-tapping audio).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
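A non-strict load that leaves parameters at random init is easy to miss, so a key audit before loading is one way to surface it. The sketch below is an assumption-laden illustration and not code from the PR: `audit_checkpoint_keys`, `require_encoder`, and the example key names are hypothetical, and only the counts (182 decoder keys, 67 encoder params) come from the commit message.

```python
def audit_checkpoint_keys(expected_keys, checkpoint_keys,
                          prefixes=("encoder.", "decoder.")):
    """Count expected parameters per prefix that the checkpoint lacks."""
    checkpoint_keys = set(checkpoint_keys)
    report = {}
    for prefix in prefixes:
        wanted = [k for k in expected_keys if k.startswith(prefix)]
        missing = [k for k in wanted if k not in checkpoint_keys]
        report[prefix] = {"expected": len(wanted), "missing": len(missing)}
    return report


def require_encoder(report):
    """Raise if any encoder parameter would stay at its random init."""
    missing = report["encoder."]["missing"]
    if missing:
        raise RuntimeError(f"{missing} encoder params missing from checkpoint")


# Hypothetical key names; the counts mirror the decoder-only checkpoint.
encoder_keys = [f"encoder.layers.{i}.weight" for i in range(67)]
decoder_keys = [f"decoder.layers.{i}.weight" for i in range(182)]
report = audit_checkpoint_keys(encoder_keys + decoder_keys, decoder_keys)
```

For the decoder-only case, `report` shows all 67 encoder parameters missing and no decoder parameters missing, and `require_encoder(report)` raises instead of letting encoding proceed with random weights.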

Revert checkpoint/doc references from nvidia/Cosmos3-Experimental (nano_diffusers_sound_encoder) back to the standard Cosmos3-Nano: drop the Cosmos3-Nano-SoundEncoder registry entry, point the AVAE sound_tokenizer source at nvidia/Cosmos3-Nano/sound_tokenizer, and drop the SoundEncoder rows from docs. All A2V logic is unchanged (audio_image2video mode, sound_path input, load_conditioning_audio, condition_sound wiring, encoder-key passthrough in _materialize_avae_ckpt). The published Cosmos3-Nano sound_tokenizer will be updated to ship the full AVAE (encoder + decoder), at which point A2V works on the default checkpoint with no further changes. Also remove the brainstorming design spec. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

…documented Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

The native encoder.layers.* keys load when present (enabling audio_image2video); the current Cosmos3-Nano sound_tokenizer ships only the decoder, so A2V produces faithful audio only once the checkpoint is updated to include the encoder. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

…d_tokenizer/

By default, use the sound_tokenizer/ co-located in the main model checkpoint (matched to the transformer; uses whatever encoder/decoder that checkpoint ships) instead of the global "AVAE" registry repo. Mirrors vlm_processor_from_checkpoint: after download_checkpoint(), materialize the legacy AVAE .ckpt from the bundled sound_tokenizer/ and point avae_path at it (download_checkpoint_v2 short-circuits local paths). Falls back to the registry when the checkpoint bundles no sound_tokenizer/. So Cosmos3-Super uses Super's sound_tokenizer, and Nano uses Nano's.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

The sound_tokenizer node accepts an inference-only 'from_checkpoint' key (default True): True sources the AVAE from the loaded checkpoint's bundled sound_tokenizer/; False keeps the configured avae_path (the registered "AVAE" repo) even when the checkpoint bundles one. The key is popped before AVAEInterface instantiation. Set it in the inference model YAML to override. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
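Popping an inference-only key before instantiation might look like the sketch below. The helper name `split_inference_keys` and the plain-dict config are assumptions for illustration; the default of `True` mirrors the one stated in this commit message.

```python
def split_inference_keys(sound_tokenizer_cfg, default_from_checkpoint=True):
    """Separate the inference-only key from the constructor kwargs.

    Hypothetical helper; the default mirrors the one stated in this commit.
    """
    kwargs = dict(sound_tokenizer_cfg)  # copy so the caller's config is untouched
    from_checkpoint = bool(kwargs.pop("from_checkpoint", default_from_checkpoint))
    return from_checkpoint, kwargs
```

The returned `kwargs` no longer contain `from_checkpoint`, so they can be passed to a constructor that does not know about the key.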

Default precedence: use the registered "AVAE" repo (configured avae_path) when one is registered; fall back to the loaded checkpoint's bundled sound_tokenizer/ only when no AVAE is registered. The inference-only sound_tokenizer.from_checkpoint key (default False) forces the bundled one even when an AVAE is registered. Backward-compatible: with the "AVAE" entry registered (default), all checkpoints use the registry AVAE exactly as before; bundled is opt-in (from_checkpoint:true) or used when the registration is removed. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

Use the configured avae_path when set (registered AVAE); fall back to the checkpoint's bundled sound_tokenizer/ only when avae_path is empty, or when from_checkpoint:true forces it. Drops the s3-uri/registry lookup. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
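The precedence in this commit message can be written as a small pure function. This is a sketch under stated assumptions, not the PR's code: `resolve_avae_source` is a hypothetical name, `bundled_dir` is assumed to be `None` when the checkpoint bundles no sound_tokenizer/, and falling back to `avae_path` when `from_checkpoint` is set but nothing is bundled is a guess about an unspecified case.

```python
from typing import Optional


def resolve_avae_source(avae_path: Optional[str],
                        bundled_dir: Optional[str],
                        from_checkpoint: bool = False) -> Optional[str]:
    """Pick the AVAE source following the precedence in the commit message.

    Hypothetical helper; bundled_dir is None when nothing is bundled.
    """
    # from_checkpoint forces the bundled sound_tokenizer/ when one exists.
    if from_checkpoint and bundled_dir:
        return bundled_dir
    # Otherwise a configured avae_path wins.
    if avae_path:
        return avae_path
    # Empty avae_path: fall back to the bundled copy (may be None).
    return bundled_dir
```

With the default `from_checkpoint=False` and a non-empty `avae_path`, the configured path is returned regardless of what the checkpoint bundles, which matches the backward-compatible behavior the commits describe.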

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

foreverlms

LGTM

lfengad and others added 25 commits June 11, 2026 22:30

Add implementation plan for audio_image2video (A2V) inference

10ff2bd

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

Plan: resample via scipy not torchaudio (absent from inference contai…

2426292

…ner) Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

Register Cosmos3-Nano-SoundEncoder checkpoint

74679dd

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

Add audio_image2video model mode

c2e1813

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

Add sound_path input + audio_image2video arg validation

162a8c2

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

Load and condition on real input audio in get_sample_data

fb94018

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

Add audio_image2video (a2v) example input and conditioning audio

0ffc6c8

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

Document audio_image2video mode and sound-encoder checkpoint

753b715

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

Remove A2V implementation plan doc

0bee46d

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

Remove A2V example input (a2v.json + conditioning audio asset)

f3bfb20

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

Drop dead a2v.json links from inference docs, keep audio_image2video …

3165a99

…documented Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

Simplify AVAE materialize comments

b62e992

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

Simplify comments in A2V sound inference

634c83e

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

Revert docs/inference.md changes on this branch

3fb1fbe

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

lfengad force-pushed the a2v-sound-encoder-inference branch from f40d9e9 to 3fb1fbe on June 12, 2026 10:44

Merge branch 'main' into a2v-sound-encoder-inference

664979c

foreverlms approved these changes Jun 12, 2026

Dinghow approved these changes Jun 12, 2026

lfengad wants to merge 26 commits into main from a2v-sound-encoder-inference
