A2v sound encoder inference #39

Open
lfengad wants to merge 26 commits into main from a2v-sound-encoder-inference

Conversation

@lfengad lfengad commented Jun 12, 2026

Collaborator

No description provided.

lfengad and others added 25 commits June 11, 2026 22:30
Conditions video generation on a real input audio clip + input image using
the nano_diffusers_sound_encoder checkpoint. Reuses the existing ts2v sound
condition plan plus preserved image first-frame conditioning.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…ner)

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Use ResolvedFilePathOrUrl (matching vision_path) for the SoundDataOverrides
sound_path field; fix unit tests to use a URL / real tmp_path files.
Add missing defaults/audio_image2video/sample_args.json (copied from
image2video with enable_sound=true) required by build_sample().

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Reads an arbitrary audio file via soundfile, resamples with
scipy.signal.resample_poly (avoiding torchaudio, which is absent
from the inference container), conforms the channel count, and
trims or pads to an exact sample count. Covered by two new pytest tests.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
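
For reviewers, a minimal sketch of the read / resample / conform / trim-or-pad steps described in the commit above. This is not the PR's load_conditioning_audio: the names `conform_audio` and `read_and_conform`, the (samples, channels) array layout, and the downmix-to-mono rule for mismatched channel counts are assumptions made for illustration.

```python
from math import gcd

import numpy as np
from scipy.signal import resample_poly


def conform_audio(wave, src_rate, dst_rate, channels, num_samples):
    """Resample, fix the channel count, then trim or zero-pad.

    `wave` is a float array shaped (samples,) or (samples, channels_in).
    Returns float32 shaped (num_samples, channels).
    """
    wave = np.asarray(wave, dtype=np.float32)
    if wave.ndim == 1:
        wave = wave[:, None]  # treat 1-D input as mono

    # Polyphase resampling with the smallest integer up/down ratio.
    if src_rate != dst_rate:
        g = gcd(src_rate, dst_rate)
        wave = resample_poly(wave, dst_rate // g, src_rate // g, axis=0)

    # Assumed rule: on a channel mismatch, downmix to mono and replicate.
    if wave.shape[1] != channels:
        mono = wave.mean(axis=1, keepdims=True)
        wave = np.repeat(mono, channels, axis=1)

    # Trim to the exact length, then zero-pad the tail if the clip is short.
    wave = wave[:num_samples]
    if wave.shape[0] < num_samples:
        wave = np.pad(wave, ((0, num_samples - wave.shape[0]), (0, 0)))
    return wave.astype(np.float32)


def read_and_conform(path, dst_rate, channels, num_samples):
    """Read a file with soundfile, then conform it (hypothetical helper)."""
    import soundfile as sf  # imported lazily so conform_audio works without it

    wave, src_rate = sf.read(path, dtype="float32", always_2d=True)
    return conform_audio(wave, src_rate, dst_rate, channels, num_samples)
```

Keeping the array logic separate from the file read makes the resample and padding behaviour testable without audio fixtures on disk.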
Add keyword-only parameter `condition_sound: bool = False` that selects
mode "ts2v" (all sound conditioned, video generated) when True, preserving
the existing "t2vs" default (joint generation).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
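
The mode selection described in the commit above can be sketched as follows. The function name `select_sound_mode` is hypothetical; only the keyword-only `condition_sound` flag, its default, and the two mode strings come from the commit message.

```python
def select_sound_mode(*, condition_sound: bool = False) -> str:
    """Pick the sound condition plan.

    "ts2v": the sound is fully conditioned and only video is generated.
    "t2vs": video and sound are generated jointly (the existing default).
    """
    return "ts2v" if condition_sound else "t2vs"
```

Because the parameter is keyword-only, existing positional call sites cannot flip the mode by accident.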
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
The "AVAE" checkpoint registration pointed at nvidia/Cosmos3-Nano/sound_tokenizer,
which is decoder-only (182 decoder keys, 0 encoder keys). With strict=False the
67 encoder params silently stayed at random init, so encode(input_audio) produced
noise-dominated latents (encode-twice absdiff 0.79; decode(encode(sine)) corr 0.002)
and audio_image2video output bore no relation to the conditioning clip. Sound
generation (t2vs) was unaffected because it only decodes diffusion latents.
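
One way to surface this failure mode early is to compare the model's expected parameter names against the checkpoint keys before a non-strict load. The sketch below is illustrative only; `missing_by_module` and `require_module` are hypothetical helpers, not functions from this PR.

```python
from collections import Counter
from typing import Iterable


def missing_by_module(expected: Iterable[str], loaded: Iterable[str]) -> Counter:
    """Count expected parameter names absent from the checkpoint,
    grouped by their top-level module ("encoder", "decoder", ...)."""
    missing = set(expected) - set(loaded)
    return Counter(name.split(".", 1)[0] for name in missing)


def require_module(expected: Iterable[str], loaded: Iterable[str], module: str) -> None:
    """Raise instead of silently keeping randomly initialised weights."""
    count = missing_by_module(expected, loaded)[module]
    if count:
        raise ValueError(
            f"{count} '{module}' parameters are absent from the checkpoint"
        )
```

With a decoder-only checkpoint, `require_module(expected, loaded, "encoder")` would raise, while the same call for `"decoder"` would pass.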

Point the AVAE source at nvidia/Cosmos3-Experimental/nano_diffusers_sound_encoder/
sound_tokenizer, which ships the full AVAE: diffusers OobleckDecoder (decoder.block.*)
plus the native SpecConvNeXt encoder (encoder.layers.*). _materialize_avae_ckpt
already remaps decoder keys and passes the native encoder keys through unchanged.
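
As a rough illustration of "remap some keys, pass the rest through", here is a prefix-based remapper. It is not _materialize_avae_ckpt itself, and the target prefix used in the usage note is a placeholder, since the PR text does not state the legacy decoder key names.

```python
from typing import Any, Dict, Mapping


def remap_state_dict_keys(
    state: Mapping[str, Any], prefix_map: Mapping[str, str]
) -> Dict[str, Any]:
    """Rewrite keys whose prefix appears in `prefix_map`.

    Keys that match no prefix (for example encoder.layers.*) are copied
    through unchanged.
    """
    out: Dict[str, Any] = {}
    for key, value in state.items():
        for old, new in prefix_map.items():
            if key.startswith(old):
                key = new + key[len(old):]
                break  # apply at most one rewrite per key
        out[key] = value
    return out
```

For example, `remap_state_dict_keys(state, {"decoder.block.": "decoder.legacy."})` would rename only the decoder entries; `"decoder.legacy."` is a made-up target used purely to show the call shape.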

Verified: AVAE round-trip sine corr 1.000, real-audio envelope corr 0.998 (25/25
taps), encode determinism absdiff 0.009; end-to-end A2V output audio envelope corr
0.940 vs the conditioning clip. Also update the i2vs a2v example (hammer image +
metallic-tapping audio).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
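
The commit above reports correlation figures but does not say how they were computed. The sketch below shows one plausible way to get a waveform correlation and a frame-level envelope correlation; the helper names, the RMS envelope, and the frame size are assumptions, not the author's procedure.

```python
import numpy as np


def pearson_corr(a, b) -> float:
    """Pearson correlation of two 1-D signals, truncated to equal length."""
    a = np.asarray(a, dtype=np.float64)
    b = np.asarray(b, dtype=np.float64)
    n = min(len(a), len(b))
    return float(np.corrcoef(a[:n], b[:n])[0, 1])


def rms_envelope(x, frame: int = 1024) -> np.ndarray:
    """Per-frame RMS energy; trailing samples that do not fill a frame are dropped."""
    x = np.asarray(x, dtype=np.float64)
    n = len(x) // frame * frame
    frames = x[:n].reshape(-1, frame)
    return np.sqrt((frames ** 2).mean(axis=1))


def envelope_corr(a, b, frame: int = 1024) -> float:
    """Correlation between the RMS envelopes of two signals."""
    return pearson_corr(rms_envelope(a, frame), rms_envelope(b, frame))
```

Envelope correlation tolerates phase differences between the conditioning clip and the generated audio, which a sample-level correlation does not.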
Revert checkpoint/doc references from nvidia/Cosmos3-Experimental
(nano_diffusers_sound_encoder) back to the standard Cosmos3-Nano: drop the
Cosmos3-Nano-SoundEncoder registry entry, point the AVAE sound_tokenizer source
at nvidia/Cosmos3-Nano/sound_tokenizer, and drop the SoundEncoder rows from
docs. All A2V logic is unchanged (audio_image2video mode, sound_path input,
load_conditioning_audio, condition_sound wiring, encoder-key passthrough in
_materialize_avae_ckpt). The published Cosmos3-Nano sound_tokenizer will be
updated to ship the full AVAE (encoder + decoder), at which point A2V works on
the default checkpoint with no further changes.

Also remove the brainstorming design spec.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…documented

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
The native encoder.layers.* keys load when present (enabling audio_image2video);
the current Cosmos3-Nano sound_tokenizer ships only the decoder, so A2V produces
faithful audio only once the checkpoint is updated to include the encoder.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…d_tokenizer/

By default, use the sound_tokenizer/ co-located in the main model checkpoint
(matched to the transformer; uses whatever encoder/decoder that checkpoint ships)
instead of the global "AVAE" registry repo. Mirrors vlm_processor_from_checkpoint:
after download_checkpoint(), materialize the legacy AVAE .ckpt from the bundled
sound_tokenizer/ and point avae_path at it (download_checkpoint_v2 short-circuits
local paths). Falls back to the registry when the checkpoint bundles no
sound_tokenizer/. As a result, Cosmos3-Super uses Super's sound_tokenizer and Nano uses Nano's.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
The sound_tokenizer node accepts an inference-only 'from_checkpoint' key
(default True): True sources the AVAE from the loaded checkpoint's bundled
sound_tokenizer/; False keeps the configured avae_path (the registered "AVAE"
repo) even when the checkpoint bundles one. The key is popped before
AVAEInterface instantiation. Set it in the inference model YAML to override.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
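
The "pop before instantiation" step from the commit above can be sketched like this. `split_from_checkpoint` is a hypothetical helper and the config is modelled as a plain mapping; the real config object and the AVAEInterface constructor arguments are not shown in this PR text.

```python
from typing import Any, Dict, Mapping, Tuple


def split_from_checkpoint(
    node_cfg: Mapping[str, Any], default: bool = True
) -> Tuple[bool, Dict[str, Any]]:
    """Separate the inference-only flag from the constructor kwargs.

    Returns (from_checkpoint, kwargs) where kwargs no longer contains the
    'from_checkpoint' key, so it can be forwarded to the tokenizer class.
    """
    kwargs = dict(node_cfg)  # copy so the caller's config is left untouched
    from_checkpoint = bool(kwargs.pop("from_checkpoint", default))
    return from_checkpoint, kwargs
```

The `default` argument mirrors the default stated in that commit (True); a later commit in this PR changes the default to False.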
Default precedence: use the registered "AVAE" repo (configured avae_path) when
one is registered; fall back to the loaded checkpoint's bundled sound_tokenizer/
only when no AVAE is registered. The inference-only sound_tokenizer.from_checkpoint
key (default False) forces the bundled one even when an AVAE is registered.

Backward-compatible: with the "AVAE" entry registered (default), all checkpoints
use the registry AVAE exactly as before; bundled is opt-in (from_checkpoint:true)
or used when the registration is removed.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Use the configured avae_path when set (registered AVAE); fall back to the
checkpoint's bundled sound_tokenizer/ only when avae_path is empty, or when
from_checkpoint:true forces it. Drops the s3-uri/registry lookup.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
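
The final precedence rule from the commit above, written out as a small pure function. The name `resolve_avae_source` and its signature are hypothetical, and the behaviour when from_checkpoint is true but no bundled sound_tokenizer/ exists (fall through to avae_path) is an assumption the commit message does not cover.

```python
from typing import Optional


def resolve_avae_source(
    avae_path: Optional[str],
    bundled_dir: Optional[str],
    from_checkpoint: bool = False,
) -> str:
    """Choose where the AVAE weights come from.

    avae_path:    the configured path; empty or None when no AVAE is set.
    bundled_dir:  the checkpoint's sound_tokenizer/ directory, or None when
                  the checkpoint bundles none.
    """
    # from_checkpoint forces the bundled tokenizer when one exists.
    if from_checkpoint and bundled_dir:
        return bundled_dir
    # Otherwise the configured avae_path wins whenever it is set.
    if avae_path:
        return avae_path
    # Fall back to the bundled tokenizer only when avae_path is empty.
    if bundled_dir:
        return bundled_dir
    raise FileNotFoundError("no avae_path configured and no bundled sound_tokenizer/")
```

Keeping the decision free of filesystem and download calls makes each branch of the precedence easy to unit-test.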
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
@lfengad lfengad force-pushed the a2v-sound-encoder-inference branch from f40d9e9 to 3fb1fbe Compare June 12, 2026 10:44

@foreverlms foreverlms left a comment

Collaborator

LGTM

3 participants