A2v sound encoder inference #39
Open
lfengad wants to merge 26 commits into
Conditions video generation on a real input audio clip + input image using the nano_diffusers_sound_encoder checkpoint. Reuses the existing ts2v sound condition plan plus preserved image first-frame conditioning. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…ner) Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Use ResolvedFilePathOrUrl (matching vision_path) for the SoundDataOverrides sound_path field; fix unit tests to use a URL / real tmp_path files. Add missing defaults/audio_image2video/sample_args.json (copied from image2video with enable_sound=true) required by build_sample(). Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Reads an arbitrary audio file via soundfile, resamples with scipy.signal.resample_poly (avoiding torchaudio, which is absent from the inference container), conforms the channel count, and trims or pads to an exact sample count. Covered by two new pytest tests. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
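The loading steps described in this commit can be sketched roughly as follows. This is a minimal illustration, not the PR's actual code: the function names `conform_audio` and `load_conditioning_audio_sketch`, the argument names, and the channel policy (downmix to mono, then repeat) are assumptions made for the example. Only `soundfile.read` and `scipy.signal.resample_poly` are taken from the commit message.

```python
from math import gcd

import numpy as np
from scipy.signal import resample_poly


def conform_audio(wave, src_rate, dst_rate, channels, num_samples):
    """Resample, conform channels, and trim/pad a (samples, channels) array.

    Hypothetical helper: names and channel policy are assumptions.
    """
    wave = np.asarray(wave, dtype=np.float32)
    if wave.ndim == 1:
        wave = wave[:, None]  # treat 1-D input as mono

    # Polyphase resampling with a reduced up/down ratio.
    if src_rate != dst_rate:
        g = gcd(src_rate, dst_rate)
        wave = resample_poly(wave, dst_rate // g, src_rate // g, axis=0)

    # Assumed channel policy: downmix to mono, then repeat to the target count.
    if wave.shape[1] != channels:
        mono = wave.mean(axis=1, keepdims=True)
        wave = np.repeat(mono, channels, axis=1)

    # Trim to the exact length, then zero-pad at the end if too short.
    wave = wave[:num_samples]
    pad = num_samples - wave.shape[0]
    if pad > 0:
        wave = np.pad(wave, ((0, pad), (0, 0)))
    return wave.astype(np.float32)


def load_conditioning_audio_sketch(path, dst_rate, channels, num_samples):
    """Read a file with soundfile, then conform it (hypothetical wrapper)."""
    import soundfile as sf  # imported lazily so the array helper works without it

    wave, src_rate = sf.read(path, dtype="float32", always_2d=True)
    return conform_audio(wave, src_rate, dst_rate, channels, num_samples)
```

Keeping the array logic separate from file I/O makes the exact-sample-count behaviour easy to unit test without audio fixtures.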
Add keyword-only parameter `condition_sound: bool = False` that selects mode "ts2v" (all sound conditioned, video generated) when True, preserving the existing "t2vs" default (joint generation). Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
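As a rough illustration of the keyword-only switch described here, the mode selection might look like the sketch below. The function name `select_sound_mode` is hypothetical; only the parameter `condition_sound` and the mode strings "ts2v" and "t2vs" come from the commit message.

```python
def select_sound_mode(*, condition_sound: bool = False) -> str:
    """Return "ts2v" when sound is a condition, else the "t2vs" default.

    Hypothetical helper; the real function and its other arguments are not shown
    in the commit message.
    """
    # "ts2v": all sound is conditioned and video is generated.
    # "t2vs": video and sound are generated jointly (existing default).
    return "ts2v" if condition_sound else "t2vs"
```

Because the parameter is keyword-only, existing positional call sites cannot accidentally flip the mode.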
The "AVAE" checkpoint registration pointed at nvidia/Cosmos3-Nano/sound_tokenizer, which is decoder-only (182 decoder keys, 0 encoder keys). With strict=False the 67 encoder params silently stayed at random init, so encode(input_audio) produced noise-dominated latents (encode-twice absdiff 0.79; decode(encode(sine)) corr 0.002) and audio_image2video output bore no relation to the conditioning clip. Sound generation (t2vs) was unaffected because it only decodes diffusion latents.

Point the AVAE source at nvidia/Cosmos3-Experimental/nano_diffusers_sound_encoder/sound_tokenizer, which ships the full AVAE: diffusers OobleckDecoder (decoder.block.*) plus the native SpecConvNeXt encoder (encoder.layers.*). _materialize_avae_ckpt already remaps decoder keys and passes the native encoder keys through unchanged.

Verified: AVAE round-trip sine corr 1.000, real-audio envelope corr 0.998 (25/25 taps), encode determinism absdiff 0.009; end-to-end A2V output audio envelope corr 0.940 vs the conditioning clip.

Also update the i2vs a2v example (hammer image + metallic-tapping audio).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
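The failure described above (a decoder-only checkpoint loading without error under strict=False) could be caught early by counting keys per prefix before loading. The sketch below is an assumption-laden illustration: `count_avae_keys` and `require_encoder` are hypothetical names, and the check operates on plain key strings rather than on the project's actual loader.

```python
from typing import Dict, Iterable


def count_avae_keys(keys: Iterable[str]) -> Dict[str, int]:
    """Count state-dict keys by top-level prefix (hypothetical helper)."""
    counts = {"encoder": 0, "decoder": 0, "other": 0}
    for key in keys:
        prefix = key.split(".", 1)[0]
        bucket = prefix if prefix in ("encoder", "decoder") else "other"
        counts[bucket] += 1
    return counts


def require_encoder(keys: Iterable[str]) -> None:
    """Raise if a checkpoint has no encoder.* keys (hypothetical guard).

    With strict=False, missing keys are skipped silently, so an explicit
    check is needed before relying on encode().
    """
    counts = count_avae_keys(keys)
    if counts["encoder"] == 0:
        raise ValueError(
            f"checkpoint is decoder-only ({counts['decoder']} decoder keys, "
            "0 encoder keys); encode() would run on randomly initialized weights"
        )
```

A guard like this would fail fast for audio-conditioned modes while leaving decode-only generation (t2vs) free to skip it.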
Revert checkpoint/doc references from nvidia/Cosmos3-Experimental (nano_diffusers_sound_encoder) back to the standard Cosmos3-Nano: drop the Cosmos3-Nano-SoundEncoder registry entry, point the AVAE sound_tokenizer source at nvidia/Cosmos3-Nano/sound_tokenizer, and drop the SoundEncoder rows from docs.

All A2V logic is unchanged (audio_image2video mode, sound_path input, load_conditioning_audio, condition_sound wiring, encoder-key passthrough in _materialize_avae_ckpt). The published Cosmos3-Nano sound_tokenizer will be updated to ship the full AVAE (encoder + decoder), at which point A2V works on the default checkpoint with no further changes.

Also remove the brainstorming design spec.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…documented Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
The native encoder.layers.* keys load when present (enabling audio_image2video); the current Cosmos3-Nano sound_tokenizer ships only the decoder, so A2V produces faithful audio only once the checkpoint is updated to include the encoder. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…d_tokenizer/ By default, use the sound_tokenizer/ co-located in the main model checkpoint (matched to the transformer; uses whatever encoder/decoder that checkpoint ships) instead of the global "AVAE" registry repo. Mirrors vlm_processor_from_checkpoint: after download_checkpoint(), materialize the legacy AVAE .ckpt from the bundled sound_tokenizer/ and point avae_path at it (download_checkpoint_v2 short-circuits local paths). Falls back to the registry when the checkpoint bundles no sound_tokenizer/. So Cosmos3-Super uses Super's sound_tokenizer, Nano uses Nano's. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
The sound_tokenizer node accepts an inference-only 'from_checkpoint' key (default True): True sources the AVAE from the loaded checkpoint's bundled sound_tokenizer/; False keeps the configured avae_path (the registered "AVAE" repo) even when the checkpoint bundles one. The key is popped before AVAEInterface instantiation. Set it in the inference model YAML to override. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
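Popping an inference-only key before instantiation, as this commit describes, might look like the sketch below. The function name `split_from_checkpoint` and the use of a plain dict for the config node are assumptions; the default of True reflects this commit message only (a later commit in this PR states a default of False).

```python
from typing import Any, Dict, Tuple


def split_from_checkpoint(
    node: Dict[str, Any], default: bool = True
) -> Tuple[Dict[str, Any], bool]:
    """Separate the inference-only 'from_checkpoint' key from constructor kwargs.

    Hypothetical helper: the real config node type is not shown in the PR text.
    """
    cfg = dict(node)  # shallow copy so the caller's config is not mutated
    from_checkpoint = bool(cfg.pop("from_checkpoint", default))
    return cfg, from_checkpoint
```

The returned `cfg` no longer contains the key, so it can be passed to a constructor that does not accept `from_checkpoint`.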
Default precedence: use the registered "AVAE" repo (configured avae_path) when one is registered; fall back to the loaded checkpoint's bundled sound_tokenizer/ only when no AVAE is registered. The inference-only sound_tokenizer.from_checkpoint key (default False) forces the bundled one even when an AVAE is registered. Backward-compatible: with the "AVAE" entry registered (default), all checkpoints use the registry AVAE exactly as before; bundled is opt-in (from_checkpoint:true) or used when the registration is removed. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Use the configured avae_path when set (registered AVAE); fall back to the checkpoint's bundled sound_tokenizer/ only when avae_path is empty, or when from_checkpoint:true forces it. Drops the s3-uri/registry lookup. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
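The precedence stated in this commit can be summarized in a small sketch. The function name `resolve_avae_source` and the `bundled_dir` argument are hypothetical. Raising when from_checkpoint is forced but no bundled sound_tokenizer/ exists is also an assumption, since the commit message does not say what happens in that case.

```python
from typing import Optional


def resolve_avae_source(
    avae_path: Optional[str],
    bundled_dir: Optional[str],
    from_checkpoint: bool = False,
) -> str:
    """Pick the AVAE source following the precedence in the commit message."""
    if from_checkpoint:
        # from_checkpoint: true forces the checkpoint's bundled sound_tokenizer/.
        if not bundled_dir:
            # Assumption: forcing a bundle that does not exist is an error.
            raise ValueError("from_checkpoint is set but no sound_tokenizer/ is bundled")
        return bundled_dir
    if avae_path:
        # A configured avae_path (registered AVAE) wins by default.
        return avae_path
    if bundled_dir:
        # Fall back to the bundled tokenizer only when avae_path is empty.
        return bundled_dir
    raise ValueError("no AVAE source: avae_path is empty and nothing is bundled")
```

For example, `resolve_avae_source("registry/avae", "ckpt/sound_tokenizer")` returns the configured path, while passing `from_checkpoint=True` returns the bundled directory.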
f40d9e9 to 3fb1fbe
Dinghow approved these changes Jun 12, 2026
No description provided.