[pull] main from huggingface:main#41
Open
pull[bot] wants to merge 437 commits into
- Change tensor b from [1,2] row vector to [2,1] column vector
- Fix assertion to match expected result after column replacement
- Resolves shape mismatch error that prevented example from running
* Add simple atomics to ulong via atomic_uintx2 struct * Remove u32::max restriction from metal device.set_seed
… builds (#3059) * Use OUT_DIR for generated PTX bindings * fix: fixed the out_dir cargo problem in examples * fix: added imports in build.rs
* Initial metal-rs -> objc2-metal conversion
* Using objc2_metal bindings in metal kernels
* Use objc2_metal for mlx kernels
* Use objc2_metal for tests
* Use objc2_metal for metal benchmarks
* tidy
* Remove AllocationError. Use existing FailedToCreateResource
* All candle-metal-kernels tests passing
* Fix set_threadgroup_memory_length, fmt
* Update cargo tomls with objc2 libs
* Update candle-core metal usage
* impl Send/Sync for metal Device and Library structs
* tidy up imports
---------
Co-authored-by: Kyle Birnbaum <kb@huggingface.co>
* Add cpu flash attention * Add test * Format * Fix docs shape
* put ug cuda behind cuda flag
* revert to ug 0.0.2 when on ios
* only use ug if target_os is not ios; added to the wasm check already there
Signed-off-by: zhanluxianshen <zhanluxianshen@163.com>
* Initial metal-rs -> objc2-metal conversion
* Using objc2_metal bindings in metal kernels
* Use objc2_metal for mlx kernels
* Use objc2_metal for tests
* Use objc2_metal for metal benchmarks
* tidy
* Remove AllocationError. Use existing FailedToCreateResource
* All candle-metal-kernels tests passing
* Fix set_threadgroup_memory_length, fmt
* Update cargo tomls with objc2 libs
* Update candle-core metal usage
* impl Send/Sync for metal Device and Library structs
* tidy up imports
* Move .metal files to src/kernels/
* Begin refactor of candle-metal-kernels
* Refactor candle-core metal usage
* Delete old tmp folder
* Extract library to its own file
* Tidy up imports
* Refactor metal buffer concepts
* Refactor metal commandbuffer
* Refactor metal blit and compute command encoders
* Refactor metal compute pipeline
* Refactor metal commands struct
* Refactor metal Kernels
* Refactor kernel Source
* Extract MetalKernelError to err file
* Rename kernels/ -> metal_src/
* Add kernels folder for specific kernel call impls
* Move unary impls into kernels/unary.rs
* Move binary impls into kernels/binary.rs
* Move ConstantValues impl to metal::library
* Move sdpa impls into kernels::sdpa
* Move quantized impls into kernels::quantized
* Move cast impls into kernels::cast
* Move reduce impls into kernels::reduce
  Technically not all of these are reduce ops. That will have to wait for another day.
* Move copy into kernels::cast. Simplify imports
* Move affine impls into kernels::affine
* Move ternary impls into kernels::ternary
* Move indexing impls into kernels::indexing
* Simplify imports
* Move random impls into kernels::random
* Move conv impls into kernels::convolution
  Again several impls are not specifically convolution. Another day.
* Move fill impl into kernels::fill
clean candle-core typos.
[Metal] Ensure tensors are send/sync
Fix broken slice_scatter example in basics.rs
Update cudarc to v0.17.3 which has support for CUDA 13.
* Add scalar support to metal binary kernels * Add Layout::is_scalar and is_scalar_like helpers. Let kernel name decide dispatch in metal binary
Co-authored-by: ivarflakstad <69173633+ivarflakstad@users.noreply.github.com>
* Add vec_dot benchmark
* Improve neon cpu impl for f32/f16/bf16
  Uses inline asm where rust API is unstable. When fp16/bf16 target features are missing we use load into f32 fallback. Slight improvements to generic vec_dot algorithm (easier for compiler to unroll/vectorize)
* Simplify Cpu* simd trait abstractions
* Add debug print to track down windows CI bug. Remove later
* hail mary avx attempt without an avx machine
* Fix vec_dot_f16/bf16 CurrentCpuF16/BF16::STEP usage
* Temporarily break AVX CurrentCpuBF16 to investigate
* bug confirmed, adding transmute
* Use is_x86_feature_detected instead of cfg gate
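To illustrate two ideas from the commit above (a generic dot product shaped so the compiler can unroll/vectorize it, and a runtime CPU feature check instead of a compile-time `cfg` gate), here is a minimal sketch. The names `vec_dot_f32`, `has_avx2`, and the lane count `LANES` are hypothetical and are not taken from the candle source; the actual kernels in the PR may be structured quite differently.

```rust
/// Dot product accumulated in LANES independent partial sums.
/// Independent accumulators remove the serial dependency on a single
/// sum, which tends to make unrolling/vectorization easier.
/// (Hypothetical helper, not candle's implementation.)
pub fn vec_dot_f32(a: &[f32], b: &[f32]) -> f32 {
    assert_eq!(a.len(), b.len(), "inputs must have equal length");
    const LANES: usize = 8; // assumed lane count for illustration
    let mut acc = [0.0f32; LANES];

    let a_chunks = a.chunks_exact(LANES);
    let b_chunks = b.chunks_exact(LANES);
    // Elements that do not fill a whole chunk are handled separately.
    let a_tail = a_chunks.remainder();
    let b_tail = b_chunks.remainder();

    for (ca, cb) in a_chunks.zip(b_chunks) {
        for i in 0..LANES {
            acc[i] += ca[i] * cb[i];
        }
    }

    let mut sum: f32 = acc.iter().sum();
    for (x, y) in a_tail.iter().zip(b_tail) {
        sum += x * y;
    }
    sum
}

/// Runtime feature detection on x86 targets.
#[cfg(any(target_arch = "x86", target_arch = "x86_64"))]
pub fn has_avx2() -> bool {
    is_x86_feature_detected!("avx2")
}

/// On other architectures the x86 macro does not exist, so report false.
#[cfg(not(any(target_arch = "x86", target_arch = "x86_64")))]
pub fn has_avx2() -> bool {
    false
}
```

A caller could branch on `has_avx2()` once and pick a specialised kernel, falling back to `vec_dot_f32` otherwise. Note that summing in lanes changes the floating-point summation order relative to a plain sequential loop, so results may differ in the last bits.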
* Cap allocations in the GGUF loader
  Fixes #3533. The loader passed caller-controlled length fields into allocation calls without bounds checks. Adds size caps matching ggml-org/llama.cpp#19856, a remaining-bytes check, a GGML_MAX_DIMS cap on tensor dimensions, and a recursion depth cap on Value::Array.
* Avoid re-seeking on every GGUF length check
  Capture the file size once in Content::read and pass it through read_string/Value::read instead of seeking to the end and back on every length-prefixed read, which roughly doubled load time.
---------
Co-authored-by: ivarflakstad <69173633+ivarflakstad@users.noreply.github.com>
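The pattern described above (reject an untrusted length before allocating, using both a fixed cap and the number of bytes still available) can be sketched roughly as follows. This is an illustration only: the function signature, the `MAX_STRING_LEN` value, and the `remaining` parameter are assumptions and do not reproduce the actual `read_string` in candle's GGUF loader.

```rust
use std::io::{self, Read};

/// Hypothetical cap for illustration; the real limits follow the
/// llama.cpp change referenced in the commit.
const MAX_STRING_LEN: u64 = 1 << 20;

/// Reads a little-endian u64 length prefix followed by that many UTF-8 bytes.
/// `remaining` is the number of bytes left in the input before this call,
/// computed once by the caller instead of re-seeking on every read.
pub fn read_string<R: Read>(reader: &mut R, remaining: u64) -> io::Result<String> {
    let mut len_buf = [0u8; 8];
    reader.read_exact(&mut len_buf)?;
    let len = u64::from_le_bytes(len_buf);
    let remaining = remaining.saturating_sub(8);

    // Validate the untrusted length BEFORE allocating.
    if len > MAX_STRING_LEN || len > remaining {
        return Err(io::Error::new(
            io::ErrorKind::InvalidData,
            format!("string length {len} exceeds cap or remaining bytes ({remaining})"),
        ));
    }

    let mut buf = vec![0u8; len as usize];
    reader.read_exact(&mut buf)?;
    String::from_utf8(buf).map_err(|e| io::Error::new(io::ErrorKind::InvalidData, e))
}
```

Without the check, a crafted file with a length field of `u64::MAX` would reach the `vec!` allocation directly; with it, the read fails with `InvalidData` before any large buffer is requested.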
) Updates the requirements on [enterpolation](https://github.com/NicolasKlenert/enterpolation) to permit the latest version. - [Release notes](https://github.com/NicolasKlenert/enterpolation/releases) - [Changelog](https://github.com/NicolasKlenert/enterpolation/blob/main/RELEASES.md) - [Commits](https://github.com/NicolasKlenert/enterpolation/commits/v0.3.0) --- updated-dependencies: - dependency-name: enterpolation dependency-version: 0.3.0 dependency-type: direct:production ... Signed-off-by: dependabot[bot] <support@github.com> Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com> Co-authored-by: ivarflakstad <69173633+ivarflakstad@users.noreply.github.com>
* chore(deps): update web-sys requirement from =0.3.70 to =0.3.99 Updates the requirements on [web-sys](https://github.com/wasm-bindgen/wasm-bindgen) to permit the latest version. - [Release notes](https://github.com/wasm-bindgen/wasm-bindgen/releases) - [Changelog](https://github.com/wasm-bindgen/wasm-bindgen/blob/main/CHANGELOG.md) - [Commits](https://github.com/wasm-bindgen/wasm-bindgen/commits) --- updated-dependencies: - dependency-name: web-sys dependency-version: 0.3.99 dependency-type: direct:production ... Signed-off-by: dependabot[bot] <support@github.com> * Use set_stroke_style_str over deprecated set_stroke_style --------- Signed-off-by: dependabot[bot] <support@github.com> Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com> Co-authored-by: Ivar Flakstad <69173633+ivarflakstad@users.noreply.github.com>
Updates the requirements on [gloo](https://github.com/rustwasm/gloo) to permit the latest version. - [Release notes](https://github.com/rustwasm/gloo/releases) - [Changelog](https://github.com/ranile/gloo/blob/master/CHANGELOG.md) - [Commits](https://github.com/rustwasm/gloo/commits) --- updated-dependencies: - dependency-name: gloo dependency-version: 0.11.0 dependency-type: direct:production ... Signed-off-by: dependabot[bot] <support@github.com> Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com> Co-authored-by: ivarflakstad <69173633+ivarflakstad@users.noreply.github.com>
* Add LFM2.5 (Liquid Foundation Model 2.5) support Add model implementation and example for LiquidAI's LFM2.5 hybrid architecture that combines attention and short convolution layers. Supports LFM2.5-1.2B and LFM2.5-1.2B-Thinking variants. * fix typo issue and crate::utils::build_causal_mask(seq_len, index_pos, device) as quantized_lfm2.rs * Apply rustfmt Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> * Fix clippy warnings (manual_div_ceil, large_enum_variant) Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> --------- Co-authored-by: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
* Add contiguous binary add bench
* Add NdIter - efficient multidim iterator
* Update cpu unary and binary op traits. Add binary default vec and scalar vec impls. Remove const delegation flags
* Add more optimized cpu kernels for binary add and binary scalar add
* Wire up NdIter in cpu unary/binary. Also add binary scalar vec path.
* Move NdIter to its own file
* clippy
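For readers unfamiliar with what a multidimensional iterator does in this context, the following is a minimal sketch of one common approach: walking a strided layout in logical row-major order and yielding storage offsets, updating the offset incrementally rather than recomputing a dot product of index and strides at every step. The type `StridedOffsets` is hypothetical and is not the `NdIter` added by the commit, whose design is not shown on this page.

```rust
/// Yields storage offsets for every element of a strided layout,
/// in logical row-major order. (Hypothetical illustration.)
pub struct StridedOffsets {
    shape: Vec<usize>,
    strides: Vec<usize>,
    index: Vec<usize>,
    offset: usize,
    remaining: usize,
}

impl StridedOffsets {
    pub fn new(shape: &[usize], strides: &[usize]) -> Self {
        assert_eq!(shape.len(), strides.len(), "rank mismatch");
        Self {
            shape: shape.to_vec(),
            strides: strides.to_vec(),
            index: vec![0; shape.len()],
            offset: 0,
            // The empty product is 1, so a rank-0 scalar yields one offset.
            remaining: shape.iter().product(),
        }
    }
}

impl Iterator for StridedOffsets {
    type Item = usize;

    fn next(&mut self) -> Option<usize> {
        if self.remaining == 0 {
            return None;
        }
        let current = self.offset;
        self.remaining -= 1;

        // Advance like an odometer, starting from the innermost dimension.
        for d in (0..self.shape.len()).rev() {
            self.index[d] += 1;
            self.offset += self.strides[d];
            if self.index[d] < self.shape[d] {
                break;
            }
            // This dimension overflowed: rewind it and carry outward.
            self.offset -= self.strides[d] * self.shape[d];
            self.index[d] = 0;
        }
        Some(current)
    }
}
```

For example, a transposed view with shape `[2, 3]` and strides `[1, 2]` visits offsets `0, 2, 4, 1, 3, 5`, while a contiguous `[2, 2]` layout with strides `[2, 1]` visits `0, 1, 2, 3`.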
* Add Candle specific rayon threadpool. Use less cores by default (for now)
* Use candle threadpool via device.with_context(|| {}) in quantized-qwen3 example
* Have cpu flash attn use shared candle threadpool
* metal: fix CPU readback race under concurrent command submission
CPU readbacks (MetalStorage::to_cpu, QMetalStorage::{data, dequantize})
encoded a blit into the shared rotating command buffer and then called
wait_until_completed, which commits the current buffer and waits on the
last in-flight one. With concurrent submissions from other threads, the
in-flight list can be taken by another thread's flush_and_wait between
the blit encode and the wait, causing the reader to return before its
blit has executed and to read stale or unwritten destination memory.
Fix: add Commands::flush_and_wait_current, which commits the current
command buffer while holding the state lock and waits on that specific
buffer. Command queues execute buffers in commit order, so completion
of this buffer also covers anything committed before it by other
threads. Completed buffers are drained from the in-flight list to keep
it bounded under readback-heavy workloads; errored buffers are kept for
reporting. The three readback sites now use it.
The new concurrent readback tests fail on every thread within a few
iterations against the previous behavior.
* Remove mistake
…en (fixes NaN logits) (#3625)
The Metal steel attention prefill kernel (`call_sdpa_full`) produces NaN logits at periodic sequence lengths when invoked with BOTH an explicit additive mask AND `do_causal = true`.
An explicit causal/sliding-window mask already encodes causality. Passing `do_causal = true` additionally enables the in-kernel causal handling (the `kb_lim` block-truncation plus the causal-triangle masking loop in `scaled_dot_product_attention.metal`). The two causal mechanisms overlap and, at lengths where the partial K-tile boundary lines up unfavourably, a query row ends up with every score masked to -inf. Its softmax normalizer `sum(exp) == 0`, and the final `row_bin_op<DivOp>(sum_score)` computes `0 / 0 = NaN`, poisoning the logits. The failure is periodic in total length and quant-independent.
Fix: when an explicit mask is present, skip the redundant in-kernel `do_causal` path. The mask still enforces causality, so there is no correctness change, and no measurable performance change.
Verified on Apple M5 Max (Gemma-4-E4B via mistral.rs):
- deterministic 38,000-char-prompt repro: FAIL -> PASS (Q4K and Q8_0 ISQ)
- 48-length sweep across 5 periodic windows: 19/48 FAIL -> 0/48 (Q4K), 0/48 (Q8_0)
- clean-decode throughput unchanged (~70-74 tok/s)
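The arithmetic behind the failure can be reproduced on the CPU in a few lines. The sketch below uses a deliberately naive softmax (no max subtraction) so that the `sum(exp) == 0` followed by `0 / 0` path described above is visible; `softmax_row` is a hypothetical helper and is not the Metal kernel's code.

```rust
/// Naive softmax over one attention row: exp(score) / sum(exp(score)).
/// Simplified on purpose to expose the 0 / 0 case; not the kernel's code.
pub fn softmax_row(scores: &[f32]) -> Vec<f32> {
    let exps: Vec<f32> = scores.iter().map(|&s| s.exp()).collect();
    // If every score is -inf, every exp is 0.0 and so is the normalizer.
    let sum: f32 = exps.iter().sum();
    // 0.0 / 0.0 is NaN under IEEE 754, so a fully masked row turns into NaNs.
    exps.iter().map(|&e| e / sum).collect()
}
```

A row such as `[0.0, -inf]` still normalizes to `[1.0, 0.0]`, because at least one key remains visible. A row where both masking mechanisms together leave nothing visible, `[-inf, -inf, -inf]`, comes back as all NaN, which is the state the commit message attributes to the overlapping causal paths.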
* metal: route q_seq > 1 through full SDPA kernel, not vector * fmt: apply rustfmt to sdpa regression tests * metal: drop sdpa-dispatch comment and regression tests per review Per @ivarflakstad on #3479. The functional fix collapses to two predicate edits in candle-nn/src/ops.rs. --------- Co-authored-by: ivarflakstad <69173633+ivarflakstad@users.noreply.github.com>
* quantized: fix UAF in `as_t_slice` when called with `Cow::Owned` * quantized: drop explanatory comments and Miri test --------- Co-authored-by: ivarflakstad <69173633+ivarflakstad@users.noreply.github.com>
…ble buffer builders on metal (#3542) * [Metal] Feature-gated debug groups + buffer labels (default-off) Opt-in Metal GPU-capture instrumentation behind `debug-labels` feature flag (candle-metal-kernels) and `metal-debug-labels` (candle-core, implies `metal`) * metal: trim restating rustdoc on BufferBuilder --------- Co-authored-by: ivarflakstad <69173633+ivarflakstad@users.noreply.github.com>
Updates the requirements on [safetensors](https://github.com/huggingface/safetensors) to permit the latest version. - [Release notes](https://github.com/huggingface/safetensors/releases) - [Changelog](https://github.com/safetensors/safetensors/blob/main/RELEASE.md) - [Commits](safetensors/safetensors@v0.7.0...v0.8.0) --- updated-dependencies: - dependency-name: safetensors dependency-version: 0.8.0 dependency-type: direct:production ... Signed-off-by: dependabot[bot] <support@github.com> Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
* Add neon optimized path to `BlockQ8K::from_float`
* Give quantization tests a bit of wiggle-room via `compare_with_error`
* Use vcvtaq_s32_f32 over vcvtnq_s32_f32 to match f32::round behaviour
* Retain max-magnitude sign correctly. Restore tests to their original state
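The rounding distinction in the third bullet is easy to check in scalar Rust. `vcvtaq_s32_f32` rounds to nearest with ties away from zero, which is what `f32::round` does, whereas `vcvtnq_s32_f32` rounds ties to even, which corresponds to `f32::round_ties_even`. The two wrapper functions below are hypothetical names used only to show where the results diverge; they assume a toolchain recent enough to provide `round_ties_even` (Rust 1.77 or later).

```rust
/// Scalar analogue of vcvtaq_s32_f32: nearest, ties away from zero.
pub fn round_half_away(x: f32) -> i32 {
    x.round() as i32
}

/// Scalar analogue of vcvtnq_s32_f32: nearest, ties to even.
pub fn round_half_even(x: f32) -> i32 {
    x.round_ties_even() as i32
}
```

The two only disagree on exact halves: `2.5` becomes `3` with ties-away and `2` with ties-to-even, and `0.5` becomes `1` versus `0`. For a quantizer whose scalar reference path uses `f32::round`, a ties-to-even SIMD path would therefore produce occasional off-by-one quantized values.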
* Add persistent barrier based threadpool with per CPU cluster tree fan-out/reduce * Add GgmlType::vec_dot_2 and ::vec_dot_4 (with default impls). Add reusable scratch buffer for lhs of quantized matmul * call_lock now contains gen counter, avoiding silent dependency that root_workers[0] == 0. Add Barrier pool thread early exit path. * Add thread parking based on spin limit + per thread panic catching * Store panic payload immediately to avoid race
* add dot.rs; rm 4 dupe dot impl; migrate dot_32 call site; lean dispatch standard cpu_flash * softmax helper; impl in 13 locations; dot error compound bug resolved * drop f64 path from qwen3 dispatch * rm 2 line internal comment * fmt standard and dot * remove dead sum; 12 generic calls switched to D-type native conversion * fmt * attention score keeps full f32 range/precision * no exclusion of f64 so have bail
…around entire fwd pass instead to access the shared threadpool (#3634)
* Parameter cache for CUDA kernel launch * Enforce strict h2d/d2h invariants during CUDA graph capture * Revise guard drop * Remove clear_cuda_param_cache and d2h blocker * Add capture check for all memcpy actions * Unified guard
* Support embedding forward pass for ggml quants * Add cuda
* Fix BlockQ8K::from_float iscale/precision * Add neon dotprod optimization utils and optimized BlockQ8_0::vec_dot_4 path with neon vec_dot_4_q8_0_q8_0 * Optimized BlockQ4K::vec_dot_4 path with neon vec_dot_4_q4k_q8k * Optimized BlockQ6K::vec_dot_4 path with neon vec_dot_4_q6k_q8k * Add interleaved/repacked matmul_q4k_x8 with neon optimized path * Slightly improved neon quantize_row_q8k * Add matmul_q4k_x8 path to QTensor matmul call * No need to unsqueeze and transpose before reshape * Have candle barrier pool respond to different env var than rayon * Use candle threadpool in cpu causal FA. Other causal FA improvements * Add serial path to cpu rms norm * Add serial path to cpu rope * Update quantized test. Crossed fingers non-neon targets were affected identically * clippy * Fix pack_to_q4kx8 alignment. Return BlockQ4Kx8. Add vec_to_bytes util. * Update quantize_q8k test. * Make vec_to_bytes copy data to ensure no UB * Standard vec initialization in pack_to_q4kx8 * Use zerocopy to ensure correct alignment (already a transitive dep) * Remove unused byte/vec conversion fns
See Commits and Changes for more details.
Created by
pull[bot] (v2.0.0-alpha.4)
Can you help keep this open source service alive? 💖 Please sponsor : )