[pull] main from huggingface:main#41

Open
pull[bot] wants to merge 437 commits into EricLBuehler:main from huggingface:main
Conversation

pull Bot commented Nov 19, 2024

See Commits and Changes for more details.


Created by pull[bot] (v2.0.0-alpha.4)

Can you help keep this open source service alive? 💖 Please sponsor : )

coderabbitai Bot commented May 8, 2025

Important

Review skipped

Bot user detected.

To trigger a single review, invoke the @coderabbitai review command.

You can disable this status message by setting the reviews.review_status to false in the CodeRabbit configuration file.


lucky-bai and others added 26 commits August 18, 2025 14:08
- Change tensor b from [1,2] row vector to [2,1] column vector
- Fix assertion to match expected result after column replacement
- Resolves shape mismatch error that prevented example from running
* Add simple atomics to ulong via atomic_uintx2 struct

* Remove u32::max restriction from metal device.set_seed
… builds (#3059)

* Use OUT_DIR for generated PTX bindings

* fix: fixed the out_dir cargo problem in examples

* fix: added imports in build.rs
* Initial metal-rs -> objc2-metal conversion

* Using objc2_metal bindings in metal kernels

* Use objc2_metal for mlx kernels

* Use objc2_metal for tests

* Use objc2_metal for metal benchmarks

* tidy

* Remove AllocationError. Use existing FailedToCreateResource

* All candle-metal-kernels tests passing

* Fix set_threadgroup_memory_length, fmt

* Update cargo tomls with objc2 libs

* Update candle-core metal usage

* impl Send/Sync for metal Device and Library structs

* tidy up imports

---------

Co-authored-by: Kyle Birnbaum <kb@huggingface.co>
* Add cpu flash attention

* Add test

* Format

* Fix docs shape
* put ug cuda behind cuda flag

* revert to ug 0.0.2 when on ios

* Only use ug if target_os is not ios (added to the existing wasm check)
Signed-off-by: zhanluxianshen <zhanluxianshen@163.com>
* Initial metal-rs -> objc2-metal conversion

* Using objc2_metal bindings in metal kernels

* Use objc2_metal for mlx kernels

* Use objc2_metal for tests

* Use objc2_metal for metal benchmarks

* tidy

* Remove AllocationError. Use existing FailedToCreateResource

* All candle-metal-kernels tests passing

* Fix set_threadgroup_memory_length, fmt

* Update cargo tomls with objc2 libs

* Update candle-core metal usage

* impl Send/Sync for metal Device and Library structs

* tidy up imports

* Move .metal files to src/kernels/

* Begin refactor of candle-metal-kernels

* Refactor candle-core metal usage

* Delete old tmp folder

* Extract library to its own file

* Tidy up imports

* Refactor metal buffer concepts

* Refactor metal commandbuffer

* Refactor metal blit and compute command encoders

* Refactor metal compute pipeline

* Refactor metal commands struct

* Refactor metal Kernels

* Refactor kernel Source

* Extract MetalKernelError to err file

* Rename kernels/ -> metal_src/

* Add kernels folder for specific kernel call impls

* Move unary impls into kernels/unary.rs

* Move binary impls into kernels/binary.rs

* Move ConstantValues impl to metal::library

* Move sdpa impls into kernels::sdpa

* Move quantized impls into kernels::quantized

* Move cast impls into kernels::cast

* Move reduce impls into kernels::reduce

Technically not all of these are reduce ops.
That will have to wait for another day.

* Move copy into kernels::cast. Simplify imports

* Move affine impls into kernels::affine

* Move ternary impls into kernels::ternary

* Move indexing impls into kernels::indexing

* Simplify imports

* Move random impls into kernels::random

* Move conv impls into kernels::convolution

Again several impls are not specifically convolution.
Another day.

* Move fill impl into kernels::fill
Fix broken slice_scatter example in basics.rs
Update cudarc to v0.17.3 which has support for CUDA 13.
ivarflakstad and others added 30 commits May 28, 2026 22:23
* Add scalar support to metal binary kernels

* Add Layout::is_scalar and is_scalar_like helpers. Let kernel name decide dispatch in metal binary
Co-authored-by: ivarflakstad <69173633+ivarflakstad@users.noreply.github.com>
* Add vec_dot benchmark

* Improve neon cpu impl for f32/f16/bf16

Uses inline asm where rust API is unstable.
When fp16/bf16 target features are missing we use load into f32 fallback.
Slight improvements to generic vec_dot algorithm (easier for compiler to unroll/vectorize)

* Simplify Cpu* simd trait abstractions

* Add debug print to track down windows CI bug. Remove later

* hail mary avx attempt without an avx machine

* Fix vec_dot_f16/bf16 CurrentCpuF16/BF16::STEP usage

* Temporarily break AVX CurrentCpuBF16 to investigate

* bug confirmed, adding transmute

* Use is_x86_feature_detected instead of cfg gate
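
The last bullet swaps a compile-time `cfg` gate for runtime detection. A minimal sketch of that pattern follows; `dot`, `dot_avx`, and `dot_generic` are hypothetical names for illustration and are not taken from the candle sources. The AVX variant here is plain Rust compiled with the `avx` target feature enabled, not hand-written intrinsics.

```rust
// Hypothetical AVX-enabled variant: the attribute lets the compiler emit AVX
// instructions for this function only, so it must not run on CPUs without AVX.
#[cfg(any(target_arch = "x86", target_arch = "x86_64"))]
#[target_feature(enable = "avx")]
unsafe fn dot_avx(a: &[f32], b: &[f32]) -> f32 {
    a.iter().zip(b).map(|(x, y)| x * y).sum()
}

// Portable fallback used on every other target or when AVX is absent.
fn dot_generic(a: &[f32], b: &[f32]) -> f32 {
    a.iter().zip(b).map(|(x, y)| x * y).sum()
}

/// Picks the implementation at runtime instead of at compile time, so a single
/// binary works on machines with and without AVX.
pub fn dot(a: &[f32], b: &[f32]) -> f32 {
    #[cfg(any(target_arch = "x86", target_arch = "x86_64"))]
    {
        if std::arch::is_x86_feature_detected!("avx") {
            // SAFETY: AVX support was verified on the line above.
            return unsafe { dot_avx(a, b) };
        }
    }
    dot_generic(a, b)
}
```

With a `cfg(target_feature = "avx")` gate, the fast path only exists when the whole crate is built with AVX enabled; the runtime check keeps it available in default builds.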
* Cap allocations in the GGUF loader

Fixes #3533. The loader passed caller-controlled length fields into
allocation calls without bounds checks. Adds size caps matching
ggml-org/llama.cpp#19856, a remaining-bytes check, a GGML_MAX_DIMS
cap on tensor dimensions, and a recursion depth cap on Value::Array.

* Avoid re-seeking on every GGUF length check

Capture the file size once in Content::read and pass it through
read_string/Value::read instead of seeking to the end and back on
every length-prefixed read, which roughly doubled load time.
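
The two commits above can be illustrated with a small sketch of a length-prefixed read that checks the length before allocating and receives the file size as a parameter instead of seeking to the end each time. This is an assumption-laden illustration, not the loader's code: `MAX_STRING_LEN` and its value are hypothetical, and the `read_string` signature here is invented for the example.

```rust
use std::io::{self, Read, Seek};

/// Hypothetical cap for illustration; the real limits are not shown in the source.
const MAX_STRING_LEN: u64 = 1 << 20;

fn read_u64<R: Read>(r: &mut R) -> io::Result<u64> {
    let mut buf = [0u8; 8];
    r.read_exact(&mut buf)?;
    Ok(u64::from_le_bytes(buf))
}

/// Reads a u64 length prefix followed by that many UTF-8 bytes.
/// `file_size` is captured once by the caller and passed through.
fn read_string<R: Read + Seek>(r: &mut R, file_size: u64) -> io::Result<String> {
    let len = read_u64(r)?;
    let remaining = file_size.saturating_sub(r.stream_position()?);
    // Reject the length before allocating: it must fit both the cap and the
    // bytes that are actually left in the file.
    if len > MAX_STRING_LEN || len > remaining {
        return Err(io::Error::new(
            io::ErrorKind::InvalidData,
            format!("string length {len} exceeds limits"),
        ));
    }
    let mut buf = vec![0u8; len as usize];
    r.read_exact(&mut buf)?;
    String::from_utf8(buf).map_err(|e| io::Error::new(io::ErrorKind::InvalidData, e))
}
```

Without the check, a crafted file could declare a length near `u64::MAX` and force a huge allocation before any payload bytes are read.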

---------

Co-authored-by: ivarflakstad <69173633+ivarflakstad@users.noreply.github.com>
)

Updates the requirements on [enterpolation](https://github.com/NicolasKlenert/enterpolation) to permit the latest version.
- [Release notes](https://github.com/NicolasKlenert/enterpolation/releases)
- [Changelog](https://github.com/NicolasKlenert/enterpolation/blob/main/RELEASES.md)
- [Commits](https://github.com/NicolasKlenert/enterpolation/commits/v0.3.0)

---
updated-dependencies:
- dependency-name: enterpolation
  dependency-version: 0.3.0
  dependency-type: direct:production
...

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
Co-authored-by: ivarflakstad <69173633+ivarflakstad@users.noreply.github.com>
* chore(deps): update web-sys requirement from =0.3.70 to =0.3.99

Updates the requirements on [web-sys](https://github.com/wasm-bindgen/wasm-bindgen) to permit the latest version.
- [Release notes](https://github.com/wasm-bindgen/wasm-bindgen/releases)
- [Changelog](https://github.com/wasm-bindgen/wasm-bindgen/blob/main/CHANGELOG.md)
- [Commits](https://github.com/wasm-bindgen/wasm-bindgen/commits)

---
updated-dependencies:
- dependency-name: web-sys
  dependency-version: 0.3.99
  dependency-type: direct:production
...

Signed-off-by: dependabot[bot] <support@github.com>

* Use set_stroke_style_str over deprecated set_stroke_style

---------

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
Co-authored-by: Ivar Flakstad <69173633+ivarflakstad@users.noreply.github.com>
Updates the requirements on [gloo](https://github.com/rustwasm/gloo) to permit the latest version.
- [Release notes](https://github.com/rustwasm/gloo/releases)
- [Changelog](https://github.com/ranile/gloo/blob/master/CHANGELOG.md)
- [Commits](https://github.com/rustwasm/gloo/commits)

---
updated-dependencies:
- dependency-name: gloo
  dependency-version: 0.11.0
  dependency-type: direct:production
...

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
Co-authored-by: ivarflakstad <69173633+ivarflakstad@users.noreply.github.com>
* Add LFM2.5 (Liquid Foundation Model 2.5) support

Add model implementation and example for LiquidAI's LFM2.5 hybrid
architecture that combines attention and short convolution layers.

Supports LFM2.5-1.2B and LFM2.5-1.2B-Thinking variants.

* Fix typo and use `crate::utils::build_causal_mask(seq_len, index_pos, device)` as in quantized_lfm2.rs

* Apply rustfmt

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* Fix clippy warnings (manual_div_ceil, large_enum_variant)

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

---------

Co-authored-by: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
* Add contiguous binary add bench

* Add NdIter - efficient multidim iterator

* Update cpu unary and binary op traits. Add binary default vec and scalar vec impls. Remove const delegation flags

* Add more optimized cpu kernels for binary add and binary scalar add

* Wire up NdIter in cpu unary/binary. Also add binary scalar vec path.

* Move NdIter to its own file

* clippy
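
For readers unfamiliar with the idea behind a multidimensional iterator, here is a minimal sketch of one that walks a strided layout in row-major order and yields flat storage offsets. `StridedOffsets` is a hypothetical type written for this illustration; it is not the `NdIter` added in the commit and makes no claim about how that type is implemented.

```rust
/// Hypothetical sketch: yields flat offsets for a strided N-d layout,
/// visiting logical indices in row-major order.
struct StridedOffsets {
    shape: Vec<usize>,
    strides: Vec<usize>,
    index: Vec<usize>,
    offset: usize,
    remaining: usize,
}

impl StridedOffsets {
    fn new(shape: &[usize], strides: &[usize]) -> Self {
        assert_eq!(shape.len(), strides.len());
        Self {
            shape: shape.to_vec(),
            strides: strides.to_vec(),
            index: vec![0; shape.len()],
            offset: 0,
            remaining: shape.iter().product(),
        }
    }
}

impl Iterator for StridedOffsets {
    type Item = usize;

    fn next(&mut self) -> Option<usize> {
        if self.remaining == 0 {
            return None;
        }
        let current = self.offset;
        self.remaining -= 1;
        // Advance like an odometer, starting at the innermost dimension.
        for dim in (0..self.shape.len()).rev() {
            self.index[dim] += 1;
            self.offset += self.strides[dim];
            if self.index[dim] < self.shape[dim] {
                break;
            }
            // This dimension rolled over: undo its contribution and carry.
            self.offset -= self.strides[dim] * self.shape[dim];
            self.index[dim] = 0;
        }
        Some(current)
    }
}
```

For example, shape `[2, 3]` with strides `[1, 2]` (a transposed view of a 3x2 row-major buffer) yields the offsets 0, 2, 4, 1, 3, 5. The offset is updated incrementally, so no per-element multiplication over all dimensions is needed.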
* Add Candle specific rayon threadpool. Use less cores by default (for now)

* Use candle threadpool via device.with_context(|| {}) in quantized-qwen3 example

* Have cpu flash attn use shared candle threadpool
* metal: fix CPU readback race under concurrent command submission

CPU readbacks (MetalStorage::to_cpu, QMetalStorage::{data, dequantize})
encoded a blit into the shared rotating command buffer and then called
wait_until_completed, which commits the current buffer and waits on the
last in-flight one. With concurrent submissions from other threads, the
in-flight list can be taken by another thread's flush_and_wait between
the blit encode and the wait, causing the reader to return before its
blit has executed and to read stale or unwritten destination memory.

Fix: add Commands::flush_and_wait_current, which commits the current
command buffer while holding the state lock and waits on that specific
buffer. Command queues execute buffers in commit order, so completion
of this buffer also covers anything committed before it by other
threads. Completed buffers are drained from the in-flight list to keep
it bounded under readback-heavy workloads; errored buffers are kept for
reporting. The three readback sites now use it.

The new concurrent readback tests fail on every thread within a few
iterations against the previous behavior.

* Remove mistake
…en (fixes NaN logits) (#3625)

The Metal steel attention prefill kernel (`call_sdpa_full`) produces NaN
logits at periodic sequence lengths when invoked with BOTH an explicit
additive mask AND `do_causal = true`.

An explicit causal/sliding-window mask already encodes causality. Passing
`do_causal = true` additionally enables the in-kernel causal handling
(the `kb_lim` block-truncation plus the causal-triangle masking loop in
`scaled_dot_product_attention.metal`). The two causal mechanisms overlap
and, at lengths where the partial K-tile boundary lines up unfavourably, a
query row ends up with every score masked to -inf. Its softmax normalizer
`sum(exp) == 0`, and the final `row_bin_op<DivOp>(sum_score)` computes
`0 / 0 = NaN`, poisoning the logits. The failure is periodic in total
length and quant-independent.

Fix: when an explicit mask is present, skip the redundant in-kernel
`do_causal` path. The mask still enforces causality, so there is no
correctness change, and no measurable performance change.

Verified on Apple M5 Max (Gemma-4-E4B via mistral.rs):
- deterministic 38,000-char-prompt repro: FAIL -> PASS (Q4K and Q8_0 ISQ)
- 48-length sweep across 5 periodic windows: 19/48 FAIL -> 0/48 (Q4K),
  0/48 (Q8_0)
- clean-decode throughput unchanged (~70-74 tok/s)
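
The failure mode described above, a row whose scores are all -inf, can be reproduced on the CPU with a few lines. This is a sketch of the arithmetic only; `softmax_row` is a hypothetical helper and does not mirror the Metal kernel's code.

```rust
/// Hypothetical row softmax for illustration. The running max is replaced by
/// 0 when it is not finite, so a fully masked row reaches the 0 / 0 division
/// described in the commit message.
fn softmax_row(scores: &[f32]) -> Vec<f32> {
    let max = scores.iter().copied().fold(f32::NEG_INFINITY, f32::max);
    let shift = if max.is_finite() { max } else { 0.0 };
    // exp(-inf) is exactly 0, so masked positions contribute nothing.
    let exps: Vec<f32> = scores.iter().map(|&s| (s - shift).exp()).collect();
    let sum: f32 = exps.iter().sum();
    // If every score was masked, sum == 0 and each entry becomes 0 / 0 = NaN.
    exps.iter().map(|&e| e / sum).collect()
}
```

A partly masked row such as `[0.0, -inf]` normalizes to `[1.0, 0.0]`, while a row of only -inf values returns NaN in every position, which then propagates into the logits.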
* metal: route q_seq > 1 through full SDPA kernel, not vector

* fmt: apply rustfmt to sdpa regression tests

* metal: drop sdpa-dispatch comment and regression tests per review

Per @ivarflakstad on #3479. The functional fix collapses to two
predicate edits in candle-nn/src/ops.rs.

---------

Co-authored-by: ivarflakstad <69173633+ivarflakstad@users.noreply.github.com>
* quantized: fix UAF in `as_t_slice` when called with `Cow::Owned`

* quantized: drop explanatory comments and Miri test

---------

Co-authored-by: ivarflakstad <69173633+ivarflakstad@users.noreply.github.com>
…ble buffer builders on metal (#3542)

* [Metal] Feature-gated debug groups + buffer labels (default-off)

Opt-in Metal GPU-capture instrumentation behind `debug-labels` feature flag (candle-metal-kernels) and
`metal-debug-labels` (candle-core, implies `metal`)

* metal: trim restating rustdoc on BufferBuilder

---------

Co-authored-by: ivarflakstad <69173633+ivarflakstad@users.noreply.github.com>
Updates the requirements on [safetensors](https://github.com/huggingface/safetensors) to permit the latest version.
- [Release notes](https://github.com/huggingface/safetensors/releases)
- [Changelog](https://github.com/safetensors/safetensors/blob/main/RELEASE.md)
- [Commits](safetensors/safetensors@v0.7.0...v0.8.0)

---
updated-dependencies:
- dependency-name: safetensors
  dependency-version: 0.8.0
  dependency-type: direct:production
...

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
* Add neon optimized path to `BlockQ8K::from_float`

* Give quantization tests a bit of wiggle-room via `compare_with_error`

* Use vcvtaq_s32_f32 over vcvtnq_s32_f32 to match f32::round behaviour

* Retain max-magnitude sign correctly. Restore tests to their original state
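
The intrinsic swap matters because the two conversions break ties differently: `vcvtaq_s32_f32` rounds ties away from zero, as `f32::round` does, while `vcvtnq_s32_f32` rounds ties to even. The scalar sketch below shows the difference using standard library methods only; the function names are hypothetical.

```rust
/// Ties round away from zero, matching `f32::round`.
fn to_i32_ties_away(x: f32) -> i32 {
    x.round() as i32
}

/// Ties round to the nearest even integer (requires Rust 1.77 or newer).
fn to_i32_ties_even(x: f32) -> i32 {
    x.round_ties_even() as i32
}
```

The two agree everywhere except exact halves: 2.5 becomes 3 with ties-away and 2 with ties-to-even, and -2.5 becomes -3 and -2 respectively. A quantizer that mixes the two modes between its scalar and SIMD paths would therefore disagree on those inputs.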
* Add persistent barrier based threadpool with per CPU cluster tree fan-out/reduce

* Add GgmlType::vec_dot_2 and ::vec_dot_4 (with default impls). Add reusable scratch buffer for lhs of quantized matmul

* call_lock now contains gen counter, avoiding silent dependency that root_workers[0] == 0. Add Barrier pool thread early exit path.

* Add thread parking based on spin limit + per thread panic catching

* Store panic payload immediately to avoid race
* add dot.rs; rm 4 dupe dot impl; migrate dot_32 call site; lean dispatch standard cpu_flash

* softmax helper; impl in 13 locations; dot error compound bug resolved

* drop f64 path from qwen3 dispatch

* rm 2 line internal comment

* fmt standard and dot

* remove dead sum; 12 generic calls switched to D-type native conversion

* fmt

* attention score keeps full f32 range/precision

* no exclusion of f64 so have bail
…around entire fwd pass instead to access the shared threadpool (#3634)
* Parameter cache for CUDA kernel launch

* Enforce strict h2d/d2h invariants during CUDA graph capture

* Revise guard drop

* Remove clear_cuda_param_cache and d2h blocker

* Add capture check for all memcpy actions

* Unified guard
* Support embedding forward pass for ggml quants

* Add cuda
* Fix BlockQ8K::from_float iscale/precision

* Add neon dotprod optimization utils and optimized BlockQ8_0::vec_dot_4 path with neon vec_dot_4_q8_0_q8_0

* Optimized BlockQ4K::vec_dot_4 path with neon vec_dot_4_q4k_q8k

* Optimized BlockQ6K::vec_dot_4 path with neon vec_dot_4_q6k_q8k

* Add interleaved/repacked matmul_q4k_x8 with neon optimized path

* Slightly improved neon quantize_row_q8k

* Add matmul_q4k_x8 path to QTensor matmul call

* No need to unsqueeze and transpose before reshape

* Have candle barrier pool respond to different env var than rayon

* Use candle threadpool in cpu causal FA. Other causal FA improvements

* Add serial path to cpu rms norm

* Add serial path to cpu rope

* Update quantized test. Crossed fingers non-neon targets were affected identically

* clippy

* Fix pack_to_q4kx8 alignment. Return BlockQ4Kx8. Add vec_to_bytes util.

* Update quantize_q8k test.

* Make vec_to_bytes copy data to ensure no UB

* Standard vec initialization in pack_to_q4kx8

* Use zerocopy to ensure correct alignment (already a transitive dep)

* Remove unused byte/vec conversion fns
Labels

⤵️ pull merge-conflict Resolve conflicts manually
