[pull] main from huggingface:main by pull[bot] · Pull Request #41 · EricLBuehler/candle

pull · 2024-11-19T09:14:01Z

See Commits and Changes for more details.

Created by pull[bot] (v2.0.0-alpha.4)

Can you help keep this open source service alive? 💖 Please sponsor : )

coderabbitai · 2025-05-08T13:06:18Z

Important

Review skipped

Bot user detected.

To trigger a single review, invoke the @coderabbitai review command.

You can disable this status message by setting the reviews.review_status to false in the CodeRabbit configuration file.

🪧 Tips

Chat

There are 3 ways to chat with CodeRabbit:

Review comments: Directly reply to a review comment made by CodeRabbit. Example:
- I pushed a fix in commit <commit_id>, please review it.
- Explain this complex logic.
- Open a follow-up GitHub issue for this discussion.
Files and specific lines of code (under the "Files changed" tab): Tag @coderabbitai in a new review comment at the desired location with your query. Examples:
- @coderabbitai explain this code block.
- @coderabbitai modularize this function.
PR comments: Tag @coderabbitai in a new PR comment to ask questions about the PR branch. For the best results, please provide a very specific query, as very limited context is provided in this mode. Examples:
- @coderabbitai gather interesting stats about this repository and render them as a table. Additionally, render a pie chart showing the language distribution in the codebase.
- @coderabbitai read src/utils.ts and explain its main purpose.
- @coderabbitai read the files in the src/scheduler package and generate a class diagram using mermaid and a README in the markdown format.
- @coderabbitai help me debug CodeRabbit configuration file.

Support

Need help? Join our Discord community for assistance with any issues or questions.

Note: Be mindful of the bot's finite context window. It's strongly recommended to break down tasks such as reading entire modules into smaller chunks. For a focused discussion, use review comments to chat about specific files and their changes, instead of using the PR comments.

CodeRabbit Commands (Invoked using PR comments)

@coderabbitai pause to pause the reviews on a PR.
@coderabbitai resume to resume the paused reviews.
@coderabbitai review to trigger an incremental review. This is useful when automatic reviews are disabled for the repository.
@coderabbitai full review to do a full review from scratch and review all the files again.
@coderabbitai summary to regenerate the summary of the PR.
@coderabbitai generate sequence diagram to generate a sequence diagram of the changes in this PR.
@coderabbitai resolve to resolve all the CodeRabbit review comments.
@coderabbitai configuration to show the current CodeRabbit configuration for the repository.
@coderabbitai help to get help.

Other keywords and placeholders

Add @coderabbitai ignore anywhere in the PR description to prevent this PR from being reviewed.
Add @coderabbitai summary to generate the high-level summary at a specific location in the PR description.
Add @coderabbitai anywhere in the PR title to generate the title automatically.

CodeRabbit Configuration File (`.coderabbit.yaml`)

You can programmatically configure CodeRabbit by adding a .coderabbit.yaml file to the root of your repository.
Please see the configuration documentation for more information.
If your editor has YAML language server enabled, you can add the path at the top of this file to enable auto-completion and validation: # yaml-language-server: $schema=https://coderabbit.ai/integrations/schema.v2.json

Documentation and Community

Visit our Documentation for detailed information on how to use CodeRabbit.
Join our Discord Community to get help, request features, and share feedback.
Follow us on X/Twitter for updates and announcements.

* Initial metal-rs -> objc2-metal conversion
* Using objc2_metal bindings in metal kernels
* Use objc2_metal for mlx kernels
* Use objc2_metal for tests
* Use objc2_metal for metal benchmarks
* tidy
* Remove AllocationError. Use existing FailedToCreateResource
* All candle-metal-kernels tests passing
* Fix set_threadgroup_memory_length, fmt
* Update cargo tomls with objc2 libs
* Update candle-core metal usage
* impl Send/Sync for metal Device and Library structs
* tidy up imports
---------
Co-authored-by: Kyle Birnbaum <kb@huggingface.co>

* Initial metal-rs -> objc2-metal conversion
* Using objc2_metal bindings in metal kernels
* Use objc2_metal for mlx kernels
* Use objc2_metal for tests
* Use objc2_metal for metal benchmarks
* tidy
* Remove AllocationError. Use existing FailedToCreateResource
* All candle-metal-kernels tests passing
* Fix set_threadgroup_memory_length, fmt
* Update cargo tomls with objc2 libs
* Update candle-core metal usage
* impl Send/Sync for metal Device and Library structs
* tidy up imports
* Move .metal files to src/kernels/
* Begin refactor of candle-metal-kernels
* Refactor candle-core metal usage
* Delete old tmp folder
* Extract library to its own file
* Tidy up imports
* Refactor metal buffer concepts
* Refactor metal commandbuffer
* Refactor metal blit and compute command encoders
* Refactor metal compute pipeline
* Refactor metal commands struct
* Refactor metal Kernels
* Refactor kernel Source
* Extract MetalKernelError to err file
* Rename kernels/ -> metal_src/
* Add kernels folder for specific kernel call impls
* Move unary impls into kernels/unary.rs
* Move binary impls into kernels/binary.rs
* Move ConstantValues impl to metal::library
* Move sdpa impls into kernels::sdpa
* Move quantized impls into kernels::quantized
* Move cast impls into kernels::cast
* Move reduce impls into kernels::reduce
  Technically not all of these are reduce ops. That will have to wait for another day.
* Move copy into kernels::cast. Simplify imports
* Move affine impls into kernels::affine
* Move ternary impls into kernels::ternary
* Move indexing impls into kernels::indexing
* Simplify imports
* Move random impls into kernels::random
* Move conv impls into kernels::convolution
  Again several impls are not specifically convolution. Another day.
* Move fill impl into kernels::fill

* Add vec_dot benchmark
* Improve neon cpu impl for f32/f16/bf16
  Uses inline asm where rust API is unstable. When fp16/bf16 target features are missing we use load into f32 fallback. Slight improvements to generic vec_dot algorithm (easier for compiler to unroll/vectorize)
* Simplify Cpu* simd trait abstractions
* Add debug print to track down windows CI bug. Remove later
* hail mary avx attempt without an avx machine
* Fix vec_dot_f16/bf16 CurrentCpuF16/BF16::STEP usage
* Temporarily break AVX CurrentCpuBF16 to investigate
* bug confirmed, adding transmute
* Use is_x86_feature_detected instead of cfg gate
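
The last bullet replaces a compile-time `cfg` gate with runtime CPU feature detection. A minimal sketch of that pattern in plain Rust follows. It is not candle's code: `has_avx`, `dot`, and `dot_scalar` are hypothetical names, and the "AVX" branch simply reuses the scalar loop so the example stays portable.

```rust
/// Runtime check: true only when the CPU we are running on reports AVX.
#[cfg(any(target_arch = "x86", target_arch = "x86_64"))]
fn has_avx() -> bool {
    std::arch::is_x86_feature_detected!("avx")
}

/// On non-x86 targets the macro does not exist, so report "no AVX".
#[cfg(not(any(target_arch = "x86", target_arch = "x86_64")))]
fn has_avx() -> bool {
    false
}

/// Portable fallback dot product.
fn dot_scalar(a: &[f32], b: &[f32]) -> f32 {
    a.iter().zip(b).map(|(x, y)| x * y).sum()
}

/// Dispatch at runtime instead of at compile time.
fn dot(a: &[f32], b: &[f32]) -> f32 {
    if has_avx() {
        // Assumption: a real implementation would call an AVX kernel here.
        dot_scalar(a, b)
    } else {
        dot_scalar(a, b)
    }
}

fn main() {
    println!("avx detected: {}", has_avx());
    println!("dot = {}", dot(&[1.0, 2.0, 3.0], &[4.0, 5.0, 6.0]));
}
```

With a `cfg` gate the choice is fixed by the build target's enabled features. With runtime detection the same binary can pick a path per machine.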

* Cap allocations in the GGUF loader
  Fixes #3533. The loader passed caller-controlled length fields into allocation calls without bounds checks. Adds size caps matching ggml-org/llama.cpp#19856, a remaining-bytes check, a GGML_MAX_DIMS cap on tensor dimensions, and a recursion depth cap on Value::Array.
* Avoid re-seeking on every GGUF length check
  Capture the file size once in Content::read and pass it through read_string/Value::read instead of seeking to the end and back on every length-prefixed read, which roughly doubled load time.
---------
Co-authored-by: ivarflakstad <69173633+ivarflakstad@users.noreply.github.com>
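
The two checks described above (a fixed size cap plus a remaining-bytes check, with the total size captured once) can be sketched in plain Rust as follows. This is an illustration under assumptions, not the candle loader: `MAX_STRING_LEN`, `read_string`, and `parse` are hypothetical, and the cap value is arbitrary.

```rust
use std::io::{self, Cursor, Read};

// Hypothetical cap; the real limits are not given in the commit message.
const MAX_STRING_LEN: u64 = 1 << 20;

fn invalid(msg: &str) -> io::Error {
    io::Error::new(io::ErrorKind::InvalidData, msg.to_string())
}

/// Read a u64 little-endian length prefix followed by that many UTF-8 bytes.
/// `remaining` tracks how many bytes are left, so no seeking is needed.
fn read_string<R: Read>(reader: &mut R, remaining: &mut u64) -> io::Result<String> {
    if *remaining < 8 {
        return Err(invalid("truncated length prefix"));
    }
    let mut len_buf = [0u8; 8];
    reader.read_exact(&mut len_buf)?;
    *remaining -= 8;
    let len = u64::from_le_bytes(len_buf);
    // Reject untrusted lengths before allocating anything.
    if len > MAX_STRING_LEN {
        return Err(invalid("string length exceeds cap"));
    }
    if len > *remaining {
        return Err(invalid("string length exceeds remaining bytes"));
    }
    let mut buf = vec![0u8; len as usize];
    reader.read_exact(&mut buf)?;
    *remaining -= len;
    String::from_utf8(buf).map_err(|e| io::Error::new(io::ErrorKind::InvalidData, e))
}

/// Capture the total size once and pass it through.
fn parse(bytes: &[u8]) -> io::Result<String> {
    let mut remaining = bytes.len() as u64;
    read_string(&mut Cursor::new(bytes), &mut remaining)
}

fn main() {
    let mut good = 5u64.to_le_bytes().to_vec();
    good.extend_from_slice(b"hello");
    println!("{:?}", parse(&good));
    // A hostile length field is rejected without a huge allocation.
    println!("{:?}", parse(&u64::MAX.to_le_bytes()).is_err());
}
```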

) Updates the requirements on [enterpolation](https://github.com/NicolasKlenert/enterpolation) to permit the latest version.
- [Release notes](https://github.com/NicolasKlenert/enterpolation/releases)
- [Changelog](https://github.com/NicolasKlenert/enterpolation/blob/main/RELEASES.md)
- [Commits](https://github.com/NicolasKlenert/enterpolation/commits/v0.3.0)
---
updated-dependencies:
- dependency-name: enterpolation
  dependency-version: 0.3.0
  dependency-type: direct:production
...
Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
Co-authored-by: ivarflakstad <69173633+ivarflakstad@users.noreply.github.com>

* chore(deps): update web-sys requirement from =0.3.70 to =0.3.99
  Updates the requirements on [web-sys](https://github.com/wasm-bindgen/wasm-bindgen) to permit the latest version.
  - [Release notes](https://github.com/wasm-bindgen/wasm-bindgen/releases)
  - [Changelog](https://github.com/wasm-bindgen/wasm-bindgen/blob/main/CHANGELOG.md)
  - [Commits](https://github.com/wasm-bindgen/wasm-bindgen/commits)
  ---
  updated-dependencies:
  - dependency-name: web-sys
    dependency-version: 0.3.99
    dependency-type: direct:production
  ...
  Signed-off-by: dependabot[bot] <support@github.com>
* Use set_stroke_style_str over deprecated set_stroke_style
---------
Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
Co-authored-by: Ivar Flakstad <69173633+ivarflakstad@users.noreply.github.com>

Updates the requirements on [gloo](https://github.com/rustwasm/gloo) to permit the latest version.
- [Release notes](https://github.com/rustwasm/gloo/releases)
- [Changelog](https://github.com/ranile/gloo/blob/master/CHANGELOG.md)
- [Commits](https://github.com/rustwasm/gloo/commits)
---
updated-dependencies:
- dependency-name: gloo
  dependency-version: 0.11.0
  dependency-type: direct:production
...
Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
Co-authored-by: ivarflakstad <69173633+ivarflakstad@users.noreply.github.com>

* Add LFM2.5 (Liquid Foundation Model 2.5) support
  Add model implementation and example for LiquidAI's LFM2.5 hybrid architecture that combines attention and short convolution layers. Supports LFM2.5-1.2B and LFM2.5-1.2B-Thinking variants.
* fix typo issue and crate::utils::build_causal_mask(seq_len, index_pos, device) as quantized_lfm2.rs
* Apply rustfmt
  Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
* Fix clippy warnings (manual_div_ceil, large_enum_variant)
  Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
---------
Co-authored-by: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* Add contiguous binary add bench
* Add NdIter - efficient multidim iterator
* Update cpu unary and binary op traits. Add binary default vec and scalar vec impls. Remove const delegation flags
* Add more optimized cpu kernels for binary add and binary scalar add
* Wire up NdIter in cpu unary/binary. Also add binary scalar vec path.
* Move NdIter to its own file
* clippy

* metal: fix CPU readback race under concurrent command submission
  CPU readbacks (MetalStorage::to_cpu, QMetalStorage::{data, dequantize}) encoded a blit into the shared rotating command buffer and then called wait_until_completed, which commits the current buffer and waits on the last in-flight one. With concurrent submissions from other threads, the in-flight list can be taken by another thread's flush_and_wait between the blit encode and the wait, causing the reader to return before its blit has executed and to read stale or unwritten destination memory.
  Fix: add Commands::flush_and_wait_current, which commits the current command buffer while holding the state lock and waits on that specific buffer. Command queues execute buffers in commit order, so completion of this buffer also covers anything committed before it by other threads. Completed buffers are drained from the in-flight list to keep it bounded under readback-heavy workloads; errored buffers are kept for reporting. The three readback sites now use it.
  The new concurrent readback tests fail on every thread within a few iterations against the previous behavior.
* Remove mistake

…en (fixes NaN logits) (#3625)
The Metal steel attention prefill kernel (`call_sdpa_full`) produces NaN logits at periodic sequence lengths when invoked with BOTH an explicit additive mask AND `do_causal = true`.
An explicit causal/sliding-window mask already encodes causality. Passing `do_causal = true` additionally enables the in-kernel causal handling (the `kb_lim` block-truncation plus the causal-triangle masking loop in `scaled_dot_product_attention.metal`). The two causal mechanisms overlap and, at lengths where the partial K-tile boundary lines up unfavourably, a query row ends up with every score masked to -inf. Its softmax normalizer `sum(exp) == 0`, and the final `row_bin_op<DivOp>(sum_score)` computes `0 / 0 = NaN`, poisoning the logits. The failure is periodic in total length and quant-independent.
Fix: when an explicit mask is present, skip the redundant in-kernel `do_causal` path. The mask still enforces causality, so there is no correctness change, and no measurable performance change.
Verified on Apple M5 Max (Gemma-4-E4B via mistral.rs):
- deterministic 38,000-char-prompt repro: FAIL -> PASS (Q4K and Q8_0 ISQ)
- 48-length sweep across 5 periodic windows: 19/48 FAIL -> 0/48 (Q4K), 0/48 (Q8_0)
- clean-decode throughput unchanged (~70-74 tok/s)
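
The arithmetic behind the failure (a row where every score is -inf gives `sum(exp) == 0`, then `0 / 0 = NaN`) can be reproduced on the CPU with a few lines of plain Rust. This is only a sketch of the numerics, not the Metal kernel, and `softmax_row` is a hypothetical helper. The actual fix described above is to skip the redundant `do_causal` path, not to change the softmax.

```rust
/// Naive row softmax: shift by the largest finite score, exponentiate,
/// then divide by the row sum.
fn softmax_row(scores: &[f32]) -> Vec<f32> {
    let max = scores
        .iter()
        .cloned()
        .filter(|s| s.is_finite())
        .fold(f32::NEG_INFINITY, f32::max);
    // If no score is finite there is nothing to shift by.
    let shift = if max.is_finite() { max } else { 0.0 };
    let exps: Vec<f32> = scores.iter().map(|s| (s - shift).exp()).collect();
    let sum: f32 = exps.iter().sum();
    // With a fully masked row, sum == 0 and every entry is 0 / 0.
    exps.iter().map(|e| e / sum).collect()
}

fn main() {
    // One key left unmasked: a well-defined distribution.
    println!("{:?}", softmax_row(&[0.0, f32::NEG_INFINITY]));
    // Every key masked to -inf: the whole row becomes NaN.
    println!("{:?}", softmax_row(&[f32::NEG_INFINITY; 3]));
}
```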

@ivarflakstad

* metal: route q_seq > 1 through full SDPA kernel, not vector
* fmt: apply rustfmt to sdpa regression tests
* metal: drop sdpa-dispatch comment and regression tests per review
  Per @ivarflakstad on #3479. The functional fix collapses to two predicate edits in candle-nn/src/ops.rs.
---------
Co-authored-by: ivarflakstad <69173633+ivarflakstad@users.noreply.github.com>

…ble buffer builders on metal (#3542)
* [Metal] Feature-gated debug groups + buffer labels (default-off)
  Opt-in Metal GPU-capture instrumentation behind `debug-labels` feature flag (candle-metal-kernels) and `metal-debug-labels` (candle-core, implies `metal`)
* metal: trim restating rustdoc on BufferBuilder
---------
Co-authored-by: ivarflakstad <69173633+ivarflakstad@users.noreply.github.com>

Updates the requirements on [safetensors](https://github.com/huggingface/safetensors) to permit the latest version.
- [Release notes](https://github.com/huggingface/safetensors/releases)
- [Changelog](https://github.com/safetensors/safetensors/blob/main/RELEASE.md)
- [Commits](safetensors/safetensors@v0.7.0...v0.8.0)
---
updated-dependencies:
- dependency-name: safetensors
  dependency-version: 0.8.0
  dependency-type: direct:production
...
Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>

* Add neon optimized path to `BlockQ8K::from_float`
* Give quantization tests a bit of wiggle-room via `compare_with_error`
* Use vcvtaq_s32_f32 over vcvtnq_s32_f32 to match f32::round behaviour
* Retain max-magnitude sign correctly. Restore tests to their original state
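
The rounding bullet comes down to how ties are broken: `vcvtaq_s32_f32` rounds to nearest with ties away from zero, which is what `f32::round` does, while `vcvtnq_s32_f32` rounds ties to even. A portable sketch of the difference, using only `std` and no NEON intrinsics (the helper names are hypothetical; `round_ties_even` needs Rust 1.77 or newer):

```rust
/// Ties away from zero, like f32::round (and vcvtaq_s32_f32 on NEON).
fn round_half_away(x: f32) -> i32 {
    x.round() as i32
}

/// Ties to even (the rounding used by vcvtnq_s32_f32 on NEON).
fn round_half_even(x: f32) -> i32 {
    x.round_ties_even() as i32
}

fn main() {
    for x in [0.5f32, 1.5, 2.5, -2.5] {
        println!(
            "{x}: away = {}, even = {}",
            round_half_away(x),
            round_half_even(x)
        );
    }
}
```

The two modes agree everywhere except on exact halves, so a quantizer that is compared against an `f32::round` reference can differ by one step on those inputs.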

* Add persistent barrier based threadpool with per CPU cluster tree fan-out/reduce
* Add GgmlType::vec_dot_2 and ::vec_dot_4 (with default impls). Add reusable scratch buffer for lhs of quantized matmul
* call_lock now contains gen counter, avoiding silent dependency that root_workers[0] == 0. Add Barrier pool thread early exit path.
* Add thread parking based on spin limit + per thread panic catching
* Store panic payload immediately to avoid race

* add dot.rs; rm 4 dupe dot impl; migrate dot_32 call site; lean dispatch standard cpu_flash
* softmax helper; impl in 13 locations; dot error compound bug resolved
* drop f64 path from qwen3 dispatch
* rm 2 line internal comment
* fmt standard and dot
* remove dead sum; 12 generic calls switched to D-type native conversion
* fmt
* attention score keeps full f32 range/precision
* no exclusion of f64 so have bail

* Fix BlockQ8K::from_float iscale/precision
* Add neon dotprod optimization utils and optimized BlockQ8_0::vec_dot_4 path with neon vec_dot_4_q8_0_q8_0
* Optimized BlockQ4K::vec_dot_4 path with neon vec_dot_4_q4k_q8k
* Optimized BlockQ6K::vec_dot_4 path with neon vec_dot_4_q6k_q8k
* Add interleaved/repacked matmul_q4k_x8 with neon optimized path
* Slightly improved neon quantize_row_q8k
* Add matmul_q4k_x8 path to QTensor matmul call
* No need to unsqueeze and transpose before reshape
* Have candle barrier pool respond to different env var than rayon
* Use candle threadpool in cpu causal FA. Other causal FA improvements
* Add serial path to cpu rms norm
* Add serial path to cpu rope
* Update quantized test. Crossed fingers non-neon targets were affected identically
* clippy
* Fix pack_to_q4kx8 alignment. Return BlockQ4Kx8. Add vec_to_bytes util.
* Update quantize_q8k test.
* Make vec_to_bytes copy data to ensure no UB
* Standard vec initialization in pack_to_q4kx8
* Use zerocopy to ensure correct alignment (already a transitive dep)
* Remove unused byte/vec conversion fns

pull Bot added the "⤵️ pull" and "merge-conflict" (Resolve conflicts manually) labels on Nov 19, 2024

EricLBuehler force-pushed the main branch from bac2055 to 96279d5 on January 8, 2025 17:25

lucky-bai and others added 26 commits August 18, 2025 14:08

Fix wasm build by enabling getrandom wasm_js backend (#3055)

f1286e6

pick seed <= u32::MAX when using metal (#3045)

16e1d73

Fix broken slice_scatter example in basics.rs

730fa9c

- Change tensor b from [1,2] row vector to [2,1] column vector
- Fix assertion to match expected result after column replacement
- Resolves shape mismatch error that prevented example from running
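
To see why the shape of `b` matters, here is a plain-Rust stand-in for scattering a source into the columns of a destination. It is not the candle `slice_scatter` API and the values are made up. `scatter_cols` is a hypothetical helper that, like a scatter along dim 1, requires the source to have the same number of rows as the destination.

```rust
/// Write `src` into the columns of `dst` starting at column `start`.
/// `src` must have as many rows as `dst`; only the column count may differ.
fn scatter_cols(
    dst: &[Vec<f32>],
    src: &[Vec<f32>],
    start: usize,
) -> Result<Vec<Vec<f32>>, String> {
    if src.len() != dst.len() {
        return Err(format!(
            "shape mismatch: dst has {} rows, src has {}",
            dst.len(),
            src.len()
        ));
    }
    let mut out = dst.to_vec();
    for (out_row, src_row) in out.iter_mut().zip(src) {
        let end = start + src_row.len();
        if end > out_row.len() {
            return Err("source columns do not fit in destination".to_string());
        }
        out_row[start..end].copy_from_slice(src_row);
    }
    Ok(out)
}

fn main() {
    let a = vec![vec![1.0, 2.0], vec![3.0, 4.0]]; // shape [2, 2]
    let b_row = vec![vec![9.0, 9.0]]; // shape [1, 2]: rejected
    let b_col = vec![vec![9.0], vec![9.0]]; // shape [2, 1]: replaces one column
    println!("{:?}", scatter_cols(&a, &b_row, 1));
    println!("{:?}", scatter_cols(&a, &b_col, 1));
}
```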

Run cargo fmt on basics.rs

5d6407f

Metal device.set_seed full u64 support (#3067)

98c64c0

* Add simple atomics to ulong via atomic_uintx2 struct
* Remove u32::max restriction from metal device.set_seed

disable affine fp8 bench on metal as it is not supported yet (#3065)

03e9ce0

Bench using chosen device only (#3066)

02cf3eb

Fixes metal randn determinism. Ensure we use the 2 atomic_uints buffer correctly (#3069)

fd350c4

build: Make build.rs candle-kernels compatible with Nix and sandboxed builds (#3059)

bf82629

* Use OUT_DIR for generated PTX bindings
* fix: fixed the out_dir cargo problem in examples
* fix: added imports in build.rs

Fused CPU attention kernels (~4x performance increase) (#2973)

d4a9179

* Add cpu flash attention
* Add test
* Format
* Fix docs shape

Fix typos

41b1e95

Merge pull request #3072 from szepeviktor/typos

93845ed

Fix iOS app store validation issues (#3071)

390b87a

* put ug cuda behind cuda flag
* revert to ug 0.0.2 when on ios
* if only use ug if target_os is not ios added to wasm check already there

Merge pull request #3038 from NoodlesOfWrath/gradstore_insert_id

402782c

clean candle-core typos.

f62e725

Signed-off-by: zhanluxianshen <zhanluxianshen@163.com>

Ensure metal tensors are send/sync via thread isolated command buffer map

0bbf9c7

Update kv_cache.rs (#3035)

3b35cfc

Merge pull request #3077 from zhanluxianshen/typo-candle-core

87fadf6

clean candle-core typos.

Fix metal exports (#3081)

0950959

Merge branch 'main' into metal-tensor-fix-send-sync

a7fbc63

Merge pull request #3079 from huggingface/metal-tensor-fix-send-sync

65055f6

[Metal] Ensure tensors are send/sync

Merge pull request #3062 from davenpi/fix/core-basics-example

b1dbce0

Fix broken slice_scatter example in basics.rs

Add CUDA 13 support (#3078)

8045af9

Update cudarc to v0.17.3 which has support for CUDA 13.

Fix indentation

97594d2

ivarflakstad and others added 30 commits May 28, 2026 22:23

Binary broadcast scalar support (#3487)

09b9145

* Add scalar support to metal binary kernels
* Add Layout::is_scalar and is_scalar_like helpers. Let kernel name decide dispatch in metal binary

metal: add copy2d kernels for I16 / I32 (#3478)

5404348

Co-authored-by: ivarflakstad <69173633+ivarflakstad@users.noreply.github.com>

metal: fix RMSNorm NaN on large-magnitude F32 inputs (#3477)

1a63038

[CPU] Shared threadpool + less contention (#3584)

d29cb03

* Add Candle specific rayon threadpool. Use less cores by default (for now)
* Use candle threadpool via device.with_context(|| {}) in quantized-qwen3 example
* Have cpu flash attn use shared candle threadpool

Verify gguf tensor size before allocating (#3585)

5c66da8

Metal SDPA: pass storage offsets in bytes, not elements (#3599)

0c58953
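
The commit title above contrasts element offsets with byte offsets. As a small illustration of the conversion only (not the candle change itself), an element offset is scaled by the element size to get a byte offset, which is the unit Metal buffer bindings expect. `byte_offset` is a hypothetical helper, and `u16` stands in for a 16-bit float type.

```rust
use std::mem::size_of;

/// Convert an offset counted in elements of `T` into an offset in bytes.
fn byte_offset<T>(element_offset: usize) -> usize {
    element_offset * size_of::<T>()
}

fn main() {
    // The same element offset lands at different byte positions per dtype.
    println!("f32: {} bytes", byte_offset::<f32>(3));
    println!("16-bit: {} bytes", byte_offset::<u16>(3));
}
```

Passing the element count where bytes are expected would point the kernel at the wrong location whenever the element is wider than one byte and the offset is non-zero.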

quantized: fix UAF in as_t_slice when called with Cow::Owned (#3493)

c1e6756

* quantized: fix UAF in `as_t_slice` when called with `Cow::Owned`
* quantized: drop explanatory comments and Miri test
---------
Co-authored-by: ivarflakstad <69173633+ivarflakstad@users.noreply.github.com>

Add neon sdot behind target_feature = "dotprod" gate (#3630)

376d54c

Remove flash attn specific rayon threadpool. Use device.with_context around entire fwd pass instead to access the shared threadpool (#3634)

29a15c2

Support cuda 13.3 (#3642)

a9667ca

Parameter cache for CUDA kernel launch (#3598)

f53ed3b

* Parameter cache for CUDA kernel launch
* Enforce strict h2d/d2h invariants during CUDA graph capture
* Revise guard drop
* Remove clear_cuda_param_cache and d2h blocker
* Add capture check for all memcpy actions
* Unified guard

Support embedding forward pass for ggml quants (#3644)

5152ef6

* Support embedding forward pass for ggml quants
* Add cuda

Add paged flash-attn kernels (#3655)

9bcfd98

Bump candle version to 0.11.0 (#3658)

31f35b1

pull[bot] wants to merge 437 commits into EricLBuehler:main from huggingface:main

20 participants
