Bump llama.cpp to 2d97363 (b9870), release v0.8.32 #64

Merged: nyo16 merged 1 commit into master from bump-llama-cpp-b9870 on Jul 4, 2026

nyo16 (Owner) commented Jul 4, 2026

Summary

Updates the vendor/llama.cpp submodule from f708a5b2c (b9846) to 2d973636e (b9870) — 24 upstream commits — and releases v0.8.32.

API compatibility

No NIF changes were required. ggml/include/ggml.h, ggml/include/ggml-backend.h, common/common.h, common/chat.h,
common/json-schema-to-grammar.h, common/sampling.h, and common/speculative.h are all unchanged. The only touched header the binding compiles
against is include/llama.h, and its diff is purely additive: a new llama_ftype_name() helper and a new llama_model_ftype() getter (#25134) —
the binding calls neither.
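
As a rough illustration of what "purely additive" means for a header diff, the sketch below scans a unified diff and reports whether it removes any line. `is_purely_additive` is a hypothetical helper written for this note; it is not part of the binding or of llama.cpp, and the check is textual only. A diff with no removals can still change the ABI (for example, a new field added to a struct passed by value), so this supplements a manual read of the diff rather than replacing it.

```cpp
#include <sstream>
#include <string>

// Hypothetical helper: returns true when a unified diff contains no removed
// lines. File headers ("--- a/...", "+++ b/...") are skipped because they
// name the files and are not content changes.
// Caveat: a removed line whose own text starts with "-- " would be rendered
// as "--- ..." and mistaken for a header; this sketch accepts that edge case.
bool is_purely_additive(const std::string& unified_diff) {
    std::istringstream in(unified_diff);
    std::string line;
    while (std::getline(in, line)) {
        if (line.rfind("--- ", 0) == 0 || line.rfind("+++ ", 0) == 0) {
            continue;  // file header, not a change line
        }
        if (!line.empty() && line[0] == '-') {
            return false;  // a line was removed or modified
        }
    }
    return true;
}
```

Feeding it the output of `git diff` for a single header between the two submodule revisions would be one way to use it; a `false` result means the header needs a closer look before assuming the NIF still compiles unchanged.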

Notable upstream changes

  • llama API: add llama_model_ftype() / llama_ftype_name() for reading a model's quantization type (#25134)
  • model: register t_layer_inp for qwen3next (#25141)
  • chat: trim messages sent to the StepFun parser, fixing long reasoning loops (#25238)
  • spec/dflash: support spec-draft-p-min in DFlash (#25246)
  • CUDA: topk-moe fusion for 288 experts (#25267); remove redundant copies after gated_delta_net (#23940); __restrict__ + PDL for
    FlashAttention (#25185); fix KQ mask stride truncation/overflow in flash_attn_mask_to_KV_max (#24945); fix get_rows_back for >65535 rows
    (#25103); fix Gemma E4B MTP FlashAttention (#25148)
  • ggml-cpu: AVX2 optimization for the nvfp4 dot product using a UE4M3 LUT (#23961)
  • opencl: precompiled binary kernel loading (#23042); initial q1_0 support (#25160)
  • hexagon: flash-attention rework (#25085)
  • common/server: HF primary split as model path (#25194); bracketed IPv6 literals in URL authorities (#25140); SSE keepalive pings (#25241);
    cpp-httplib 0.49.0 (#25218)
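
For context on the bracketed IPv6 item (#25140): RFC 3986 requires an IPv6 literal in a URL authority to be wrapped in square brackets, so that the colons inside the address are not confused with the host/port separator. The sketch below shows that rule in isolation. It is not upstream's implementation; `HostPort` and `split_authority` are hypothetical names, and the function ignores userinfo and does not validate the address itself.

```cpp
#include <string>

// Hypothetical result type: port is left empty when the authority has none.
struct HostPort {
    std::string host;
    std::string port;
};

// Splits "host", "host:port", "[v6]" or "[v6]:port" into host and port.
// Returns false for a malformed bracketed form.
bool split_authority(const std::string& authority, HostPort& out) {
    if (!authority.empty() && authority[0] == '[') {
        const std::string::size_type close = authority.find(']');
        if (close == std::string::npos) {
            return false;  // opening bracket without a closing one
        }
        out.host = authority.substr(1, close - 1);  // address without brackets
        out.port.clear();
        if (close + 1 == authority.size()) {
            return true;  // "[v6]" with no port
        }
        if (authority[close + 1] != ':') {
            return false;  // only ":port" may follow the closing bracket
        }
        out.port = authority.substr(close + 2);
        return true;
    }
    // Non-bracketed host: the last colon, if any, separates the port.
    const std::string::size_type colon = authority.rfind(':');
    if (colon == std::string::npos) {
        out.host = authority;
        out.port.clear();
        return true;
    }
    out.host = authority.substr(0, colon);
    out.port = authority.substr(colon + 1);
    return true;
}
```

With this sketch, `[::1]:8080` splits into host `::1` and port `8080`, while `localhost:8080` takes the non-bracketed path.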

Verification

  • ✅ Clean rebuild from source (mix clean && mix compile)
  • ✅ Full test suite: 216 passed, 1 skipped (generation: Qwen3.5-0.8B, embeddings: Qwen3-Embedding-0.6B)
  • ✅ All 7 end-to-end smoke tests pass (generation, streaming, chat templates, JSON-schema grammar, raw GBNF, embeddings)
  • ✅ mix format --check-formatted: clean
  • ✅ Dialyzer: 0 errors

nyo16 merged commit 4746e98 into master on Jul 4, 2026
4 checks passed
nyo16 deleted the bump-llama-cpp-b9870 branch on July 4, 2026 00:39