fix: clear prompt for recurrent / hybrid models when only a partial prefix matches#2108
Conversation
502532a to
23c10e8
Compare
|
Tested on macos using CMAKE_ARGS="-DGGML_METAL=on" pip3.14 install --force-reinstall --no-cache-dir "llama-cpp-python @ git+https://github.com/avion23/llama-cpp-python.git@update-llama-cpp-2026-01" --break-system-packages |
|
This will need at least one more (very important) change as the layout of class mtmd_context_params(Structure):
_fields_ = [
("use_gpu", c_bool),
("print_timings", c_bool),
("n_threads", c_int),
("image_marker", c_char_p),
("media_marker", c_char_p),
("flash_attn_type", c_int),
("warmup", c_bool),
("image_min_tokens", c_int),
("image_max_tokens", c_int),
] |
|
More changes needed as the layout of |
|
Also the |
23c10e8 to
a070f61
Compare
5042296 to
d14a24f
Compare
|
@dhdaines thanks for the review, I need some time to incorporate your comments, setting to draft in the meantime |
64b087c to
3ffec02
Compare
I think I have fixed this, could you check? |
6dbddac to
39a2ee8
Compare
Yes, this looks to me like a good way to handle it! We can see what the maintainer @abetlen thinks though... |
39a2ee8 to
103f671
Compare
|
My intention was to sweep in like a hero and save the day. Didn't work as planned :/ I've rewritten the PR, much less whitespace noise, and cleaner. All review comments are incorporated. |
|
Thank you so much avion23 for your efforts to update the python bindings to a recent llama-cpp version! I'm trying to use them in a Jupyter notebook (in Docker) on a Nvidia 5090 GPU. Although the latest locally build llama-cli version is running in that same environment (see attached llama-cli.txt) and the above considered problems are gone, the freshly build bindings produce a kernel crash when loading models to GPU (after loading weights to GPU, maybe a context issue, see attached build.txt). I'm pretty sure, it can be my mistake when installing your branch for GPU support: Any ideas what I did wrong?! Edit (New Findings): The above GPU build works with n_gpu_layers=0 (CPU only). This narrows down the problem to context handling in the GPU context code path. |
|
@abetlen Thank you for your work! Please keep this repo alive and merge avion23's updates into the main branch. |
e351642 to
235a3d4
Compare
|
@oss-roettger thank you for all the testing. I found a bug with |
235a3d4 to
17aae47
Compare
|
@avion23 once again respect for your dedication. I have tested the new version after building it with !CMAKE_ARGS="-DGGML_CUDA=on -DCMAKE_CUDA_ARCHITECTURES=86";pip install --force-reinstall --upgrade git+https://github.com/avion23/llama-cpp-python@update-llama-cpp-2026-01 Good news first: But: I think I have discovered an additional issue (with & without your latest update; on GPU & CPU): Edit: Log added: |
|
Thank you,
Could You attach the log as before? Then it'll be quicker for me and you
get a nice solution.
Let's just pray this gets merged, i don't want to maintain my fork.
…On Mon 12. Jan 2026 at 18:29, oss-roettger ***@***.***> wrote:
*oss-roettger* left a comment (abetlen/llama-cpp-python#2108)
<#2108 (comment)>
@avion23 <https://github.com/avion23> once again respect for your
dedication. I have tested the new version after building it with
!CMAKE_ARGS="-DGGML_CUDA=on -DCMAKE_CUDA_ARCHITECTURES=86";pip install
--force-reinstall --upgrade git+
***@***.***
Good news first:
The update runs out of the box (with all values for *flash_attn* in the
constructor flash_attn=None/True/False/not present)
But: I think I have discovered an *additional issue* (with & without your
latest update; on GPU & CPU):
https://huggingface.co/ggml-org/Nemotron-Nano-3-30B-A3B-GGUF produces a
KV cache issue on the second dialog turn. I guess there is a cache
initialization parameter missing in the python bindings, since the
llama-cli command of the same llama build (=same libllama.so) works on
multi-turn dialogs with the same Nemotron model.
(see Llama_test.txt
<https://github.com/user-attachments/files/24562603/Llama_test.txt> for
minimum code to reproduce the error)
—
Reply to this email directly, view it on GitHub
<#2108 (comment)>,
or unsubscribe
<https://github.com/notifications/unsubscribe-auth/ACTGWLHPMSGS7UVCWIRBTY34GOAP5AVCNFSM6AAAAACQO5EZGKVHI2DSMVQWIX3LMV43OSLTON2WKQ3PNVWWK3TUHMZTOMZYGEYDCNRRGI>
.
You are receiving this because you were mentioned.Message ID:
***@***.***>
|
After external code review (GPT-5.2), fixed 4 critical issues: 1. CRITICAL: Fixed tokens[:-1] bug in prefix matching - Was silently breaking prefix matching for ALL models - Caused false rewind detection and cache inefficiency - Impact: Transformers AND recurrent models 2. CRITICAL: Implement proper reset() for recurrent models - Now actually clears llama_memory backend state - Root cause fix for 'sequence positions not consecutive' crash - Without this, reset was a no-op for recurrent models 3. CRITICAL: Enforce strict append policy for recurrent models - Prevents KV cache rewinding that's impossible without state snapshots - Forces full reset on history edits instead of crashing 4. Performance: Cache _is_recurrent to avoid repeated FFI calls 5. Documentation: Simplified comments and updated docstring 6. Testing: All existing tests pass + Mistral-Small-3.2-24B validated Resolves multi-turn crashes for Nemotron-A3B, Mamba, RWKV, Jamba models. Reviewed-by: GPT-5.2 (OpenAI) Tested-by: pytest + Mistral-Small-3.2-24B Fixes: abetlen#2108 (recurrent model crashes) Compatible-with: abetlen#2109 (Granite-Docling/SmolVLM special tokens)
|
I've implemented a fix for recurrent/hybrid models (Nemotron-A3B, Mamba, RWKV, Jamba) It's a bit of scope creep though. Diff is becoming huge even though I only adapted some c bindings. |
|
👍 @avion23: Thank you very much on behalf of the community! Tests successfully passed with Nemotron-3-Nano-30B-A3B on RTX5090 (CUDA). ✅ https://huggingface.co/unsloth/Nemotron-3-Nano-30B-A3B-GGUF/blob/main/Nemotron-3-Nano-30B-A3B-Q4_K_M.gguf Also successfully retested on RTX5090 (CUDA): ✅ https://huggingface.co/bartowski/openai_gpt-oss-20b-GGUF?show_file_info=openai_gpt-oss-20b-Q4_K_M.gguf 🙋 @abetlen: Please appreciate the extensive work of @avion23 and merge it into the main branch. Thank you! |
|
Thank you for retesting this. I am using it daily on Apple M4 Max and it's working good enough. @abetlen Is there something I can improve so you can merge this with a good conscience? |
|
I vouch for this, thanks @avion23. Saved my bacon on dottxt-ai/outlines#1812. I will test a few more models and if anything pops up will report back. |
|
@avion23 Do you have a notebook for testing? Can't seem to run nemotron yet but it most probably is a mistake on my end. |
|
@bartwesthoff-fyrm The PR is stable, but Nemotron is a tricky model (Hybrid architecture) that requires specific initialization parameters to run correctly. n_batch=512, n_ubatch=512, flash_attn=True. Vibe coded snippet attached, test_pr_2108.py This pr might never be merged, project seems abandonware. Have a look at https://github.com/TheBigEye/guanaco-py |
|
@avion23 worked perfectly with the new repository. Thank you for your helpfull responses in this PR. |
|
For anyone else in this boat, I can confirm https://github.com/TheBigEye/guanaco-py v0.5.0 works great. Thanks @TheBigEye! |
f427399 to
e9a538f
Compare
After external code review (GPT-5.2), fixed 4 critical issues: 1. CRITICAL: Fixed tokens[:-1] bug in prefix matching - Was silently breaking prefix matching for ALL models - Caused false rewind detection and cache inefficiency - Impact: Transformers AND recurrent models 2. CRITICAL: Implement proper reset() for recurrent models - Now actually clears llama_memory backend state - Root cause fix for 'sequence positions not consecutive' crash - Without this, reset was a no-op for recurrent models 3. CRITICAL: Enforce strict append policy for recurrent models - Prevents KV cache rewinding that's impossible without state snapshots - Forces full reset on history edits instead of crashing 4. Performance: Cache _is_recurrent to avoid repeated FFI calls 5. Documentation: Simplified comments and updated docstring 6. Testing: All existing tests pass + Mistral-Small-3.2-24B validated Resolves multi-turn crashes for Nemotron-A3B, Mamba, RWKV, Jamba models. Reviewed-by: GPT-5.2 (OpenAI) Tested-by: pytest + Mistral-Small-3.2-24B Fixes: abetlen#2108 (recurrent model crashes) Compatible-with: abetlen#2109 (Granite-Docling/SmolVLM special tokens)
After external code review (GPT-5.2), fixed 4 critical issues: 1. CRITICAL: Fixed tokens[:-1] bug in prefix matching - Was silently breaking prefix matching for ALL models - Caused false rewind detection and cache inefficiency - Impact: Transformers AND recurrent models 2. CRITICAL: Implement proper reset() for recurrent models - Now actually clears llama_memory backend state - Root cause fix for 'sequence positions not consecutive' crash - Without this, reset was a no-op for recurrent models 3. CRITICAL: Enforce strict append policy for recurrent models - Prevents KV cache rewinding that's impossible without state snapshots - Forces full reset on history edits instead of crashing 4. Performance: Cache _is_recurrent to avoid repeated FFI calls 5. Documentation: Simplified comments and updated docstring 6. Testing: All existing tests pass + Mistral-Small-3.2-24B validated Resolves multi-turn crashes for Nemotron-A3B, Mamba, RWKV, Jamba models. Reviewed-by: GPT-5.2 (OpenAI) Tested-by: pytest + Mistral-Small-3.2-24B Fixes: abetlen#2108 (recurrent model crashes) Compatible-with: abetlen#2109 (Granite-Docling/SmolVLM special tokens)
4d4c571 to
f9dd86c
Compare
|
Upstream was updated and the submodule is updated. Most changes are upstream. This MR is not strictly necessary. I rebased and only kept the functionality which is not in upstream yet. |
98f8bfa to
1a4678f
Compare
|
@avion23 thank you for the contribution, i extracted just the hybrid / recurrent prompt re-use fix and added some tests for those types of models. |
1a4678f to
08fb954
Compare
Summary
Llama.Tests
python -m compileall llama_cpp/llama.py tests/test_llama.pypython -m pytest tests/test_llama.py::test_real_llama_repeated_prompt_cache tests/test_llama.py::test_recurrent_model_prompt_cache_reset tests/test_llama.py::test_hybrid_model_prompt_cache_reset -q