fix: clear prompt for recurrent / hybrid models when only a partial prefix matches #2108

Merged: abetlen merged 1 commit into abetlen:main from avion23:update-llama-cpp-2026-01 on Jun 1, 2026

@avion23
Contributor

@avion23 avion23 commented Jan 1, 2026

Summary

  • Cache recurrent and hybrid model detection separately in Llama.
  • Clear llama.cpp memory when resetting recurrent or hybrid models.
  • Force full prompt re-processing for recurrent or hybrid models when prompt history is edited instead of attempting partial prefix reuse.
  • Add repeated-prompt, recurrent, and hybrid regression coverage for prompt-cache behavior.
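The third bullet can be read as a small decision rule. The following is a minimal sketch of that rule, not the code merged in this PR: `ReusePlan`, `common_prefix_len`, `plan_prompt_reuse`, and the `is_recurrent_or_hybrid` flag are hypothetical names chosen for illustration, and the sketch ignores details such as re-evaluating the last token to obtain logits.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass(frozen=True)
class ReusePlan:
    """What to do with cached prompt state before evaluating a new prompt."""
    reset: bool      # clear all cached state first
    n_reused: int    # number of cached tokens that are kept
    n_to_eval: int   # number of new-prompt tokens that must be evaluated


def common_prefix_len(cached: Sequence[int], prompt: Sequence[int]) -> int:
    """Length of the longest shared prefix of two token sequences."""
    n = 0
    for a, b in zip(cached, prompt):
        if a != b:
            break
        n += 1
    return n


def plan_prompt_reuse(
    cached: Sequence[int],
    prompt: Sequence[int],
    is_recurrent_or_hybrid: bool,
) -> ReusePlan:
    n_match = common_prefix_len(cached, prompt)
    # "Partial" means some cached tokens are not part of the new prompt,
    # i.e. the history was edited rather than only appended to.
    partial = n_match < len(cached)
    if is_recurrent_or_hybrid and partial:
        # Assumption of this sketch: such models cannot drop trailing
        # tokens from their state, so start over and re-process everything.
        return ReusePlan(reset=True, n_reused=0, n_to_eval=len(prompt))
    # Append-only prompts (any model) and attention-only models keep the
    # matching prefix and evaluate only the remainder.
    return ReusePlan(reset=False, n_reused=n_match, n_to_eval=len(prompt) - n_match)
```

With this rule, append-only chat keeps the full cached prefix for every model type, and only an edited history on a recurrent or hybrid model pays for a full re-evaluation.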

Tests

  • python -m compileall llama_cpp/llama.py tests/test_llama.py
  • python -m pytest tests/test_llama.py::test_real_llama_repeated_prompt_cache tests/test_llama.py::test_recurrent_model_prompt_cache_reset tests/test_llama.py::test_hybrid_model_prompt_cache_reset -q

@avion23 avion23 marked this pull request as draft January 1, 2026 19:40
@avion23 avion23 force-pushed the update-llama-cpp-2026-01 branch from 502532a to 23c10e8 Compare January 1, 2026 19:50
@avion23 avion23 marked this pull request as ready for review January 1, 2026 19:52
@avion23
Contributor Author

avion23 commented Jan 1, 2026

Tested on macOS using CMAKE_ARGS="-DGGML_METAL=on" pip3.14 install --force-reinstall --no-cache-dir "llama-cpp-python @ git+https://github.com/avion23/llama-cpp-python.git@update-llama-cpp-2026-01" --break-system-packages

@dhdaines

dhdaines commented Jan 4, 2026

This will need at least one more (very important) change as the layout of mtmd_context_params has changed. It should be updated in mtmd_cpp.py to this:

from ctypes import Structure, c_bool, c_char_p, c_int

# Field order and types must match the C struct exactly.
class mtmd_context_params(Structure):
    _fields_ = [
        ("use_gpu", c_bool),
        ("print_timings", c_bool),
        ("n_threads", c_int),
        ("image_marker", c_char_p),
        ("media_marker", c_char_p),
        ("flash_attn_type", c_int),
        ("warmup", c_bool),
        ("image_min_tokens", c_int),
        ("image_max_tokens", c_int),
    ]
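Why a layout change matters can be shown with a small, self-contained ctypes example. `OldParams` and `NewParams` below are hypothetical structs made up for illustration, not the real mtmd_context_params; the point is only that inserting a field shifts the byte offset of every field after it, so a binding that still declares the old layout reads and writes the wrong bytes.

```python
from ctypes import Structure, c_bool, c_int, sizeof


class OldParams(Structure):
    # Hypothetical "before" layout.
    _fields_ = [
        ("use_gpu", c_bool),
        ("n_threads", c_int),
    ]


class NewParams(Structure):
    # Hypothetical "after" layout: a field was inserted in the middle.
    _fields_ = [
        ("use_gpu", c_bool),
        ("extra", c_int),
        ("n_threads", c_int),
    ]


# Byte offset of the same logical field under each declaration.
old_offset = OldParams.n_threads.offset
new_offset = NewParams.n_threads.offset

print("n_threads offset, old layout:", old_offset)
print("n_threads offset, new layout:", new_offset)
print("struct sizes:", sizeof(OldParams), sizeof(NewParams))
```

A mismatch like this produces no Python-level error; the C library simply interprets the bytes according to its own layout.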

@dhdaines

dhdaines commented Jan 4, 2026

More changes needed as the layout of llama_context_params has also changed... a new field flash_attn_type has been added after attention_type.

@dhdaines

dhdaines commented Jan 4, 2026

Also the flash_attn parameter no longer exists, has been replaced by flash_attn_type... the default is now to determine automatically whether to use it (as some models require it). This is unfortunately a breaking change, not sure if you want to preserve the flash_attn parameter in the higher-level Python API.
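One way to keep a boolean flash_attn argument in a higher-level API while the C side takes flash_attn_type is a small translation step. This is a sketch under stated assumptions, not the code in this PR: `FlashAttnType` and `resolve_flash_attn_type` are hypothetical names, and the integer values are assumed to mirror llama.cpp's llama_flash_attn_type enum and should be checked against the llama.h of the build in use.

```python
from enum import IntEnum
from typing import Optional


class FlashAttnType(IntEnum):
    # Assumed values; verify against llama.h before relying on them.
    AUTO = -1
    DISABLED = 0
    ENABLED = 1


def resolve_flash_attn_type(flash_attn: Optional[bool]) -> FlashAttnType:
    """Translate a legacy tri-state flash_attn argument.

    None  -> let the library decide (the new default behaviour)
    True  -> force flash attention on
    False -> force flash attention off
    """
    if flash_attn is None:
        return FlashAttnType.AUTO
    return FlashAttnType.ENABLED if flash_attn else FlashAttnType.DISABLED
```

Treating an omitted argument as "auto" keeps existing callers working while following the new upstream default.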

Comment thread llama_cpp/llama_cpp.py Outdated
@avion23 avion23 marked this pull request as draft January 4, 2026 13:41
@avion23 avion23 force-pushed the update-llama-cpp-2026-01 branch from 5042296 to d14a24f Compare January 4, 2026 13:42
@avion23
Contributor Author

avion23 commented Jan 4, 2026

@dhdaines thanks for the review. I need some time to incorporate your comments, so I'm setting this to draft in the meantime.

@avion23 avion23 force-pushed the update-llama-cpp-2026-01 branch 3 times, most recently from 64b087c to 3ffec02 Compare January 5, 2026 10:18
@avion23
Contributor Author

avion23 commented Jan 5, 2026

> Also the flash_attn parameter no longer exists, has been replaced by flash_attn_type... the default is now to determine automatically whether to use it (as some models require it). This is unfortunately a breaking change, not sure if you want to preserve the flash_attn parameter in the higher-level Python API.

I think I have fixed this, could you check?

@avion23 avion23 force-pushed the update-llama-cpp-2026-01 branch 2 times, most recently from 6dbddac to 39a2ee8 Compare January 5, 2026 14:35
@dhdaines

dhdaines commented Jan 5, 2026

> > Also the flash_attn parameter no longer exists, has been replaced by flash_attn_type... the default is now to determine automatically whether to use it (as some models require it). This is unfortunately a breaking change, not sure if you want to preserve the flash_attn parameter in the higher-level Python API.
>
> I think I have fixed this, could you check?

Yes, this looks to me like a good way to handle it! We can see what the maintainer @abetlen thinks though...

@avion23 avion23 force-pushed the update-llama-cpp-2026-01 branch from 39a2ee8 to 103f671 Compare January 6, 2026 19:17
@avion23 avion23 marked this pull request as ready for review January 6, 2026 19:22
@avion23
Contributor Author

avion23 commented Jan 6, 2026

My intention was to sweep in like a hero and save the day. Didn't work as planned :/

I've rewritten the PR: it is cleaner and has much less whitespace noise. All review comments are incorporated.

@oss-roettger

oss-roettger commented Jan 8, 2026

Thank you so much avion23 for your efforts to update the python bindings to a recent llama-cpp version!

I'm trying to use them in a Jupyter notebook (in Docker) on an Nvidia 5090 GPU. The latest locally built llama-cli version runs in that same environment (see attached llama-cli.txt) and the problems discussed above are gone, but the freshly built bindings crash the kernel when loading models to the GPU (after the weights are loaded to the GPU, so maybe a context issue; see attached build.txt).

I'm fairly sure the mistake is mine, made when installing your branch with GPU support:
!CMAKE_ARGS="-DGGML_CUDA=ON -DCMAKE_CUDA_ARCHITECTURES=86";pip install --force-reinstall --upgrade git+https://github.com/avion23/llama-cpp-python@update-llama-cpp-2026-01

Any ideas what I did wrong?!

Edit (New Findings): The above GPU build works with n_gpu_layers=0 (CPU only). This narrows down the problem to context handling in the GPU code path.
Edit2: Very, very strange: after switching back to n_gpu_layers=100 (from n_gpu_layers=0) I was able to load and successfully run the new Nemotron-Nano-3-30B-A3B-Q4_K_M.gguf and Ling-mini-2.0.Q4_K_M.gguf models on GPU, on the same build that previously always(!) crashed the kernel while loading the models. Could it be that some context initialization code runs only in CPU mode but is also needed for GPU mode?!

@oss-roettger

@abetlen Thank you for your work! Please keep this repo alive and merge avion23's updates into the main branch.
These are working now in my RTX 5090 CUDA environment. Recent llama.cpp not only supports additional models, it also offers significant performance gains (e.g. GPT OSS 20 NVFP4 +50% more tokens/s compared to the October versions in your repo wheels).

@avion23 avion23 force-pushed the update-llama-cpp-2026-01 branch from e351642 to 235a3d4 Compare January 12, 2026 06:03
@avion23
Contributor Author

avion23 commented Jan 12, 2026

@oss-roettger thank you for all the testing. I found a bug with flash_attn which caused your error. Could you retest?

@avion23 avion23 force-pushed the update-llama-cpp-2026-01 branch from 235a3d4 to 17aae47 Compare January 12, 2026 06:35
@oss-roettger

oss-roettger commented Jan 12, 2026

@avion23 once again respect for your dedication. I have tested the new version after building it with

!CMAKE_ARGS="-DGGML_CUDA=on -DCMAKE_CUDA_ARCHITECTURES=86";pip install --force-reinstall --upgrade git+https://github.com/avion23/llama-cpp-python@update-llama-cpp-2026-01

Good news first:
The update runs out of the box with every value of flash_attn in the constructor: None, True, False, or omitted.

But: I think I have discovered an additional issue (with & without your latest update; on GPU & CPU):
https://huggingface.co/ggml-org/Nemotron-Nano-3-30B-A3B-GGUF produces a KV cache issue on the second dialog turn. I guess there is a cache initialization parameter missing in the python bindings, since the llama-cli command of the same llama build (=same libllama.so) works on multi-turn dialogs with the same Nemotron model.
(see Llama_test.txt for minimal code to reproduce the error)

Edit: Log added:
log.txt

@avion23
Contributor Author

avion23 commented Jan 12, 2026 via email

avion23 pushed a commit to avion23/llama-cpp-python that referenced this pull request Jan 14, 2026
After external code review (GPT-5.2), fixed 4 critical issues:

1. CRITICAL: Fixed tokens[:-1] bug in prefix matching
   - Was silently breaking prefix matching for ALL models
   - Caused false rewind detection and cache inefficiency
   - Impact: Transformers AND recurrent models

2. CRITICAL: Implement proper reset() for recurrent models
   - Now actually clears llama_memory backend state
   - Root cause fix for 'sequence positions not consecutive' crash
   - Without this, reset was a no-op for recurrent models

3. CRITICAL: Enforce strict append policy for recurrent models
   - Prevents KV cache rewinding that's impossible without state snapshots
   - Forces full reset on history edits instead of crashing

4. Performance: Cache _is_recurrent to avoid repeated FFI calls

5. Documentation: Simplified comments and updated docstring

6. Testing: All existing tests pass + Mistral-Small-3.2-24B validated

Resolves multi-turn crashes for Nemotron-A3B, Mamba, RWKV, Jamba models.

Reviewed-by: GPT-5.2 (OpenAI)
Tested-by: pytest + Mistral-Small-3.2-24B
Fixes: abetlen#2108 (recurrent model crashes)
Compatible-with: abetlen#2109 (Granite-Docling/SmolVLM special tokens)
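Point 3 of the commit message rests on a difference between the two kinds of cached state. The toy model below illustrates that difference only; `ToyKVCache` and `ToyRecurrentState` are hypothetical classes, and the arithmetic in the recurrent update is an arbitrary stand-in for a real state update, not anything llama.cpp does.

```python
from typing import Iterable, List


class ToyKVCache:
    """Attention-style cache: one entry per position, so it can be truncated."""

    def __init__(self) -> None:
        self.entries: List[int] = []

    def extend(self, tokens: Iterable[int]) -> None:
        self.entries.extend(tokens)

    def rewind(self, n_keep: int) -> None:
        # Dropping trailing positions leaves a valid cache for the prefix.
        del self.entries[n_keep:]


class ToyRecurrentState:
    """Recurrent-style state: one value folded over every token seen so far."""

    def __init__(self) -> None:
        self.state = 0
        self.n_tokens = 0

    def extend(self, tokens: Iterable[int]) -> None:
        for token in tokens:
            # Each update overwrites the previous state; nothing per-position
            # is kept, so earlier states cannot be recovered afterwards.
            self.state = (self.state * 31 + token) % 1_000_003
            self.n_tokens += 1

    def rewind(self, n_keep: int) -> None:
        if n_keep != self.n_tokens:
            raise ValueError("cannot rewind without a snapshot; reset and replay")

    def reset(self) -> None:
        self.state = 0
        self.n_tokens = 0
```

In this toy model an edited history is handled by calling `reset()` and replaying the whole new prompt, which is the "full reset on history edits" behaviour the commit message describes.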
@avion23
Contributor Author

avion23 commented Jan 14, 2026

I've implemented a fix for recurrent/hybrid models (Nemotron-A3B, Mamba, RWKV, Jamba)
that prevents "sequence positions not consecutive" crashes during multi-turn conversations.
The fix preserves full speed for normal append-only chat and only triggers reset when
history is edited. Verified compatible with #2109.

It's a bit of scope creep though. The diff is becoming huge even though I only adapted some C bindings.

@avion23 avion23 marked this pull request as ready for review January 14, 2026 02:21
@avion23
Contributor Author

avion23 commented Jan 19, 2026

Thank you for retesting this. I am using it daily on an Apple M4 Max and it's working well enough.

@abetlen Is there something I can improve so you can merge this with a good conscience?

@antheas

antheas commented Jan 19, 2026

I vouch for this, thanks @avion23. Saved my bacon on dottxt-ai/outlines#1812. I will test a few more models and if anything pops up will report back.

TheBigEye added a commit to TheBigEye/guanaco-py that referenced this pull request Jan 24, 2026
@bartwesthoff-fyrm

@avion23 Do you have a notebook for testing? Can't seem to run nemotron yet but it most probably is a mistake on my end.

@avion23
Contributor Author

avion23 commented Feb 4, 2026

@bartwesthoff-fyrm The PR is stable, but Nemotron is a tricky model (hybrid architecture) that requires specific initialization parameters to run correctly: n_batch=512, n_ubatch=512, flash_attn=True. A vibe-coded snippet is attached: test_pr_2108.py
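For readers without the attachment, the parameters named above could be passed roughly as follows. This is a hedged sketch and not the contents of test_pr_2108.py: `nemotron_llama_kwargs` and `load_nemotron` are hypothetical helpers, the model path is a placeholder, and it assumes the installed llama-cpp-python build accepts these constructor arguments.

```python
from typing import Any, Dict


def nemotron_llama_kwargs(model_path: str) -> Dict[str, Any]:
    """Constructor arguments using the values suggested in this thread."""
    return {
        "model_path": model_path,
        "n_batch": 512,
        "n_ubatch": 512,
        "flash_attn": True,
    }


def load_nemotron(model_path: str):
    # Imported lazily so the helper above can be used without the package.
    from llama_cpp import Llama

    return Llama(**nemotron_llama_kwargs(model_path))


# Example call (placeholder path, requires a local GGUF file):
# llm = load_nemotron("path/to/Nemotron-Nano-3-30B-A3B-Q4_K_M.gguf")
```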

This PR might never be merged; the project seems to be abandonware. Have a look at https://github.com/TheBigEye/guanaco-py

@bartwesthoff-fyrm

@avion23 worked perfectly with the new repository. Thank you for your helpful responses in this PR.

@mxbi

mxbi commented Feb 10, 2026

For anyone else in this boat, I can confirm https://github.com/TheBigEye/guanaco-py v0.5.0 works great. Thanks @TheBigEye!

@avion23 avion23 force-pushed the update-llama-cpp-2026-01 branch from f427399 to e9a538f Compare April 4, 2026 06:21
avion23 pushed a commit to avion23/llama-cpp-python that referenced this pull request Apr 4, 2026
@avion23 avion23 force-pushed the update-llama-cpp-2026-01 branch 2 times, most recently from 4d4c571 to f9dd86c Compare April 4, 2026 07:41
@avion23
Contributor Author

avion23 commented Apr 5, 2026

Upstream was updated and the submodule is updated. Most changes are upstream. This MR is not strictly necessary.

I rebased and only kept the functionality which is not in upstream yet.

@abetlen abetlen changed the title from "Update to llama.cpp 2026-01-01" to "fix: clear prompt for recurrent / hybrid models" Jun 1, 2026
@abetlen abetlen force-pushed the update-llama-cpp-2026-01 branch 5 times, most recently from 98f8bfa to 1a4678f Compare June 1, 2026 02:58
@abetlen
Owner

abetlen commented Jun 1, 2026

@avion23 thank you for the contribution. I extracted just the hybrid / recurrent prompt re-use fix and added some tests for those types of models.

@abetlen abetlen changed the title from "fix: clear prompt for recurrent / hybrid models" to "fix: clear prompt for recurrent / hybrid models when only a partial prefix matches" Jun 1, 2026
@abetlen abetlen force-pushed the update-llama-cpp-2026-01 branch from 1a4678f to 08fb954 Compare June 1, 2026 03:05
@abetlen abetlen merged commit cdb7a75 into abetlen:main Jun 1, 2026

7 participants