Load added tokens from GGUF metadata (fixes missing <think>/</think> on Qwen) by jaweed3 · Pull Request #3641 · huggingface/candle

jaweed3 · 2026-06-23T04:03:51Z

The tokenizer.ggml.added_tokens field in GGUF metadata was being ignored. Tokens like <think> and </think> used by Qwen models were not registered with the tokenizer, so they could not be encoded as single tokens.

This reads the field (when present) and registers each token as a special token via Tokenizer::add_special_tokens, matching the behavior of the Hugging Face tokenizers that generated the GGUF file.

Fixes #3473
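A rough sketch of the "read the field when present" step described above, using only the standard library. `MetaValue` and `read_added_tokens` are hypothetical stand-ins for illustration. They are not candle's GGUF value type or the function this PR changes. The only name taken from the PR is the metadata key.

```rust
use std::collections::HashMap;

/// Hypothetical stand-in for a parsed GGUF metadata value (not candle's real type).
#[derive(Debug, Clone, PartialEq)]
pub enum MetaValue {
    String(String),
    Array(Vec<MetaValue>),
    U32(u32),
}

/// Metadata key named in this PR.
pub const ADDED_TOKENS_KEY: &str = "tokenizer.ggml.added_tokens";

/// Collect the added-token strings from the metadata map.
/// Returns an empty list when the field is absent or is not an array,
/// so files without the field load exactly as before.
/// Non-string array entries are skipped (an assumption of this sketch).
pub fn read_added_tokens(metadata: &HashMap<String, MetaValue>) -> Vec<String> {
    match metadata.get(ADDED_TOKENS_KEY) {
        Some(MetaValue::Array(items)) => items
            .iter()
            .filter_map(|item| match item {
                MetaValue::String(s) => Some(s.clone()),
                _ => None,
            })
            .collect(),
        _ => Vec::new(),
    }
}
```

In the PR itself, the resulting strings would then be passed to Tokenizer::add_special_tokens. That call is left out here so the sketch stays dependency-free.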

Tests

added_tokens_are_loaded_as_special — verifies added tokens get vocab entries
added_tokens_encode_as_single_token — verifies "<think>" encodes as 1 token
no_added_tokens_does_not_fail — verifies absence of the field is harmless
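The property these tests check can be illustrated with a toy model. `ToyTokenizer` below is hypothetical and is not the `tokenizers` crate: it falls back to one token per character, which a real BPE model would not do, but it shows why an unregistered `<think>` splits into several pieces while a registered one stays whole.

```rust
/// Toy tokenizer for illustration only (NOT the `tokenizers` crate).
/// Unregistered text is split into one token per character.
pub struct ToyTokenizer {
    special: Vec<String>,
}

impl ToyTokenizer {
    pub fn new() -> Self {
        Self { special: Vec::new() }
    }

    /// Register tokens that must never be split. Returns how many were new.
    pub fn add_special_tokens(&mut self, tokens: &[&str]) -> usize {
        let mut added = 0;
        for t in tokens {
            if !t.is_empty() && !self.special.iter().any(|s| s.as_str() == *t) {
                self.special.push(t.to_string());
                added += 1;
            }
        }
        // Longest first, so a longer special token wins over a shorter prefix.
        self.special.sort_by(|a, b| b.len().cmp(&a.len()));
        added
    }

    /// Emit special tokens whole; everything else one character at a time.
    pub fn encode(&self, text: &str) -> Vec<String> {
        let mut out = Vec::new();
        let mut rest = text;
        while !rest.is_empty() {
            if let Some(tok) = self.special.iter().find(|s| rest.starts_with(s.as_str())) {
                out.push(tok.clone());
                rest = &rest[tok.len()..];
            } else {
                // `rest` is non-empty here, so there is always a next char.
                let ch = rest.chars().next().unwrap();
                out.push(ch.to_string());
                rest = &rest[ch.len_utf8()..];
            }
        }
        out
    }
}
```

With no registration, `encode("<think>")` yields several tokens in this toy model. After `add_special_tokens(&["<think>", "</think>"])` it yields one, which mirrors what added_tokens_encode_as_single_token asserts against the real tokenizer.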

GGUF files may include a tokenizer.ggml.added_tokens field containing tokens that were added after training (e.g. <think> and </think> for Qwen models). The from_gguf implementation was ignoring this field, so these tokens were invisible to the tokenizer.

Read tokenizer.ggml.added_tokens from the GGUF metadata and register them as special tokens on the tokenizer. This ensures they are treated as single tokens during encoding.

Fixes huggingface#3473

jaweed3 wants to merge 1 commit into huggingface:main from jaweed3:fix/added-tokens-gguf
