Skip to content

Update tokenizers requirement from 0.22 to 0.23#110

Merged
gorzell merged 1 commit into
mainfrom
dependabot/cargo/tokenizers-0.23
Jun 1, 2026
Merged

Update tokenizers requirement from 0.22 to 0.23#110
gorzell merged 1 commit into
mainfrom
dependabot/cargo/tokenizers-0.23

Conversation

@dependabot
Copy link
Copy Markdown
Contributor

@dependabot dependabot Bot commented on behalf of github May 2, 2026

Updates the requirements on tokenizers to permit the latest version.

Release notes

Sourced from tokenizers's releases.

Release v0.23.1

TL;DR

tokenizers 0.23.1 is the first proper stable release in the 0.23 line — 0.23.0 only ever shipped as rc0 because the release pipeline itself was broken (Node side hadn't shipped multi-platform binaries since 2023, Python side was on pyo3 0.27 without free-threaded support). 0.23.1 is the version where everything actually goes out the door together: full Node multi-platform wheels for the first time in years, Python 3.14 (regular and free-threaded 3.14t), full type hints for every Python class, and a stack of measurable perf wins on the BPE / added-vocab hot paths.

There is no functional 0.23.0 published — we tag 0.23.1 directly so users don't accidentally pull a never-shipped version.


🚨 Breaking changes

  • Drop Python 3.9 (#1952) — requires-python = ">=3.10"; 3.9 users stay on 0.22.x.
  • add_tokens normalizes content at insertion (#1995) — re-saved tokenizer.json may differ in the added_tokens block. Existing files load unchanged.
  • Type stubs are precise (#1928, #1997) — methods that returned Any now return real types; mypy --strict may surface previously-hidden errors. Stub layout also moved from tokenizers/<sub>/__init__.pyi to tokenizers/<sub>.pyi. This breaks the surface of some of the processors like RobertaProcessign's __init__ .
  • 3.14t-only: setters/getters return PyResult<T> because of Arc<RwLock<Tokenizer>>; a poisoned lock surfaces as PyException instead of a panic.

⚡ Performance — measured locally on this Mac, not lifted from PRs

Run with cargo bench --bench <name> -- --save-baseline v0_22_2 on v0.22.2, then --baseline v0_22_2 on v0.23.1. Numbers are point-in-time wall clock on a single laptop; relative deltas are what matters, absolute numbers will differ on CI hardware.

Added-vocabulary deserialize — the headline win (#1995, #1999)

bench: improve added_vocab_deserialize to reflect real-world workloads (#2000) is now representative of how transformers actually loads tokenizer.json files. The combined effect of daachorse for the matching automaton plus the normalize-on-insert refactor is enormous on this workload:

benchmark v0.22.2 v0.23.1 change
100k tokens, special, no norm ~410 ms 248 ms −40%
100k tokens, non-special, no norm ~7.1 s 273 ms −96%
100k tokens, special, NFKC ~395 ms 235 ms −40%
100k tokens, non-special, NFKC ~7.4 s 290 ms −96%
400k tokens, special, no norm ~15 s 980 ms −94%

Real-world impact: loading a Llama-3-style tokenizer with a large set of added tokens dropped from "noticeable pause" to "instant".

BPE encode

benchmark v0.22.2 v0.23.1 change
BPE GPT2 encode batch, no cache 530 ms 446 ms −16%
BPE GPT2 encode batch (cached) 690 ms 685 ms noise
BPE GPT2 encode (single) 1.95 s 1.94 s noise
BPE Train (small) 32.6 ms 31.5 ms −3%
BPE Train (big) 1.01 s 988 ms −2%

The BPE per-thread cache PR (#2028) shows much larger wins on highly-parallel workloads (+47–62% at 88+ threads on a server box, per the PR's own measurements on Vera). Single-thread batch numbers above are flat or slightly improved because cache-hit overhead was already low without contention.

Llama-3 encode

... (truncated)

Commits

Note
Automatic rebases have been disabled on this pull request as it has been open for over 30 days.

@dependabot dependabot Bot added dependencies Pull requests that update a dependency file rust Pull requests that update Rust code labels May 2, 2026
@dependabot dependabot Bot requested a review from a team as a code owner May 2, 2026 05:03
@dependabot dependabot Bot added the rust Pull requests that update Rust code label May 2, 2026
@tclem
Copy link
Copy Markdown
Member

tclem commented May 28, 2026

@dependabot rebase

@dependabot dependabot Bot force-pushed the dependabot/cargo/tokenizers-0.23 branch from 69e91b1 to 2dad252 Compare May 28, 2026 18:46
@tclem
Copy link
Copy Markdown
Member

tclem commented May 29, 2026

@dependabot rebase

Updates the requirements on [tokenizers](https://github.com/huggingface/tokenizers) to permit the latest version.
- [Release notes](https://github.com/huggingface/tokenizers/releases)
- [Changelog](https://github.com/huggingface/tokenizers/blob/main/RELEASE.md)
- [Commits](huggingface/tokenizers@v0.22.0...v0.23.1)

---
updated-dependencies:
- dependency-name: tokenizers
  dependency-version: 0.23.1
  dependency-type: direct:production
...

Signed-off-by: dependabot[bot] <support@github.com>
@dependabot dependabot Bot force-pushed the dependabot/cargo/tokenizers-0.23 branch from 2dad252 to 6652a25 Compare May 29, 2026 03:59
@gorzell gorzell merged commit a4e53aa into main Jun 1, 2026
4 checks passed
@gorzell gorzell deleted the dependabot/cargo/tokenizers-0.23 branch June 1, 2026 10:09
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

dependencies Pull requests that update a dependency file rust Pull requests that update Rust code

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants