פָּת֤וֹת אֹתָהּ֙ פִּתִּ֔ים וְיָצַקְתָּ֥ עָלֶ֖יהָ שָׁ֑מֶן מִנְחָ֖ה הִֽוא
Patot is a Python toolkit for Hebrew/English-aware semantic chunking. It chunks texts into semantically coherent, model-ready units for downstream AI workflows such as embedding, retrieval, and question answering.
Sefaria is an open-source library and database of Jewish texts. To download Sefaria data for use with Patot, visit developers.sefaria.org.
Working with long-form Jewish texts in AI systems requires balancing three goals:
- Preserve meaning — chunks should follow shifts in topic, not arbitrary character counts.
- Preserve structure — Sefaria's segment boundaries are meaningful and should be respected.
- Respect model limits — every chunk must fit the embedding model's token constraints.
Patot implements a practical multi-pass chunking pipeline to satisfy all three.
Patot processes one Sefaria section at a time.
- Input: ordered Sefaria segments for a single section.
- Chunks may combine adjacent segments within that section.
- Chunks never cross section boundaries.
Patot uses Aurelio Labs' semantic-chunkers library (specifically StatisticalChunker) over the ordered segment list.
- Each Sefaria segment is treated as an atomic unit.
- Segment embeddings are compared against local semantic context.
- Split points are chosen where semantic continuity drops.
- Semantic continuities/drops are based on comparing Gemini embeddings of the adjacent windows
- The chunker auto-selects thresholds to keep median chunk size near a configured token target.
Result: coherent multi-segment chunks where appropriate, while preserving original segment boundaries.
Only singleton segments left untouched by Pass 1 are eligible for further splitting.
- The segment is split into sentence/clause units.
- The same statistical chunking method is applied inside that segment.
Result: every output chunk is either a combination of whole Sefaria segments, or a subdivision of one single segment. Patot never mixes partial text from multiple segments in one chunk.
Semantic chunking optimizes coherence but does not strictly guarantee maximum token size compliance. Patot performs a final validation against max_split_tokens.
- Oversized multi-segment chunks are split on segment boundaries.
- Oversized single-segment chunks are split into fixed token windows as a final fallback.
Result: all chunks are semantically informed and model-safe.
- No cross-section chunking
- No splitting inside grouped multi-segment chunks after Pass 1
- No chunk containing partial text from multiple segments
- All output chunks bounded by
max_split_tokens
Given one section with ordered segments:
- Segments 1–3 cover one topic → grouped together.
- Segment 4 shifts topic → split into a new chunk.
- Segment 5 is very long and standalone → internally chunked by sentence/clause semantics.
- Any chunk exceeding hard token limits → split safely in final enforcement.
Patot is suitable for:
- RAG pipelines over Sefaria corpora
- Source-aware Q&A assistants
- Curriculum and class prep tools
- Topic clustering and thematic analysis
- Source sheet recommendation workflows
Patot is not on PyPI. Install directly from GitHub:
pip install "patot[chunking,pdf] @ git+https://github.com/Sefaria/patot@v0.1.0"- Core install (
pip install patot) provides JSON segment loading and the Gemini embedding client/cache. [chunking]adds the statistical chunker (PatotChunker), which depends ontransformers,huggingface-hub,semantic-chunkers, andsemantic-router.[pdf]addspatot.debug_reportfor rendering chunking debug traces to PDF, which depends onreportlabandpython-bidi.
from patot import ChunkerConfig, PatotChunker, load_segment_records_from_section
config = ChunkerConfig(debug=False)
chunker = PatotChunker(api_key="...", config=config) # Google Gemini API key
result = chunker.chunk_segments(segment_records)pip install -e ".[chunking,pdf]"
pip install pytest
pytestPatot's semantic-first strategy is built on semantic-chunkers by Aurelio Labs.
- Repo: https://github.com/aurelio-labs/semantic-chunkers
- Intro notebook: https://github.com/aurelio-labs/semantic-chunkers/blob/main/docs/00-chunkers-intro.ipynb
GPL-3.0
