yomail extracts the message content from Japanese business emails. It uses a CRF (Conditional Random Field) model to classify each line, then assembles the content from labeled lines.
- Handles formal and informal Japanese business emails
- Extracts greeting, body, and closing as unified message content
- Excludes signatures and trailing/leading quoted content
- Works with forwarded emails, replies, and inline quotes
- Returns confidence scores for quality control
- Small model size (12 KB)
- Fast inference (~10-30ms)
pip install yomail
Requires Python 3.12+.
from yomail import EmailBodyExtractor
extractor = EmailBodyExtractor()
# Raises on failure
body = extractor.extract(email_text)
# Returns None on failure
body = extractor.extract_safe(email_text)
# Full result with metadata
result = extractor.extract_with_metadata(email_text)
print(result.body)
print(result.confidence)
print(result.signature_detected)
Input:
株式会社サンプル
田中様
お世話になっております。
山田です。
先日ご依頼いただいた資料を添付いたします。
ご確認のほどよろしくお願いいたします。
以上
--
山田太郎
株式会社テスト
TEL: 03-1234-5678
Output:
お世話になっております。
山田です。
先日ご依頼いただいた資料を添付いたします。
ご確認のほどよろしくお願いいたします。
以上
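The `confidence` value returned by `extract_with_metadata` can be used to send uncertain extractions to manual review. The following is a minimal sketch, not part of yomail: `route`, `Routed`, and the `0.8` cut-off are hypothetical names and values chosen for illustration, and the only assumption about the result object is that it exposes `body` and `confidence` as shown above.

```python
from dataclasses import dataclass
from typing import Any

# Hypothetical cut-off for illustration; tune it on your own data.
REVIEW_THRESHOLD = 0.8


@dataclass
class Routed:
    body: str
    needs_review: bool


def route(extractor: Any, email_text: str, threshold: float = REVIEW_THRESHOLD) -> Routed:
    """Extract the body and flag results whose confidence is below the threshold."""
    result = extractor.extract_with_metadata(email_text)
    return Routed(body=result.body, needs_review=result.confidence < threshold)
```

Passing the extractor in as an argument keeps the helper easy to test with a stand-in object and avoids constructing the model more than once.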
The extraction pipeline:
- Normalize — Line endings, neologdn normalization, NFKC
- Analyze structure — Quote depth, forward/reply headers, delimiters
- Extract features — Position, character ratios, pattern matches
- Label with CRF — GREETING, BODY, CLOSING, SIGNATURE, QUOTE, OTHER
- Assemble body — Find signature boundary, handle inline quotes, merge blocks
See ARCHITECTURE.md for details and API.md for the full API reference.
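To give a feel for the first three pipeline steps, here is a rough standard-library sketch. It is not yomail's implementation: the function names `normalize`, `quote_depth`, and `japanese_ratio` are hypothetical, the neologdn step is left out to keep the example dependency-free, and the character ratio shown is only one plausible feature of this kind.

```python
import re
import unicodedata

# Leading run of whitespace and '>' markers, as used in plain-text replies.
QUOTE_PREFIX = re.compile(r"^[\s>]*")


def normalize(text: str) -> str:
    """Unify line endings to LF and apply NFKC (neologdn is omitted in this sketch)."""
    text = text.replace("\r\n", "\n").replace("\r", "\n")
    return unicodedata.normalize("NFKC", text)


def quote_depth(line: str) -> int:
    """Count the '>' markers at the start of a line; '> > text' has depth 2."""
    prefix = QUOTE_PREFIX.match(line).group(0)
    return prefix.count(">")


def japanese_ratio(line: str) -> float:
    """Share of non-space characters that are kana or CJK ideographs."""
    chars = [c for c in line if not c.isspace()]
    if not chars:
        return 0.0
    japanese = sum(
        1
        for c in chars
        # Hiragana and katakana blocks, then CJK Unified Ideographs.
        if "\u3040" <= c <= "\u30ff" or "\u4e00" <= c <= "\u9fff"
    )
    return japanese / len(chars)
```

Signature lines such as `TEL: 03-1234-5678` tend to score low on a ratio like this, while greeting and body lines score high, which is why features of this kind can help separate the two.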
| Label | Description |
|---|---|
| GREETING | Opening (お世話になっております) |
| BODY | Main content |
| CLOSING | Closing (よろしくお願いいたします) |
| SIGNATURE | Sender information |
| QUOTE | Quoted content |
| OTHER | Separators, blank lines |
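As an illustration of how these labels can drive assembly, the sketch below keeps GREETING, BODY, and CLOSING lines up to the first SIGNATURE line. This is a deliberate simplification and an assumption on our part: the `Label` enum and `assemble` function are hypothetical, every QUOTE line is dropped, and the real assembler also handles inline quotes and block merging.

```python
from enum import Enum
from typing import Sequence


class Label(str, Enum):
    GREETING = "GREETING"
    BODY = "BODY"
    CLOSING = "CLOSING"
    SIGNATURE = "SIGNATURE"
    QUOTE = "QUOTE"
    OTHER = "OTHER"


# Labels whose lines are treated as message content in this sketch.
CONTENT_LABELS = {Label.GREETING, Label.BODY, Label.CLOSING}


def assemble(lines: Sequence[str], labels: Sequence[Label]) -> str:
    """Join content lines that appear before the first SIGNATURE line."""
    kept = []
    for line, label in zip(lines, labels):
        if label is Label.SIGNATURE:
            break  # Simplification: the first signature line is the boundary.
        if label in CONTENT_LABELS:
            kept.append(line)
    return "\n".join(kept)
```

With hand-assigned labels, a greeting, a body line, and a closing followed by a `--` separator and a name would come out as the three content lines joined by newlines.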
Evaluated on 19,642 synthetic test emails:
| Metric | Value |
|---|---|
| Content match | 97.9% |
| Acceptable rate | 98.0% |
| Confident wrong | 0.14% |
See PERFORMANCE.md for details.
from yomail import (
ExtractionError, # Base class
InvalidInputError, # Empty or invalid input
NoBodyDetectedError, # No body found
LowConfidenceError, # Confidence below threshold
)
extractor = EmailBodyExtractor(
model_path="path/to/model.crfsuite", # Custom model
confidence_threshold=0.5, # Minimum confidence
)
# Setup
uv sync
# Run tests
uv run pytest
# Type check
uv run ty check
# Lint
uv run ruff check .
Training data is generated by the yasumail project.
# Train model
python scripts/train.py data/training.jsonl -o models/email_body.crfsuite
# Evaluate
python scripts/evaluate.py data/test.jsonl
- neologdn: Japanese text normalization
- python-crfsuite: CRF implementation
- PyYAML: Name data loading