terallite/yomail

yomail (読メール)

yomail extracts the message content from Japanese business emails. It uses a CRF (Conditional Random Field) model to classify each line, then assembles the content from labeled lines.

Features

  • Handles formal and informal Japanese business emails
  • Extracts greeting, body, and closing as unified message content
  • Excludes signatures and trailing/leading quoted content
  • Works with forwarded emails, replies, and inline quotes
  • Returns confidence scores for quality control
  • Small model size (12 KB)
  • Fast inference (~10-30ms)

Installation

pip install yomail

Requires Python 3.12+.

Usage

from yomail import EmailBodyExtractor

extractor = EmailBodyExtractor()

# Raises on failure
body = extractor.extract(email_text)

# Returns None on failure
body = extractor.extract_safe(email_text)

# Full result with metadata
result = extractor.extract_with_metadata(email_text)
print(result.body)
print(result.confidence)
print(result.signature_detected)

Example

Input:

株式会社サンプル
田中様

お世話になっております。
山田です。

先日ご依頼いただいた資料を添付いたします。
ご確認のほどよろしくお願いいたします。

以上

--
山田太郎
株式会社テスト
TEL: 03-1234-5678

Output:

お世話になっております。
山田です。

先日ご依頼いただいた資料を添付いたします。
ご確認のほどよろしくお願いいたします。

以上

How It Works

The extraction pipeline:

  1. Normalize — Line endings, neologdn normalization, NFKC
  2. Analyze structure — Quote depth, forward/reply headers, delimiters
  3. Extract features — Position, character ratios, pattern matches
  4. Label with CRF — GREETING, BODY, CLOSING, SIGNATURE, QUOTE, OTHER
  5. Assemble body — Find signature boundary, handle inline quotes, merge blocks
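
The first three steps can be illustrated with the standard library alone. This is a rough sketch under stated assumptions, not yomail's actual implementation: the function names, the regular expression, and the chosen Unicode ranges are illustrative, and the neologdn step is left out because it needs a third-party package.

```python
import re
import unicodedata

# After NFKC, the full-width '＞' has already been folded to ASCII '>'.
_QUOTE_PREFIX = re.compile(r"^(?:\s*>)+")


def normalize(text: str) -> str:
    """Unify line endings and apply NFKC (neologdn step omitted here)."""
    text = text.replace("\r\n", "\n").replace("\r", "\n")
    return unicodedata.normalize("NFKC", text)


def quote_depth(line: str) -> int:
    """Count leading '>' markers, so '> > text' has depth 2."""
    match = _QUOTE_PREFIX.match(line)
    return match.group(0).count(">") if match else 0


def char_ratios(line: str) -> dict[str, float]:
    """Share of kana, kanji, and ASCII digits among non-space characters."""
    chars = [c for c in line if not c.isspace()]
    if not chars:
        return {"kana": 0.0, "kanji": 0.0, "digit": 0.0}
    total = len(chars)
    kana = sum("\u3040" <= c <= "\u30ff" for c in chars)   # hiragana + katakana
    kanji = sum("\u4e00" <= c <= "\u9fff" for c in chars)  # CJK unified ideographs
    digit = sum(c.isascii() and c.isdigit() for c in chars)
    return {"kana": kana / total, "kanji": kanji / total, "digit": digit / total}


for raw in normalize("＞ ＞ 了解しました。\r\nＴＥＬ: ０３-１２３４-５６７８").split("\n"):
    print(quote_depth(raw), char_ratios(raw))
```

A signature line such as the phone number tends to show a high digit ratio and no kana, which is the kind of signal a line classifier can use.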

See ARCHITECTURE.md for details and API.md for the full API reference.

Label Scheme

Label      Description
GREETING   Opening (お世話になっております)
BODY       Main content
CLOSING    Closing (よろしくお願いいたします)
SIGNATURE  Sender information
QUOTE      Quoted content
OTHER      Separators, blank lines
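
To show how per-line labels could turn into a body, here is a simplified sketch of the assembly step. It is an assumption-laden illustration rather than the library's algorithm: `assemble_body` is a hypothetical helper, the labels in the example are hand-assigned, and the rule used (keep everything from the first to the last GREETING/BODY/CLOSING line before the first SIGNATURE line) ignores the finer handling of inline quotes.

```python
CONTENT = {"GREETING", "BODY", "CLOSING"}


def assemble_body(lines: list[str], labels: list[str]) -> str:
    """Join lines from the first to the last content label before any signature."""
    if len(lines) != len(labels):
        raise ValueError("lines and labels must have the same length")
    # Everything from the first SIGNATURE line onward is dropped.
    end = labels.index("SIGNATURE") if "SIGNATURE" in labels else len(labels)
    content_idx = [i for i in range(end) if labels[i] in CONTENT]
    if not content_idx:
        return ""
    first, last = content_idx[0], content_idx[-1]
    # Lines in between (blank lines, inline quotes) are kept as-is.
    return "\n".join(lines[first : last + 1])


# Hand-labeled toy input; a real run would take labels from the CRF.
lines = ["田中様", "", "お世話になっております。", "山田です。", "", "以上", "", "--", "山田太郎"]
labels = ["OTHER", "OTHER", "GREETING", "BODY", "OTHER", "CLOSING", "OTHER", "OTHER", "SIGNATURE"]
print(assemble_body(lines, labels))
```

With these labels the addressee line, the separator, and the signature are dropped, while the blank line inside the message is preserved.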

Performance

Evaluated on 19,642 synthetic test emails:

Metric           Value
Content match    97.9%
Acceptable rate  98.0%
Confident wrong  0.14%

See PERFORMANCE.md for details.

Exceptions

from yomail import (
    ExtractionError,      # Base class
    InvalidInputError,    # Empty or invalid input
    NoBodyDetectedError,  # No body found
    LowConfidenceError,   # Confidence below threshold
)

Configuration

extractor = EmailBodyExtractor(
    model_path="path/to/model.crfsuite",  # Custom model
    confidence_threshold=0.5,              # Minimum confidence
)
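
Beyond the extractor's own threshold, a caller may want a second, stricter cut-off for sending results to manual review. The sketch below assumes that `confidence` lies between 0 and 1 (suggested by the 0.5 example above, but not confirmed here); `triage` and the `review_below` parameter are hypothetical names, and `SimpleNamespace` merely stands in for the object returned by `extract_with_metadata`.

```python
from types import SimpleNamespace


def triage(result, review_below: float = 0.8) -> str:
    """Return 'review' for low-confidence results, otherwise 'accept'."""
    return "review" if result.confidence < review_below else "accept"


# Stand-in results; real ones would come from extract_with_metadata.
print(triage(SimpleNamespace(body="本文", confidence=0.62)))
print(triage(SimpleNamespace(body="本文", confidence=0.97)))
```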

Development

# Setup
uv sync

# Run tests
uv run pytest

# Type check
uv run ty check

# Lint
uv run ruff check .

Training

Training data is generated by the yasumail project.

# Train model
python scripts/train.py data/training.jsonl -o models/email_body.crfsuite

# Evaluate
python scripts/evaluate.py data/test.jsonl

Dependencies
