[RFC]: Add offline attachment lowering and OCR fallback for text-only models #477

@hiqiancheng

Summary

Add a fully offline attachment lowering layer for TouchAI so that text-only models can still consume image and file attachments through deterministic text conversion. The first scope includes image OCR lowering, text/code/structured-text lowering, PDF text extraction, persistence of lowered content in the request-time prompt snapshot, and a dedicated local OCRService backed by an ONNX Runtime sidecar running PP-OCRv6 Tiny.

Motivation

TouchAI already has attachment inspection, persistence, prompt transport, and session replay, but unsupported image/file attachments are currently blocked before submit or omitted from provider transport for text-only models. That leaves no path for "the model cannot consume this attachment natively, but TouchAI can lower it into text first." This RFC adds that missing path while preserving the existing native multimodal flow for capable models.

Affected boundaries

  • AgentService
  • conversation runtime
  • tool execution
  • session persistence
  • context construction
  • instruction loading
  • agent orchestration
  • MCP integration
  • database schema or migrations

Proposed design

  • Add a dedicated apps/desktop/src/services/AgentService/infrastructure/attachments/lowering/ subsystem to own attachment delivery decisions and lowering strategies.
  • Keep native multimodal delivery for models that support image/file attachments.
  • Introduce a separate delivery decision model per attachment: native, lowered, or blocked.
  • Use OCR lowering for images and screenshots when the model lacks image support.
  • Use direct text lowering for text, code, and structured-text attachments when the model lacks file support.
  • Use PDF text extraction as the v1 fallback for unsupported PDFs. Scanned-PDF OCR fallback is explicitly out of scope for v1.
  • Keep original attachments persisted as attachments, but store lowered request-time truth in PromptSnapshot.loweredAttachments so history replays what the model actually received.
  • Add a dedicated OCRService that only performs OCR. It does not decide whether OCR should run and does not format prompt content.
  • Back OCRService with a local offline ONNX Runtime sidecar running PP-OCRv6 Tiny.
  • Update the SearchView send path from a binary supported/unsupported check to three states: supported, will-lower, blocked.
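
The per-attachment delivery decision and the three-state send path could be sketched as follows. This is a minimal illustration, not the proposed implementation: only the decision values (native, lowered, blocked) and the UI states (supported, will-lower, blocked) come from the design above. The type names, the `decideDelivery` and `toSendState` functions, the capability flags, and the rule that an image is blocked when OCR is unavailable are all assumptions made for the example.

```typescript
// Hypothetical types; the RFC only fixes the three decision values.
type DeliveryDecision = "native" | "lowered" | "blocked";
type SendState = "supported" | "will-lower" | "blocked";
type AttachmentKind = "image" | "text" | "code" | "structured-text" | "pdf" | "other";

interface ModelCapabilities {
  supportsImages: boolean;
  supportsFiles: boolean;
}

interface AttachmentInfo {
  id: string;
  kind: AttachmentKind;
}

// Decide how a single attachment reaches the model.
function decideDelivery(
  att: AttachmentInfo,
  caps: ModelCapabilities,
  ocrAvailable: boolean,
): DeliveryDecision {
  if (att.kind === "image") {
    if (caps.supportsImages) return "native";
    // Assumption: without a working OCR backend the image cannot be lowered.
    return ocrAvailable ? "lowered" : "blocked";
  }
  if (caps.supportsFiles) return "native";
  switch (att.kind) {
    case "text":
    case "code":
    case "structured-text":
    case "pdf": // v1: text extraction only, no scanned-PDF OCR
      return "lowered";
    default:
      return "blocked";
  }
}

// Collapse per-attachment decisions into one state for the send path.
// Assumption: a single blocked attachment blocks the whole submission.
function toSendState(decisions: DeliveryDecision[]): SendState {
  if (decisions.includes("blocked")) return "blocked";
  if (decisions.includes("lowered")) return "will-lower";
  return "supported";
}
```

Keeping the decision separate from the aggregate state lets a mixed native/lowered request still submit, with the UI reporting will-lower for the request as a whole.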

Alternatives and trade-offs

  1. Keep the current vision-first-only behavior.
    • Rejected because text-only models would remain unable to use image/file context at all.
  2. Add a local vision-model fallback instead of OCR/text lowering.
    • Rejected for this scope because it increases runtime and packaging complexity and is not required for the approved design.
  3. Use a Python PaddleOCR sidecar instead of ONNX Runtime.
    • Rejected as the primary path because Python packaging and distribution are heavier for a desktop application. ONNX Runtime provides a tighter offline packaging story once the sidecar contract exists.
  4. Put lowering logic directly into prompt transport or runtime.
    • Rejected because attachment lowering is fundamentally an attachment delivery concern and belongs with attachment inspection/materialization boundaries, not with message formatting.

Upstream references

Testing and rollout

Recommended slices:

  1. Add attachment lowering types and resolver with mocked strategies.
  2. Wire runtime and prompt transport to consume lowered blocks.
  3. Persist lowered blocks in prompt snapshot and replay them in history.
  4. Add OCRService contract and a mocked local OCR implementation.
  5. Replace mock OCR with the ONNX Runtime sidecar.
  6. Refine UI states from unsupported to will-lower and blocked.
  7. Add caching and hardening.
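
Slices 4 and 5 depend on an OCRService contract narrow enough that the mock and the sidecar are interchangeable. One possible shape is sketched below. Only the name OCRService and its "performs OCR only" responsibility come from this RFC. The `recognize` method, the `OcrResult`, `LoweredBlock`, and `LoweringResult` types, the `MockOCRService` class, and the `lowerImageWithOcr` helper are hypothetical.

```typescript
// Hypothetical contract: OCR in, plain text out. No prompt formatting here.
interface OcrResult {
  text: string;
}

interface OCRService {
  recognize(image: Uint8Array): Promise<OcrResult>;
}

// Mocked local implementation for slice 4; a sidecar-backed class
// would implement the same interface in slice 5.
class MockOCRService implements OCRService {
  private readonly cannedText: string;
  private readonly available: boolean;

  constructor(cannedText: string, available = true) {
    this.cannedText = cannedText;
    this.available = available;
  }

  async recognize(_image: Uint8Array): Promise<OcrResult> {
    if (!this.available) throw new Error("OCR sidecar unavailable");
    return { text: this.cannedText };
  }
}

interface LoweredBlock {
  attachmentId: string;
  strategy: "ocr";
  text: string;
}

type LoweringResult =
  | { ok: true; block: LoweredBlock }
  | { ok: false; attachmentId: string; reason: string };

// The lowering strategy owns the decision to call OCR and the shaping of
// the result; OCR failures become data instead of escaping as exceptions.
async function lowerImageWithOcr(
  attachmentId: string,
  image: Uint8Array,
  ocr: OCRService,
): Promise<LoweringResult> {
  try {
    const { text } = await ocr.recognize(image);
    return { ok: true, block: { attachmentId, strategy: "ocr", text: text.trim() } };
  } catch (err) {
    const reason = err instanceof Error ? err.message : String(err);
    return { ok: false, attachmentId, reason };
  }
}
```

Returning a failure value rather than throwing gives the caller a direct way to downgrade the attachment to blocked when OCR is unavailable.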

Verification should include runtime prompt construction, session replay stability across model switches, UI submission behavior for will-lower vs blocked, OCR availability failure handling, and mixed native/lowered attachment flows. Main risks are AgentService boundary churn, prompt snapshot replay correctness, and desktop packaging for the OCR sidecar.
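
Replay stability across model switches can be checked by rendering history purely from the stored snapshot. The sketch below is one way to express that. `PromptSnapshot.loweredAttachments` is named in the design above, but every other field, the strategy labels, the text layout, and the `renderSnapshotForHistory` function are assumptions for illustration.

```typescript
// Hypothetical snapshot shape; only loweredAttachments is named in the RFC.
interface LoweredAttachment {
  attachmentId: string;
  strategy: "ocr" | "text" | "pdf-text";
  text: string;
}

interface PromptSnapshot {
  modelId: string;
  userText: string;
  loweredAttachments: LoweredAttachment[];
}

// Render what the model received at request time. The currently selected
// model is deliberately not a parameter, so switching models later cannot
// change how an old turn replays.
function renderSnapshotForHistory(snapshot: PromptSnapshot): string {
  const blocks = snapshot.loweredAttachments.map(
    (a) => `[attachment ${a.attachmentId} via ${a.strategy}]\n${a.text}`,
  );
  return [snapshot.userText, ...blocks].join("\n\n");
}
```

A replay test could store a snapshot, switch the active model, and assert that the rendered text is identical before and after.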

Labels: area:agent-service, area:tauri, kind:rfc, status:triage