Add extract_screen_text tool (Vision OCR + optional visual summary) #6

Open
grohith327 wants to merge 1 commit into main from feat/text-extraction

Summary

Adds the extract_screen_text MCP tool: capture the active display and extract its visible text with macOS Vision OCR, returning structured JSON (text, line_count, average_confidence, screenshot_path). When include_visual_summary=true, it additionally requests a macOS 27 Foundation Models visual summary, degrading gracefully to OCR-only with a visual_error when that capability is unavailable.
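
The fallback behavior described above can be sketched roughly as follows. This is a minimal illustration, not the code in tools/screen_text.py: the function name `build_result`, the `summarize` callable, and the `visual_summary` key are assumptions made for the example. Only the four OCR fields and `visual_error` come from this PR description.

```python
import json
from typing import Callable, Optional


def build_result(
    ocr: dict,
    include_visual_summary: bool = False,
    summarize: Optional[Callable[[str], str]] = None,
) -> str:
    """Assemble the JSON payload, falling back to OCR-only if the summary fails.

    `summarize` is a hypothetical stand-in for the Foundation Models call.
    """
    result = {
        "text": ocr["text"],
        "line_count": ocr["line_count"],
        "average_confidence": ocr["average_confidence"],
        "screenshot_path": ocr["screenshot_path"],
    }
    if include_visual_summary:
        try:
            if summarize is None:
                raise RuntimeError("Foundation Models summary is unavailable")
            # "visual_summary" is an assumed key name, used for illustration.
            result["visual_summary"] = summarize(ocr["screenshot_path"])
        except Exception as exc:
            # Degrade gracefully: keep the OCR fields and report the failure.
            result["visual_error"] = str(exc)
    return json.dumps(result)
```

The point of the shape is that a caller always gets the OCR fields back, and can check for `visual_error` to learn that the summary step was skipped or failed.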

Ported from the sibling branch feat/extract-screen-text with two intentional deviations:

  • Bug fix: removed the undefined OCRTool() reference in the Foundation Models path (LanguageModelSession(model: model, tools: [OCRTool()]) → LanguageModelSession(model: model)), which broke compilation on macOS 27 toolchains.
  • Dropped the empty package.json / package-lock.json (this is a Python/uv project).

The altic-studio skill (new Mode B2) and README are updated so the model knows when and how to use the tool.

Changes

  • tools/screen_text.py (new) — Python wrapper (OCR + optional FM summary, JSON output, max_chars truncation)
  • tools/scripts/extract-screen-text.swift (new) — ScreenCaptureKit capture + Vision OCR + gated macOS 27 FM summary
  • skills/altic-studio/scripts/extract-screen-text.swift (new) — identical mirror
  • tests/test_screen_text.py (new) — 7 unit tests
  • server.py, SKILL.md, scripts/README.md, README.md — registration + docs
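
For reviewers unfamiliar with the max_chars option, one plausible shape for the truncation step is sketched below. The helper name `truncate_text`, the returned `truncated` flag, and the handling of `None` and negative values are assumptions; the actual wrapper in tools/screen_text.py may behave differently.

```python
from typing import Optional, Tuple


def truncate_text(text: str, max_chars: Optional[int] = None) -> Tuple[str, bool]:
    """Return (text, truncated), cutting text to at most max_chars characters.

    Assumes None means "no limit"; negative limits are rejected.
    """
    if max_chars is not None and max_chars < 0:
        raise ValueError("max_chars must be non-negative")
    if max_chars is None or len(text) <= max_chars:
        return text, False
    return text[:max_chars], True
```

With this sketch, `truncate_text("hello world", 5)` returns `("hello", True)`, and text already within the limit is returned unchanged with `False`.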

Test plan

  • uv run pytest tests/test_screen_text.py -q → 7 passed
  • uv run pytest -q (full suite) → 39 passed
  • import server confirms extract_screen_text is registered
  • swiftc -typecheck tools/scripts/extract-screen-text.swift → exit 0 on macOS 27 (the FoundationModels branch compiled, which confirms the OCRTool() fix)
  • ⚠️ End-to-end capture/OCR and the FM runtime summary were not run (they require a Screen Recording grant and a foreground window); see README "Manual Smoke Tests For Screen Text Tools".

Capture the active display and extract visible text via macOS Vision OCR,
returning structured JSON (text, line_count, average_confidence,
screenshot_path). When include_visual_summary=true, additionally request a
macOS 27 Foundation Models summary, degrading gracefully to OCR-only with a
visual_error when unavailable.

Ported from the sibling branch feat/extract-screen-text, with the undefined
OCRTool() reference in the Foundation Models path removed (it broke compilation
on macOS 27 toolchains) and the empty package.json/package-lock.json cruft
dropped. Wires the tool into server.py and documents it in the altic-studio
skill and README.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

grohith327 commented Jun 18, 2026

PR Notes

HTML preview: https://altic-mcp-pr-6.vercel.app

Source commit: 032ddc2093ac4819a40ec37254031b1c211da80a
Branch: feat/text-extraction
Generated: 2026-06-18 00:49:11 UTC

Adds the extract_screen_text MCP tool (Vision OCR + optional macOS 27 Foundation Models visual summary), ported from a sibling branch with a Swift compile-bug fix. 39 tests pass; Swift type-checks clean on macOS 27.
