Summary
Add support for image inputs in text/transcript mode so that CAAL can answer questions about images, identify objects, and read text from photos through vision-language models (VLMs) such as Qwen3-VL and LLaVA.
Proposed Solution
1. Image Input Pipeline
- Add image upload button to chat interface
- Display uploaded images in conversation history
2. Provider Support
- Detect VLM models
- Format image messages for Ollama API:
  {
    "role": "user",
    "content": "What's in this image?",
    "images": ["base64_encoded_image_data"]
  }
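As a rough illustration of the two provider-side steps above, the sketch below builds a message in the shape shown and checks whether a model reports vision support. The helper names `build_image_message` and `supports_vision` are hypothetical and not part of CAAL. The detection step assumes the metadata returned for a model (for example from Ollama's model "show" endpoint) carries a `capabilities` list that contains `"vision"` for VLMs; that should be verified against the Ollama version in use.

```python
import base64
from typing import Any, Dict


def build_image_message(prompt: str, image_bytes: bytes) -> Dict[str, Any]:
    """Build a user message with one base64-encoded image attached.

    Hypothetical helper: mirrors the message shape proposed in this issue.
    """
    encoded = base64.b64encode(image_bytes).decode("ascii")
    return {"role": "user", "content": prompt, "images": [encoded]}


def supports_vision(model_info: Dict[str, Any]) -> bool:
    """Return True if the model metadata lists a vision capability.

    Assumption: model_info is the parsed model metadata and includes a
    "capabilities" list. A missing list is treated as text-only.
    """
    return "vision" in model_info.get("capabilities", [])
```

Treating missing metadata as text-only keeps the feature opt-in: a model is only sent images when it explicitly advertises support.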
Use Case
1. Vision Assistance
- "What's in this image?" - General image description
- "Read the text from this receipt" - OCR functionality
2. Smart Home Integration
- "What's in my fridge?" - Using camera input
- "Is there a package on my doorstep?" - Using doorbell camera
Additional Context
Performance:
- Longer inference time for vision models
- Image preprocessing (resize to 512x512 or 768x768 for faster inference)
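One way to pick the resize target is sketched below. It departs slightly from a fixed 512x512 or 768x768 square: it assumes the aspect ratio should be preserved and only caps the longest side, which is a design choice rather than something this issue specifies. The function name `fit_within` is hypothetical, and the actual pixel resampling would be done by whatever imaging library CAAL ends up using.

```python
from typing import Tuple


def fit_within(width: int, height: int, max_side: int = 768) -> Tuple[int, int]:
    """Compute downscaled dimensions whose longest side is at most max_side.

    Images that already fit are returned unchanged (no upscaling).
    """
    if width <= 0 or height <= 0 or max_side <= 0:
        raise ValueError("dimensions must be positive")
    longest = max(width, height)
    if longest <= max_side:
        return width, height
    scale = max_side / longest
    # Clamp to 1 so extreme aspect ratios never produce a zero-pixel side.
    return max(1, round(width * scale)), max(1, round(height * scale))
```

Skipping the upscale for small images avoids spending inference time on pixels that carry no extra information.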
User experience:
- Display image thumbnails in chat history
- Visual feedback when image is being processed
- Error handling for unsupported formats
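A minimal sketch of the format check follows. It inspects the leading bytes of the upload instead of trusting the file extension. The set of accepted formats (`png`, `jpeg`, `webp`) is an assumption made for illustration, and `UnsupportedImageError`, `detect_image_format`, and `validate_upload` are hypothetical names.

```python
from typing import Iterable, Optional

# Well-known file signatures ("magic bytes") for common image formats.
_SIGNATURES = (
    (b"\x89PNG\r\n\x1a\n", "png"),
    (b"\xff\xd8\xff", "jpeg"),
    (b"GIF87a", "gif"),
    (b"GIF89a", "gif"),
)


class UnsupportedImageError(ValueError):
    """Raised when an upload is not one of the accepted image formats."""


def detect_image_format(data: bytes) -> Optional[str]:
    """Return a short format name based on the file signature, or None."""
    for signature, name in _SIGNATURES:
        if data.startswith(signature):
            return name
    # WebP is a RIFF container: "RIFF" + 4-byte size + "WEBP".
    if data[:4] == b"RIFF" and data[8:12] == b"WEBP":
        return "webp"
    return None


def validate_upload(data: bytes, allowed: Iterable[str] = ("png", "jpeg", "webp")) -> str:
    """Return the detected format, or raise UnsupportedImageError."""
    detected = detect_image_format(data)
    if detected is None or detected not in set(allowed):
        raise UnsupportedImageError(
            "unsupported image format: %s" % (detected or "unknown")
        )
    return detected
```

The raised error carries a message that the chat interface could show directly to the user.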
Compatibility
- Text-only models continue to work unchanged
- VLM support is opt-in (ideally auto-detected)
- If manually enabled, image messages should be ignored by non-VLM providers
- Should work with any VL model (similar to LM Studio)
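The fallback for non-VLM providers could look roughly like the sketch below. It interprets "ignored" as dropping the `images` field while keeping the text of the message, which is an assumption; dropping the whole message would be another valid reading. `prepare_messages` is a hypothetical helper name.

```python
from typing import Any, Dict, List


def prepare_messages(
    messages: List[Dict[str, Any]], vision_enabled: bool
) -> List[Dict[str, Any]]:
    """Return a copy of messages suitable for the target provider.

    When vision is disabled, the "images" field is stripped so that
    text-only models receive an ordinary chat history.
    """
    if vision_enabled:
        return [dict(message) for message in messages]
    return [
        {key: value for key, value in message.items() if key != "images"}
        for message in messages
    ]
```

Because the function returns copies, the stored conversation history keeps its images and can still be sent to a VLM later.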