[FEATURE] Add image input support for Vision-Language models (e.g. Qwen3-VL) #37

Description

@AbdulShahzeb

Summary

Add support for image inputs in text/transcript mode so that CAAL can answer questions about images, identify objects, and read text from photos through vision-language models (VLMs) such as Qwen3-VL and LLaVA.

Proposed Solution

1. Image Input Pipeline

  • Add image upload button to chat interface
  • Display uploaded images in conversation history

2. Provider Support

  • Detect VLM models
  • Format image messages for Ollama API:
{
  "role": "user",
  "content": "What's in this image?",
  "images": ["base64_encoded_image_data"]
}
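A minimal sketch of how such a message could be built in Python. The helper name `build_image_message` is hypothetical and not part of CAAL; it assumes the image is already available as raw bytes and that the `images` field takes plain base64 strings, as in the JSON above.

```python
import base64


def build_image_message(prompt: str, image_bytes: bytes) -> dict:
    """Build one user message carrying a prompt and a single image.

    Hypothetical helper: the field names mirror the JSON shape shown in
    this issue; nothing here is taken from the CAAL codebase.
    """
    # Assumption: the image is sent as plain base64 text,
    # without a "data:image/...;base64," prefix.
    encoded = base64.b64encode(image_bytes).decode("ascii")
    return {"role": "user", "content": prompt, "images": [encoded]}
```

Keeping the encoding in one small function means the chat UI and any camera integration can share the same message format.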

Use Cases

1. Vision Assistance

  • "What's in this image?" - General image description
  • "Read the text from this receipt" - OCR functionality

2. Smart Home Integration

  • "What's in my fridge?" - Using camera input
  • "Is there a package on my doorstep?" - Using doorbell camera

Additional Context

Performance:

  • Vision models take longer to run inference than text-only models
  • Image preprocessing (e.g. resizing to 512x512 or 768x768) to speed up inference
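One way to pick the resize target is sketched below. It assumes the aspect ratio should be preserved and that the longest side is capped at 768 pixels; the issue only gives 512x512 and 768x768 as example sizes, so both choices are assumptions. The function name `fit_within` is hypothetical. The actual pixel resampling would be done by an imaging library and is not shown.

```python
from typing import Tuple


def fit_within(width: int, height: int, max_side: int = 768) -> Tuple[int, int]:
    """Return dimensions scaled down so the longest side is at most max_side.

    Images that already fit are returned unchanged (no upscaling).
    """
    if width <= 0 or height <= 0 or max_side <= 0:
        raise ValueError("width, height and max_side must be positive")
    longest = max(width, height)
    if longest <= max_side:
        return width, height
    # Scale both sides by the same factor to keep the aspect ratio.
    new_width = max(1, round(width * max_side / longest))
    new_height = max(1, round(height * max_side / longest))
    return new_width, new_height
```

For example, a 4032x3024 phone photo would be reduced to 768x576 before being encoded and sent to the model.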

User experience:

  • Display image thumbnails in chat history
  • Visual feedback when image is being processed
  • Error handling for unsupported formats
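Unsupported formats could be rejected before anything is sent to the model by checking the file's leading bytes. The sketch below is only illustrative: the accepted set (PNG, JPEG, GIF, WebP) is an assumption, and `detect_image_format`, `validate_upload`, and `UnsupportedImageError` are hypothetical names.

```python
from typing import Optional

# Well-known file signatures (magic bytes) for common image formats.
_SIGNATURES = (
    ("png", b"\x89PNG\r\n\x1a\n"),
    ("jpeg", b"\xff\xd8\xff"),
    ("gif", b"GIF8"),
)


class UnsupportedImageError(ValueError):
    """Raised when an upload is not in a recognized image format."""


def detect_image_format(data: bytes) -> Optional[str]:
    """Return a format name based on magic bytes, or None if unrecognized."""
    for name, magic in _SIGNATURES:
        if data.startswith(magic):
            return name
    # WebP is a RIFF container with "WEBP" at byte offset 8.
    if data[:4] == b"RIFF" and data[8:12] == b"WEBP":
        return "webp"
    return None


def validate_upload(data: bytes) -> str:
    """Return the detected format, or raise with a user-facing message."""
    fmt = detect_image_format(data)
    if fmt is None:
        raise UnsupportedImageError(
            "Unsupported image format; expected PNG, JPEG, GIF, or WebP."
        )
    return fmt
```

Checking content rather than the file extension avoids trusting a renamed file, and the exception message can be shown directly in the chat UI.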

Compatibility

  • Text-only models continue to work unchanged
  • VLM support is opt-in (ideally auto-detected)
  • If VLM support is enabled manually, non-VLM providers should ignore image messages
  • Should work with any VL model (similar to LM Studio)
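The compatibility rules above could be sketched roughly as follows. This assumes the provider exposes a model-info response containing a `capabilities` list that includes `"vision"` for VLMs; recent Ollama versions appear to report this from `/api/show`, but that should be verified against the version in use. Both function names are hypothetical.

```python
from typing import Any, Dict, List


def supports_vision(model_info: Dict[str, Any]) -> bool:
    """Return True if the model-info dict lists a "vision" capability.

    Assumption: model_info looks like {"capabilities": ["completion", "vision"]}.
    A missing or empty list is treated as text-only.
    """
    return "vision" in (model_info.get("capabilities") or [])


def strip_images(
    messages: List[Dict[str, Any]], vision_enabled: bool
) -> List[Dict[str, Any]]:
    """Drop the "images" field from every message for text-only models.

    Returns new dicts when stripping, so the stored chat history keeps its images.
    """
    if vision_enabled:
        return messages
    return [{k: v for k, v in m.items() if k != "images"} for m in messages]
```

With this split, text-only models receive exactly the messages they do today, and switching to a VLM later can still make use of images already in the history.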

Metadata

Labels: enhancement (New feature or request)