
feat(vision): image + screen input for jarvis ask #486

Open
joshazmy wants to merge 1 commit into open-jarvis:main from joshazmy:feature/local-vision

Conversation

joshazmy commented Jun 3, 2026


What does this PR do?

Adds image input to jarvis ask so OpenJarvis can use vision-capable local
models (e.g. gemma3:4b) without giving up its local-first promise.

  • -i / --image <file> — attach one or more images (repeatable).
  • -S / --screen — capture the primary monitor and send it.
  • Message.images carries base64 image data, which messages_to_dicts()
    forwards to the images field of Ollama's /api/chat request. Text-only
    requests are unchanged.
  • Vision auto-routes to direct-to-engine mode; with an explicit --agent it
    warns instead of silently dropping the image.
  • Privacy guard warns before sending an image to a non-local engine; the
    security guardrail preserves images when sanitizing a flagged prompt.
  • Adds JARVIS_NUM_CTX (default 16384) to size the Ollama context window
    for an image plus a conversation.
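
The data flow in the bullets above can be sketched roughly as follows. This is a minimal illustration, not the PR's actual code: the `Message` dataclass, the `encode_image` helper, and the body of `messages_to_dicts` are assumptions that reuse the names from this description. The only external fact relied on is that Ollama's /api/chat accepts a per-message `images` list of base64 strings.

```python
import base64
from dataclasses import dataclass, field


@dataclass
class Message:
    """Hypothetical stand-in for the project's Message type."""

    role: str
    content: str
    # Base64-encoded image payloads; empty for text-only messages.
    images: list[str] = field(default_factory=list)


def encode_image(data: bytes) -> str:
    """Base64-encode raw image bytes (hypothetical helper)."""
    return base64.b64encode(data).decode("ascii")


def messages_to_dicts(messages: list[Message]) -> list[dict]:
    """Serialize messages, adding "images" only when a message has any."""
    out = []
    for m in messages:
        d = {"role": m.role, "content": m.content}
        if m.images:
            # Text-only messages never gain this key, so their
            # serialized form stays exactly as before.
            d["images"] = list(m.images)
        out.append(d)
    return out
```

Adding the key only when images are present is what would keep text-only payloads byte-for-byte identical to the previous behaviour.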

Closes #485.

How was this tested?

  • 6 new unit tests in tests/test_vision.py covering the data-flow contract:
    image forwarding to the serializer, text-only messages unaffected, the
    guardrail preserving images through sanitization, and JARVIS_NUM_CTX
    parsing (default + override + invalid-fallback). All pass locally.
  • Manually verified end to end on Windows with Ollama + gemma3:4b (both
    --image and --screen), running 100% on-GPU.
  • Honest caveat: the non-Windows screen-capture path (mss / Pillow) is
    implemented but I have only verified the Windows .NET path on real
    hardware — a macOS/Linux check would be appreciated.
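
For reviewers, the default / override / invalid-fallback behaviour that the JARVIS_NUM_CTX tests cover might look like the sketch below. The function name `resolve_num_ctx` is hypothetical, and treating non-positive values as invalid is an assumption of this sketch rather than something the PR states.

```python
import os

DEFAULT_NUM_CTX = 16384  # default stated in this PR description


def resolve_num_ctx(env=None) -> int:
    """Return JARVIS_NUM_CTX as an int, or the default if unset or invalid."""
    env = os.environ if env is None else env
    raw = env.get("JARVIS_NUM_CTX")
    if raw is None:
        return DEFAULT_NUM_CTX
    try:
        value = int(raw)
    except ValueError:
        # Non-numeric input falls back instead of crashing the CLI.
        return DEFAULT_NUM_CTX
    # Assumption: zero or negative sizes are treated as invalid too.
    return value if value > 0 else DEFAULT_NUM_CTX
```

The resolved value would then presumably be passed to Ollama as `options.num_ctx` in the request body.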

Checklist

  • Tests pass — new tests/test_vision.py passes locally; the full suite
    runs on CI (my local venv lacks some optional engine test deps, e.g.
    respx).
  • Linter passes (uv run ruff check src/ tests/)
  • Formatter passes on the changed files (ruff format --check). I left alone
    the 78 unrelated pre-existing files that the check also flags; they differ
    only because of a ruff version mismatch, not this change.
  • New/changed public API has docstrings
  • Follows registry pattern — N/A; this wraps the existing engine/CLI and
    adds no new registry component.
  • Documentation updated — docs/user-guide/cli.md (new "Vision Input"
    section + flag table) and CHANGELOG.md.

OpenJarvis can run vision-capable local models (gemma3, qwen2.5-vl), but the
CLI had no way to send them a picture -- the Ollama engine only serialized
text. This adds end-to-end image input.

What's new
- `jarvis ask -i/--image <file>` attaches one or more images to the query.
- `jarvis ask -S/--screen` captures the primary monitor (dependency-free on
  Windows via .NET; mss/Pillow fallback elsewhere).
- Vision auto-routes to direct-to-engine mode; with an explicit --agent it
  warns rather than silently dropping the image.
- Privacy guard: warns before sending an image to a non-local engine,
  keeping OpenJarvis local-first by default.
- Context-window default raised 8k -> 16k (JARVIS_NUM_CTX) so an image plus
  a conversation fit.
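
One way the privacy guard's "non-local engine" check could work is sketched below. This is an assumption for illustration only: `is_local_engine` and `should_warn_before_sending` are hypothetical names, and the actual guard may decide locality differently (for example by engine type rather than by URL).

```python
import ipaddress
from urllib.parse import urlparse


def is_local_engine(base_url: str) -> bool:
    """Return True if the engine URL points at this machine (loopback)."""
    host = urlparse(base_url).hostname
    if host is None:
        return False
    if host == "localhost":
        return True
    try:
        return ipaddress.ip_address(host).is_loopback
    except ValueError:
        # A hostname that is neither "localhost" nor an IP literal is
        # conservatively treated as remote.
        return False


def should_warn_before_sending(images: list, base_url: str) -> bool:
    """Warn only when there is an image and the engine is not local."""
    return bool(images) and not is_local_engine(base_url)
```

Treating unknown hostnames as remote errs on the side of warning, which matches the local-first default described above.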

Implementation
- Message.images carries base64 data; messages_to_dicts() forwards it to
  Ollama's /api/chat "images" field. Text-only messages are unchanged.
- GuardrailsEngine preserves images when it rewrites a flagged message.
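
The guardrail behaviour can be illustrated with a small sketch. The `Message` dataclass and `sanitize_message` function here are hypothetical, not the real GuardrailsEngine API; the point is only that rewriting the text should copy the message rather than rebuild it from the text alone, which would lose the images.

```python
from dataclasses import dataclass, replace
from typing import Callable


@dataclass(frozen=True)
class Message:
    """Hypothetical stand-in for the project's Message type."""

    role: str
    content: str
    images: tuple[str, ...] = ()  # base64-encoded image payloads


def sanitize_message(msg: Message, redact: Callable[[str], str]) -> Message:
    """Rewrite only the text of a flagged message; keep every other field."""
    # dataclasses.replace copies all fields and overrides just `content`,
    # so `images` (and `role`) survive the rewrite.
    return replace(msg, content=redact(msg.content))
```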

Tests (tests/test_vision.py, 6/6 pass, ruff-clean)
- payload forwarding, text path untouched, num_ctx override, guardrail
  image preservation.

Verified on AMD RX 9070 XT (Ollama/Vulkan, 100% GPU) with gemma3:4b:
solid-color image, file image, and live screen capture all described.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>


Development

Successfully merging this pull request may close these issues.

Vision input for jarvis ask (--image / --screen)
