feat(vision): image + screen input for jarvis ask by joshazmy · Pull Request #486 · open-jarvis/OpenJarvis

joshazmy · 2026-06-03T05:21:46Z

What does this PR do?

Adds image input to jarvis ask so OpenJarvis can use vision-capable local
models (e.g. gemma3:4b) without giving up its local-first promise.

- -i / --image <file>: attach one or more images (repeatable).
- -S / --screen: capture the primary monitor and send it.
- Message.images (base64) → messages_to_dicts() → Ollama /api/chat
  images. Text-only requests are unchanged.
- Vision auto-routes to direct-to-engine mode; with an explicit --agent it
  warns instead of silently dropping the image.
- Privacy guard warns before sending an image to a non-local engine; the
  security guardrail preserves images when sanitizing a flagged prompt.
- Adds JARVIS_NUM_CTX (default 16384) to size the Ollama context window
  for an image plus a conversation.
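
The Message.images → messages_to_dicts() → /api/chat flow above can be sketched roughly as follows. This is a minimal illustration, not the code in this PR: the `Message` dataclass and the `encode_image` helper are hypothetical stand-ins, and only the per-message `images` field (a list of base64 strings) reflects what Ollama's chat endpoint accepts.

```python
import base64
from dataclasses import dataclass, field
from typing import Any


@dataclass
class Message:
    """Hypothetical stand-in for the project's Message type."""

    role: str
    content: str
    # Base64-encoded image data; empty for text-only messages.
    images: list[str] = field(default_factory=list)


def encode_image(raw: bytes) -> str:
    """Hypothetical helper: base64-encode raw image bytes as ASCII text."""
    return base64.b64encode(raw).decode("ascii")


def messages_to_dicts(messages: list[Message]) -> list[dict[str, Any]]:
    """Serialize messages, adding an "images" key only when images exist."""
    out: list[dict[str, Any]] = []
    for msg in messages:
        item: dict[str, Any] = {"role": msg.role, "content": msg.content}
        if msg.images:
            # Text-only messages never get this key, so their payload
            # is identical to what it was before images were supported.
            item["images"] = list(msg.images)
        out.append(item)
    return out
```

Omitting the key for text-only messages, rather than sending an empty list, is one way to keep existing requests byte-for-byte unchanged.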

Closes #485.

How was this tested?

- 6 new unit tests in tests/test_vision.py covering the data-flow contract:
  image forwarding to the serializer, text-only messages unaffected, the
  guardrail preserving images through sanitization, and JARVIS_NUM_CTX
  parsing (default + override + invalid-fallback). All pass locally.
- Manually verified end to end on Windows with Ollama + gemma3:4b (both
  --image and --screen), running 100% on-GPU.
- Honest caveat: the non-Windows screen-capture path (mss / Pillow) is
  implemented, but I have only verified the Windows .NET path on real
  hardware. A macOS/Linux check would be appreciated.
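
The three JARVIS_NUM_CTX cases named above (default, override, invalid-fallback) might look something like the sketch below. The function name `resolve_num_ctx` is hypothetical, and treating non-positive values as invalid is an assumption; the PR only states the variable name and the 16384 default.

```python
import os
from collections.abc import Mapping
from typing import Optional

DEFAULT_NUM_CTX = 16384  # default stated in the PR description


def resolve_num_ctx(env: Optional[Mapping[str, str]] = None) -> int:
    """Return the context size from JARVIS_NUM_CTX, or the default.

    Hypothetical helper: falls back to the default when the variable is
    unset, not an integer, or (by assumption) not positive.
    """
    source = os.environ if env is None else env
    raw = source.get("JARVIS_NUM_CTX")
    if raw is None:
        return DEFAULT_NUM_CTX
    try:
        value = int(raw.strip())
    except ValueError:
        return DEFAULT_NUM_CTX
    return value if value > 0 else DEFAULT_NUM_CTX
```

Passing the environment as a parameter keeps the function easy to unit-test without patching `os.environ`.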

Checklist

- Tests pass: new tests/test_vision.py passes locally; the full suite
  runs on CI (my local venv lacks some optional engine test deps, e.g.
  respx).
- Linter passes (uv run ruff check src/ tests/)
- Formatter passes on the changed files (ruff format --check). I left the
  78 unrelated pre-existing files alone; they differ only because of a ruff
  version mismatch, not this change.
- New/changed public API has docstrings
- Follows registry pattern: N/A; this wraps the existing engine/CLI and
  adds no new registry component.
- Documentation updated: docs/user-guide/cli.md (new "Vision Input"
  section + flag table) and CHANGELOG.md.

OpenJarvis can run vision-capable local models (gemma3, qwen2.5-vl), but
the CLI had no way to send them a picture -- the Ollama engine only
serialized text. This adds end-to-end image input.

What's new
- `jarvis ask -i/--image <file>` attaches one or more images to the query.
- `jarvis ask -S/--screen` captures the primary monitor (dependency-free on
  Windows via .NET; mss/Pillow fallback elsewhere).
- Vision auto-routes to direct-to-engine mode; with an explicit --agent it
  warns rather than silently dropping the image.
- Privacy guard: warns before sending an image to a non-local engine,
  keeping OpenJarvis local-first by default.
- Context-window default raised 8k -> 16k (JARVIS_NUM_CTX) so an image plus
  a conversation fit.

Implementation
- Message.images carries base64 data; messages_to_dicts() forwards it to
  Ollama's /api/chat "images" field. Text-only messages are unchanged.
- GuardrailsEngine preserves images when it rewrites a flagged message.

Tests (tests/test_vision.py, 6/6 pass, ruff-clean)
- payload forwarding, text path untouched, num_ctx override, guardrail
  image preservation.

Verified on AMD RX 9070 XT (Ollama/Vulkan, 100% GPU) with gemma3:4b:
solid-color image, file image, and live screen capture all described.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
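
For reviewers, the privacy guard mentioned in the commit message could be approximated as below. This is only a sketch under assumptions: the PR does not say how "non-local" is decided, so the loopback-host check and both function names (`is_local_host`, `image_privacy_warning`) are hypothetical.

```python
import ipaddress
from typing import Optional
from urllib.parse import urlparse


def is_local_host(base_url: str) -> bool:
    """Return True if the engine URL points at this machine (assumed rule)."""
    host = urlparse(base_url).hostname
    if host is None:
        # Unparseable or scheme-less URL: conservatively treat as non-local.
        return False
    if host == "localhost":
        return True
    try:
        return ipaddress.ip_address(host).is_loopback
    except ValueError:
        # A DNS name other than "localhost": treat as remote.
        return False


def image_privacy_warning(base_url: str, has_images: bool) -> Optional[str]:
    """Return a warning string when images would leave the machine."""
    if has_images and not is_local_host(base_url):
        return f"Warning: about to send an image to a non-local engine ({base_url})."
    return None
```

Returning the warning text instead of printing it leaves the CLI free to decide whether to prompt, log, or abort.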

joshazmy requested review from ANarayan, jonsaadfalcon and robbym-dev as code owners June 3, 2026 05:21

joshazmy wants to merge 1 commit into open-jarvis:main from joshazmy:feature/local-vision
