fix(browser_planner): gate multimodal content on per-model vision support by haroldship · Pull Request #251 · cuga-project/cuga-agent

haroldship · 2026-05-20T13:49:34Z

Bug fix

Fixes #214

Summary

When advanced_features.use_vision = true (the default), load_prompt_with_image unconditionally builds the browser planner's user message as a list-typed multimodal [image_url, text] block. LLM endpoints that don't accept multimodal content (e.g. WatsonX served via an openai-compatible gateway) then reject the request with 400 messages[1].content must be a string, killing hybrid-mode web-agent runs before the first step.

This change introduces a per-model enable_vision flag (resolved by the new model_supports_vision helper, defaulting to True to preserve existing behavior for vision-capable models). Both the prompt template structure (load_prompt_with_image) and the runtime image attachment (BrowserPlannerAgent.run) now gate on settings.advanced_features.use_vision AND model_supports_vision(model_config). Operators whose LLM endpoint rejects multimodal content can set enable_vision = false on the relevant agent.*.model config (e.g. agent.planner.model) to force string-only user content.

Why a per-model flag

The bug surfaces when the endpoint (not the model) cannot deserialize multimodal content blocks. Detecting endpoint capabilities at runtime is brittle, and silently downgrading vision would mask misconfigurations. An explicit enable_vision = false opt-out is small, discoverable, and additive — existing model configs continue to work unchanged because the default is True.

Testing

Rebased onto main (69b7e40a) — branch contains a single commit 9b7bad82.

New regression tests in tests/unit/test_load_prompt_with_image_vision_gate.py (7 tests):

model_supports_vision defaults: None config and missing-attribute config both default to True; explicit enable_vision=False and enable_vision=True are honored.
Human message is a string when the model disables vision, even with global use_vision=True (the regression for [Bug]: Web agent in hybrid mode fails with "messages[1].content must be a string" (400 invalid_request_error) #214).
Human message is multimodal (list with image_url + text items) when both global setting and per-model flag are True.
Human message is a string when the global setting is False regardless of per-model flag.

Test results (against this branch, rebased on `main`)

Suite	Result
`tests/unit/test_load_prompt_with_image_vision_gate.py` (new)	✅ 7 / 7 passed
`src/cuga/backend/tools_env/registry/tests/`	✅ all passed
`src/cuga/backend/cuga_graph/nodes/api/variables_manager/tests/`	✅ all passed
`src/cuga/backend/cuga_graph/nodes/cuga_lite/executors/tests/` (non-e2b)	✅ all passed
`tests/unit/test_task_analyzer_app_matching.py`	✅ all passed
`tests/unit/test_cuga_lite_bind_tools.py`	✅ all passed
`src/cuga/backend/cuga_graph/nodes/cuga_lite/executors/tests/test_e2b_lite.py`	⚠️ 3 failed — `RuntimeError: e2b-code-interpreter package not installed` (env-only; CI runs these with `uv sync --extra e2b` per `src/scripts/run_tests.sh`)
`src/cuga/sdk_core/tests/` (live LLM integration)	⚠️ flaky locally — pass individually, fail when run in large batches due to external-service rate limits / setup. Unrelated to this diff.
`ruff check` (touched files)	✅ clean
`ruff format --check` (touched files)	✅ clean
CodeRabbit CLI (`--type committed --base main`)	✅ no findings against this diff

Aggregate: 211 passing across the unit-test subset relevant to this change, with the only failures being env/network issues that also fail on main.

Verified fix locally; tests pass

Summary by CodeRabbit

Bug Fixes
- Enhanced multimodal content handling to validate model vision capabilities before attaching images, preventing errors when models reject image content
- Added per-model vision capability checks alongside global settings for more granular control
Tests
- Added comprehensive test coverage for vision feature gating and multimodal fallback behavior across different model configurations

coderabbitai · 2026-05-20T13:50:28Z

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info

⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: b1884a26-b87a-4378-8595-461fdde6c4cd

📥 Commits

Reviewing files that changed from the base of the PR and between 819d5b5 and 9b7bad8.

📒 Files selected for processing (3)

src/cuga/backend/cuga_graph/nodes/browser/browser_planner_agent/browser_planner_agent.py
src/cuga/backend/llm/utils/helpers.py
tests/unit/test_load_prompt_with_image_vision_gate.py

📝 Walkthrough

Walkthrough

The PR fixes issue #214 by introducing per-model vision capability detection. It prevents multimodal message content from being sent to models that reject list-typed content, using both global settings and model-specific configuration to gate vision features.

Changes

Per-model vision capability detection and gating

Layer / File(s)	Summary
Model vision capability detection and prompt formatting `src/cuga/backend/llm/utils/helpers.py`	New `model_supports_vision()` helper checks model config for `enable_vision` capability. `load_prompt_with_image()` gates multimodal prompt selection using both global `settings.advanced_features.use_vision` flag and per-model capability.
BrowserPlannerAgent vision capability integration `src/cuga/backend/cuga_graph/nodes/browser/browser_planner_agent/browser_planner_agent.py`	Constructor accepts injected `use_vision_effective` flag. `run()` uses the flag to conditionally set prompt variables and attach images. `create()` computes `use_vision_effective` from global setting and model capability, passing it to constructor.
Vision gating regression tests `tests/unit/test_load_prompt_with_image_vision_gate.py`	Tests validate `model_supports_vision()` defaulting behavior, model/global gating of multimodal content, and correct message formatting when vision is enabled or disabled.

Estimated code review effort

🎯 2 (Simple) | ⏱️ ~12 minutes

Poem

🐰 A model cried "no lists for me!"
So we check its vision capability,
Global settings dance with model config,
Multimodal prompts now un-block,
Web agents skip the image grief! 🎉

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

Check name	Status	Explanation	Resolution
Docstring Coverage	⚠️ Warning	Docstring coverage is 54.29% which is insufficient. The required threshold is 80.00%.	Write docstrings for the functions missing them to satisfy the coverage threshold.

✅ Passed checks (4 passed)

Check name	Status	Explanation
Description Check	✅ Passed	Check skipped - CodeRabbit’s high-level summary is enabled.
Title check	✅ Passed	The title 'fix(browser_planner): gate multimodal content on per-model vision support' accurately describes the main change: introducing per-model vision support gating for multimodal content in the browser planner agent.
Linked Issues check	✅ Passed	The changes address all coding objectives from `#214`: preventing list-typed content for models that require strings, making multimodal behavior configurable per-model, preserving default behavior, and adding comprehensive tests.
Out of Scope Changes check	✅ Passed	All changes are directly scoped to `#214`: the new model_supports_vision helper, BrowserPlannerAgent refactoring, load_prompt_with_image gating, and unit tests all address the multimodal content rejection issue without introducing unrelated modifications.

_{✏️ Tip: You can configure your own custom pre-merge checks in the settings.}

✨ Finishing Touches

📝 Generate docstrings

Create stacked PR
Commit on current branch

🧪 Generate unit tests (beta)

Create PR with unit tests
Commit unit tests in branch fix/214-hybrid-web-agent-multimodal-content

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

_{Comment @coderabbitai help to get the list of available commands and usage tips.}

coderabbitai

Actionable comments posted: 2

🤖 Prompt for all review comments with AI agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@src/cuga/sdk_core/tests/test_sdk_token_usage_tracker.py`:
- Around line 86-118: The test currently creates a local ActivityTracker and
compares it with the tracker inside TokenUsageTracker, but the SDK should expose
a shared tracker on the agent; update the test to use the agent's shared tracker
instead of creating a new one. Specifically, remove the local `tracker =
ActivityTracker()` and instead obtain the tracker from the agent (e.g., `tracker
= agent._activity_tracker` or after building callbacks assert
`token_tracker.tracker is agent._activity_tracker`), reset `tracker.prompts =
[]`, then proceed to call `tracker.collect_prompt(...)` and assert prompts; also
replace the identity assertion `token_tracker.tracker is tracker` with
`token_tracker.tracker is agent._activity_tracker` (or equivalent) to match the
fixed SDK behavior.

In `@src/cuga/sdk.py`:
- Around line 1346-1366: The ActivityTracker is recreated each time in
_build_callbacks(), preventing accumulation across calls; initialize
self._activity_tracker once in __init__ (create ActivityTracker() there) and
then change _build_callbacks() to use that instance (e.g.,
TokenUsageTracker(self._activity_tracker)) so all invocations from
_apply_callbacks(), invoke(), stream() and graph creation share the same
tracker; if needed, lazily create self._activity_tracker inside
_build_callbacks() only when it is None to preserve backward compatibility.

🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

Push a commit to this branch (recommended)
Create a new PR with the fixes

ℹ️ Review info

⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: aeaf6429-3498-49e7-814c-ae256d66bb68

📥 Commits

Reviewing files that changed from the base of the PR and between 69b7e40 and 819d5b5.

📒 Files selected for processing (5)

src/cuga/backend/cuga_graph/nodes/browser/browser_planner_agent/browser_planner_agent.py
src/cuga/backend/llm/utils/helpers.py
src/cuga/sdk.py
src/cuga/sdk_core/tests/test_sdk_token_usage_tracker.py
tests/unit/test_load_prompt_with_image_vision_gate.py

…n support When the global `advanced_features.use_vision = true` setting is on, `load_prompt_with_image` would unconditionally wrap the browser planner's human message as a list-typed multimodal `[image_url, text]` block. LLM endpoints that don't accept multimodal content (e.g. WatsonX served via an openai-compatible gateway) then reject the request with `400 messages[1].content must be a string`, killing hybrid-mode web-agent runs before the first step (issue #214). Add a per-model `enable_vision` flag (resolved by `model_supports_vision`, defaulting to `True` to preserve existing behavior for vision-capable models) and gate both the prompt template structure and the runtime image attachment on `use_vision AND model_supports_vision`. Operators whose LLM endpoint rejects multimodal content can set `enable_vision = false` on the relevant `agent.*.model` config to force string-only user content. Closes #214

haroldship · 2026-05-20T15:57:56Z

@coderabbitai review

coderabbitai · 2026-05-20T15:58:04Z

✅ Actions performed

Review triggered.

Note: CodeRabbit is an incremental review system and does not re-review already reviewed commits. This command is applicable only when automatic reviews are paused.

sami-marreed · 2026-05-20T16:21:38Z

+    """
+    if model_config is None:
+        return True
+    return bool(getattr(model_config, 'enable_vision', True))


so this checks if the enable_vision is given from model config? not like based on model e.g. claude 4.6 supports vision why do we need to add flag

Good question — the short answer is the bug isn't actually about model capability; it's about endpoint capability.

The 400 in #214 came from a setup where the model itself does support vision (Llama 4 Maverick served via WatsonX, which is multimodal-capable), but the openai-compatible gateway in front of it doesn't accept multimodal content blocks — it rejects messages[1].content if it's a list. So a name-based allowlist ("claude-4.6 → multimodal OK") would have given the wrong answer here: it would have said yes, sent the list-typed payload, and produced the same 400.

The reason for the model name failing as a signal:

Same model, multiple endpoints — e.g. WatsonX-via-openai-compatible vs WatsonX native. One accepts multimodal, the other doesn't.

Provider names drift fast (litellm prefixes, openrouter aliases, custom model_name strings). A registry of "these names support vision" would need constant upkeep and would still miss the gateway dimension.

Silently downgrading vision when we think the model doesn't support it would mask configuration mistakes — e.g. someone pointing at the wrong endpoint and not noticing because we quietly stripped images.

The enable_vision flag defaults to True, so anyone whose endpoint already works keeps working with no change. The only people who need to set it are those with a non-multimodal gateway — and for them, an explicit opt-out is more discoverable than "why did my screenshots stop being sent?". Open to a follow-up that adds a name-based default if that's useful for the common case, but I'd want it to be additive (a hint, not authoritative) so we can still override per endpoint.

so can we try catch error when image doesn't work and simply then try without

coderabbitai Bot reviewed May 20, 2026

View reviewed changes

Comment thread src/cuga/sdk_core/tests/test_sdk_token_usage_tracker.py

Comment thread src/cuga/sdk.py

haroldship force-pushed the fix/214-hybrid-web-agent-multimodal-content branch from 819d5b5 to 9b7bad8 Compare May 20, 2026 14:35

haroldship requested a review from sami-marreed May 20, 2026 16:19

sami-marreed reviewed May 20, 2026

View reviewed changes

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

fix(browser_planner): gate multimodal content on per-model vision support#251

fix(browser_planner): gate multimodal content on per-model vision support#251
haroldship wants to merge 1 commit into
mainfrom
fix/214-hybrid-web-agent-multimodal-content

haroldship commented May 20, 2026 •

edited by coderabbitai Bot

Loading

Uh oh!

coderabbitai Bot commented May 20, 2026 •

edited

Loading

Walkthrough

Changes

Estimated code review effort

Poem

❌ Failed checks (1 warning)

Uh oh!

coderabbitai Bot left a comment

Uh oh!

Uh oh!

Uh oh!

haroldship commented May 20, 2026

Uh oh!

coderabbitai Bot commented May 20, 2026

Uh oh!

sami-marreed May 20, 2026

Uh oh!

haroldship May 20, 2026

Uh oh!

sami-marreed May 20, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Conversation

haroldship commented May 20, 2026 • edited by coderabbitai Bot Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Bug fix

Summary

Why a per-model flag

Testing

Test results (against this branch, rebased on main)

Summary by CodeRabbit

Uh oh!

coderabbitai Bot commented May 20, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Walkthrough

Changes

Estimated code review effort

Poem

❌ Failed checks (1 warning)

Uh oh!

coderabbitai Bot left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

haroldship commented May 20, 2026

Uh oh!

coderabbitai Bot commented May 20, 2026

Uh oh!

sami-marreed May 20, 2026

Choose a reason for hiding this comment

Uh oh!

haroldship May 20, 2026

Choose a reason for hiding this comment

Uh oh!

sami-marreed May 20, 2026

Choose a reason for hiding this comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

haroldship commented May 20, 2026 •

edited by coderabbitai Bot

Loading

Test results (against this branch, rebased on `main`)

coderabbitai Bot commented May 20, 2026 •

edited

Loading