Skip to content

fix(browser_planner): gate multimodal content on per-model vision support#251

Open
haroldship wants to merge 1 commit into
mainfrom
fix/214-hybrid-web-agent-multimodal-content
Open

fix(browser_planner): gate multimodal content on per-model vision support#251
haroldship wants to merge 1 commit into
mainfrom
fix/214-hybrid-web-agent-multimodal-content

Conversation

@haroldship
Copy link
Copy Markdown
Collaborator

@haroldship haroldship commented May 20, 2026

Bug fix

Fixes #214

Summary

When advanced_features.use_vision = true (the default), load_prompt_with_image unconditionally builds the browser planner's user message as a list-typed multimodal [image_url, text] block. LLM endpoints that don't accept multimodal content (e.g. WatsonX served via an openai-compatible gateway) then reject the request with 400 messages[1].content must be a string, killing hybrid-mode web-agent runs before the first step.

This change introduces a per-model enable_vision flag (resolved by the new model_supports_vision helper, defaulting to True to preserve existing behavior for vision-capable models). Both the prompt template structure (load_prompt_with_image) and the runtime image attachment (BrowserPlannerAgent.run) now gate on settings.advanced_features.use_vision AND model_supports_vision(model_config). Operators whose LLM endpoint rejects multimodal content can set enable_vision = false on the relevant agent.*.model config (e.g. agent.planner.model) to force string-only user content.

Why a per-model flag

The bug surfaces when the endpoint (not the model) cannot deserialize multimodal content blocks. Detecting endpoint capabilities at runtime is brittle, and silently downgrading vision would mask misconfigurations. An explicit enable_vision = false opt-out is small, discoverable, and additive — existing model configs continue to work unchanged because the default is True.

Testing

Rebased onto main (69b7e40a) — branch contains a single commit 9b7bad82.

New regression tests in tests/unit/test_load_prompt_with_image_vision_gate.py (7 tests):

  • model_supports_vision defaults: None config and missing-attribute config both default to True; explicit enable_vision=False and enable_vision=True are honored.
  • Human message is a string when the model disables vision, even with global use_vision=True (the regression for [Bug]: Web agent in hybrid mode fails with "messages[1].content must be a string" (400 invalid_request_error) #214).
  • Human message is multimodal (list with image_url + text items) when both global setting and per-model flag are True.
  • Human message is a string when the global setting is False regardless of per-model flag.

Test results (against this branch, rebased on main)

Suite Result
tests/unit/test_load_prompt_with_image_vision_gate.py (new) ✅ 7 / 7 passed
src/cuga/backend/tools_env/registry/tests/ ✅ all passed
src/cuga/backend/cuga_graph/nodes/api/variables_manager/tests/ ✅ all passed
src/cuga/backend/cuga_graph/nodes/cuga_lite/executors/tests/ (non-e2b) ✅ all passed
tests/unit/test_task_analyzer_app_matching.py ✅ all passed
tests/unit/test_cuga_lite_bind_tools.py ✅ all passed
src/cuga/backend/cuga_graph/nodes/cuga_lite/executors/tests/test_e2b_lite.py ⚠️ 3 failed — RuntimeError: e2b-code-interpreter package not installed (env-only; CI runs these with uv sync --extra e2b per src/scripts/run_tests.sh)
src/cuga/sdk_core/tests/ (live LLM integration) ⚠️ flaky locally — pass individually, fail when run in large batches due to external-service rate limits / setup. Unrelated to this diff.
ruff check (touched files) ✅ clean
ruff format --check (touched files) ✅ clean
CodeRabbit CLI (--type committed --base main) ✅ no findings against this diff

Aggregate: 211 passing across the unit-test subset relevant to this change, with the only failures being env/network issues that also fail on main.

  • Verified fix locally; tests pass

Summary by CodeRabbit

  • Bug Fixes

    • Enhanced multimodal content handling to validate model vision capabilities before attaching images, preventing errors when models reject image content
    • Added per-model vision capability checks alongside global settings for more granular control
  • Tests

    • Added comprehensive test coverage for vision feature gating and multimodal fallback behavior across different model configurations

Review Change Stack

@coderabbitai
Copy link
Copy Markdown
Contributor

coderabbitai Bot commented May 20, 2026

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: b1884a26-b87a-4378-8595-461fdde6c4cd

📥 Commits

Reviewing files that changed from the base of the PR and between 819d5b5 and 9b7bad8.

📒 Files selected for processing (3)
  • src/cuga/backend/cuga_graph/nodes/browser/browser_planner_agent/browser_planner_agent.py
  • src/cuga/backend/llm/utils/helpers.py
  • tests/unit/test_load_prompt_with_image_vision_gate.py

📝 Walkthrough

Walkthrough

The PR fixes issue #214 by introducing per-model vision capability detection. It prevents multimodal message content from being sent to models that reject list-typed content, using both global settings and model-specific configuration to gate vision features.

Changes

Per-model vision capability detection and gating

Layer / File(s) Summary
Model vision capability detection and prompt formatting
src/cuga/backend/llm/utils/helpers.py
New model_supports_vision() helper checks model config for enable_vision capability. load_prompt_with_image() gates multimodal prompt selection using both global settings.advanced_features.use_vision flag and per-model capability.
BrowserPlannerAgent vision capability integration
src/cuga/backend/cuga_graph/nodes/browser/browser_planner_agent/browser_planner_agent.py
Constructor accepts injected use_vision_effective flag. run() uses the flag to conditionally set prompt variables and attach images. create() computes use_vision_effective from global setting and model capability, passing it to constructor.
Vision gating regression tests
tests/unit/test_load_prompt_with_image_vision_gate.py
Tests validate model_supports_vision() defaulting behavior, model/global gating of multimodal content, and correct message formatting when vision is enabled or disabled.

Estimated code review effort

🎯 2 (Simple) | ⏱️ ~12 minutes

Poem

🐰 A model cried "no lists for me!"
So we check its vision capability,
Global settings dance with model config,
Multimodal prompts now un-block,
Web agents skip the image grief! 🎉

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

Check name Status Explanation Resolution
Docstring Coverage ⚠️ Warning Docstring coverage is 54.29% which is insufficient. The required threshold is 80.00%. Write docstrings for the functions missing them to satisfy the coverage threshold.
✅ Passed checks (4 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Title check ✅ Passed The title 'fix(browser_planner): gate multimodal content on per-model vision support' accurately describes the main change: introducing per-model vision support gating for multimodal content in the browser planner agent.
Linked Issues check ✅ Passed The changes address all coding objectives from #214: preventing list-typed content for models that require strings, making multimodal behavior configurable per-model, preserving default behavior, and adding comprehensive tests.
Out of Scope Changes check ✅ Passed All changes are directly scoped to #214: the new model_supports_vision helper, BrowserPlannerAgent refactoring, load_prompt_with_image gating, and unit tests all address the multimodal content rejection issue without introducing unrelated modifications.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
📝 Generate docstrings
  • Create stacked PR
  • Commit on current branch
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Commit unit tests in branch fix/214-hybrid-web-agent-multimodal-content

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

Copy link
Copy Markdown
Contributor

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 2

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@src/cuga/sdk_core/tests/test_sdk_token_usage_tracker.py`:
- Around line 86-118: The test currently creates a local ActivityTracker and
compares it with the tracker inside TokenUsageTracker, but the SDK should expose
a shared tracker on the agent; update the test to use the agent's shared tracker
instead of creating a new one. Specifically, remove the local `tracker =
ActivityTracker()` and instead obtain the tracker from the agent (e.g., `tracker
= agent._activity_tracker` or after building callbacks assert
`token_tracker.tracker is agent._activity_tracker`), reset `tracker.prompts =
[]`, then proceed to call `tracker.collect_prompt(...)` and assert prompts; also
replace the identity assertion `token_tracker.tracker is tracker` with
`token_tracker.tracker is agent._activity_tracker` (or equivalent) to match the
fixed SDK behavior.

In `@src/cuga/sdk.py`:
- Around line 1346-1366: The ActivityTracker is recreated each time in
_build_callbacks(), preventing accumulation across calls; initialize
self._activity_tracker once in __init__ (create ActivityTracker() there) and
then change _build_callbacks() to use that instance (e.g.,
TokenUsageTracker(self._activity_tracker)) so all invocations from
_apply_callbacks(), invoke(), stream() and graph creation share the same
tracker; if needed, lazily create self._activity_tracker inside
_build_callbacks() only when it is None to preserve backward compatibility.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: aeaf6429-3498-49e7-814c-ae256d66bb68

📥 Commits

Reviewing files that changed from the base of the PR and between 69b7e40 and 819d5b5.

📒 Files selected for processing (5)
  • src/cuga/backend/cuga_graph/nodes/browser/browser_planner_agent/browser_planner_agent.py
  • src/cuga/backend/llm/utils/helpers.py
  • src/cuga/sdk.py
  • src/cuga/sdk_core/tests/test_sdk_token_usage_tracker.py
  • tests/unit/test_load_prompt_with_image_vision_gate.py

Comment thread src/cuga/sdk_core/tests/test_sdk_token_usage_tracker.py
Comment thread src/cuga/sdk.py
…n support

When the global `advanced_features.use_vision = true` setting is on,
`load_prompt_with_image` would unconditionally wrap the browser planner's
human message as a list-typed multimodal `[image_url, text]` block. LLM
endpoints that don't accept multimodal content (e.g. WatsonX served via an
openai-compatible gateway) then reject the request with
`400 messages[1].content must be a string`, killing hybrid-mode web-agent
runs before the first step (issue #214).

Add a per-model `enable_vision` flag (resolved by `model_supports_vision`,
defaulting to `True` to preserve existing behavior for vision-capable
models) and gate both the prompt template structure and the runtime image
attachment on `use_vision AND model_supports_vision`. Operators whose LLM
endpoint rejects multimodal content can set `enable_vision = false` on the
relevant `agent.*.model` config to force string-only user content.

Closes #214
@haroldship haroldship force-pushed the fix/214-hybrid-web-agent-multimodal-content branch from 819d5b5 to 9b7bad8 Compare May 20, 2026 14:35
@haroldship
Copy link
Copy Markdown
Collaborator Author

@coderabbitai review

@coderabbitai
Copy link
Copy Markdown
Contributor

coderabbitai Bot commented May 20, 2026

✅ Actions performed

Review triggered.

Note: CodeRabbit is an incremental review system and does not re-review already reviewed commits. This command is applicable only when automatic reviews are paused.

@haroldship haroldship requested a review from sami-marreed May 20, 2026 16:19
"""
if model_config is None:
return True
return bool(getattr(model_config, 'enable_vision', True))
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

so this checks if the enable_vision is given from model config? not like based on model e.g. claude 4.6 supports vision why do we need to add flag

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good question — the short answer is the bug isn't actually about model capability; it's about endpoint capability.

The 400 in #214 came from a setup where the model itself does support vision (Llama 4 Maverick served via WatsonX, which is multimodal-capable), but the openai-compatible gateway in front of it doesn't accept multimodal content blocks — it rejects messages[1].content if it's a list. So a name-based allowlist ("claude-4.6 → multimodal OK") would have given the wrong answer here: it would have said yes, sent the list-typed payload, and produced the same 400.

The reason for the model name failing as a signal:

  • Same model, multiple endpoints — e.g. WatsonX-via-openai-compatible vs WatsonX native. One accepts multimodal, the other doesn't.
  • Provider names drift fast (litellm prefixes, openrouter aliases, custom model_name strings). A registry of "these names support vision" would need constant upkeep and would still miss the gateway dimension.
  • Silently downgrading vision when we think the model doesn't support it would mask configuration mistakes — e.g. someone pointing at the wrong endpoint and not noticing because we quietly stripped images.

The enable_vision flag defaults to True, so anyone whose endpoint already works keeps working with no change. The only people who need to set it are those with a non-multimodal gateway — and for them, an explicit opt-out is more discoverable than "why did my screenshots stop being sent?". Open to a follow-up that adds a name-based default if that's useful for the common case, but I'd want it to be additive (a hint, not authoritative) so we can still override per endpoint.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

so can we try catch error when image doesn't work and simply then try without

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

[Bug]: Web agent in hybrid mode fails with "messages[1].content must be a string" (400 invalid_request_error)

2 participants