fix(browser_planner): gate multimodal content on per-model vision support#251
fix(browser_planner): gate multimodal content on per-model vision support#251haroldship wants to merge 1 commit into
Conversation
|
No actionable comments were generated in the recent review. 🎉 ℹ️ Recent review info⚙️ Run configurationConfiguration used: defaults Review profile: CHILL Plan: Pro Run ID: 📒 Files selected for processing (3)
📝 WalkthroughWalkthroughThe PR fixes issue ChangesPer-model vision capability detection and gating
Estimated code review effort🎯 2 (Simple) | ⏱️ ~12 minutes Poem
🚥 Pre-merge checks | ✅ 4 | ❌ 1❌ Failed checks (1 warning)
✅ Passed checks (4 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches📝 Generate docstrings
🧪 Generate unit tests (beta)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
There was a problem hiding this comment.
Actionable comments posted: 2
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@src/cuga/sdk_core/tests/test_sdk_token_usage_tracker.py`:
- Around line 86-118: The test currently creates a local ActivityTracker and
compares it with the tracker inside TokenUsageTracker, but the SDK should expose
a shared tracker on the agent; update the test to use the agent's shared tracker
instead of creating a new one. Specifically, remove the local `tracker =
ActivityTracker()` and instead obtain the tracker from the agent (e.g., `tracker
= agent._activity_tracker` or after building callbacks assert
`token_tracker.tracker is agent._activity_tracker`), reset `tracker.prompts =
[]`, then proceed to call `tracker.collect_prompt(...)` and assert prompts; also
replace the identity assertion `token_tracker.tracker is tracker` with
`token_tracker.tracker is agent._activity_tracker` (or equivalent) to match the
fixed SDK behavior.
In `@src/cuga/sdk.py`:
- Around line 1346-1366: The ActivityTracker is recreated each time in
_build_callbacks(), preventing accumulation across calls; initialize
self._activity_tracker once in __init__ (create ActivityTracker() there) and
then change _build_callbacks() to use that instance (e.g.,
TokenUsageTracker(self._activity_tracker)) so all invocations from
_apply_callbacks(), invoke(), stream() and graph creation share the same
tracker; if needed, lazily create self._activity_tracker inside
_build_callbacks() only when it is None to preserve backward compatibility.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro
Run ID: aeaf6429-3498-49e7-814c-ae256d66bb68
📒 Files selected for processing (5)
src/cuga/backend/cuga_graph/nodes/browser/browser_planner_agent/browser_planner_agent.pysrc/cuga/backend/llm/utils/helpers.pysrc/cuga/sdk.pysrc/cuga/sdk_core/tests/test_sdk_token_usage_tracker.pytests/unit/test_load_prompt_with_image_vision_gate.py
…n support When the global `advanced_features.use_vision = true` setting is on, `load_prompt_with_image` would unconditionally wrap the browser planner's human message as a list-typed multimodal `[image_url, text]` block. LLM endpoints that don't accept multimodal content (e.g. WatsonX served via an openai-compatible gateway) then reject the request with `400 messages[1].content must be a string`, killing hybrid-mode web-agent runs before the first step (issue #214). Add a per-model `enable_vision` flag (resolved by `model_supports_vision`, defaulting to `True` to preserve existing behavior for vision-capable models) and gate both the prompt template structure and the runtime image attachment on `use_vision AND model_supports_vision`. Operators whose LLM endpoint rejects multimodal content can set `enable_vision = false` on the relevant `agent.*.model` config to force string-only user content. Closes #214
819d5b5 to
9b7bad8
Compare
|
@coderabbitai review |
✅ Actions performedReview triggered.
|
| """ | ||
| if model_config is None: | ||
| return True | ||
| return bool(getattr(model_config, 'enable_vision', True)) |
There was a problem hiding this comment.
so this checks if the enable_vision is given from model config? not like based on model e.g. claude 4.6 supports vision why do we need to add flag
There was a problem hiding this comment.
Good question — the short answer is the bug isn't actually about model capability; it's about endpoint capability.
The 400 in #214 came from a setup where the model itself does support vision (Llama 4 Maverick served via WatsonX, which is multimodal-capable), but the openai-compatible gateway in front of it doesn't accept multimodal content blocks — it rejects messages[1].content if it's a list. So a name-based allowlist ("claude-4.6 → multimodal OK") would have given the wrong answer here: it would have said yes, sent the list-typed payload, and produced the same 400.
The reason for the model name failing as a signal:
- Same model, multiple endpoints — e.g. WatsonX-via-openai-compatible vs WatsonX native. One accepts multimodal, the other doesn't.
- Provider names drift fast (litellm prefixes, openrouter aliases, custom
model_namestrings). A registry of "these names support vision" would need constant upkeep and would still miss the gateway dimension. - Silently downgrading vision when we think the model doesn't support it would mask configuration mistakes — e.g. someone pointing at the wrong endpoint and not noticing because we quietly stripped images.
The enable_vision flag defaults to True, so anyone whose endpoint already works keeps working with no change. The only people who need to set it are those with a non-multimodal gateway — and for them, an explicit opt-out is more discoverable than "why did my screenshots stop being sent?". Open to a follow-up that adds a name-based default if that's useful for the common case, but I'd want it to be additive (a hint, not authoritative) so we can still override per endpoint.
There was a problem hiding this comment.
so can we try catch error when image doesn't work and simply then try without
Bug fix
Fixes #214
Summary
When
advanced_features.use_vision = true(the default),load_prompt_with_imageunconditionally builds the browser planner's user message as a list-typed multimodal[image_url, text]block. LLM endpoints that don't accept multimodal content (e.g. WatsonX served via an openai-compatible gateway) then reject the request with400 messages[1].content must be a string, killing hybrid-mode web-agent runs before the first step.This change introduces a per-model
enable_visionflag (resolved by the newmodel_supports_visionhelper, defaulting toTrueto preserve existing behavior for vision-capable models). Both the prompt template structure (load_prompt_with_image) and the runtime image attachment (BrowserPlannerAgent.run) now gate onsettings.advanced_features.use_vision AND model_supports_vision(model_config). Operators whose LLM endpoint rejects multimodal content can setenable_vision = falseon the relevantagent.*.modelconfig (e.g.agent.planner.model) to force string-only user content.Why a per-model flag
The bug surfaces when the endpoint (not the model) cannot deserialize multimodal content blocks. Detecting endpoint capabilities at runtime is brittle, and silently downgrading vision would mask misconfigurations. An explicit
enable_vision = falseopt-out is small, discoverable, and additive — existing model configs continue to work unchanged because the default isTrue.Testing
Rebased onto
main(69b7e40a) — branch contains a single commit9b7bad82.New regression tests in
tests/unit/test_load_prompt_with_image_vision_gate.py(7 tests):model_supports_visiondefaults:Noneconfig and missing-attribute config both default toTrue; explicitenable_vision=Falseandenable_vision=Trueare honored.use_vision=True(the regression for [Bug]: Web agent in hybrid mode fails with "messages[1].content must be a string" (400 invalid_request_error) #214).image_url+textitems) when both global setting and per-model flag areTrue.Falseregardless of per-model flag.Test results (against this branch, rebased on
main)tests/unit/test_load_prompt_with_image_vision_gate.py(new)src/cuga/backend/tools_env/registry/tests/src/cuga/backend/cuga_graph/nodes/api/variables_manager/tests/src/cuga/backend/cuga_graph/nodes/cuga_lite/executors/tests/(non-e2b)tests/unit/test_task_analyzer_app_matching.pytests/unit/test_cuga_lite_bind_tools.pysrc/cuga/backend/cuga_graph/nodes/cuga_lite/executors/tests/test_e2b_lite.pyRuntimeError: e2b-code-interpreter package not installed(env-only; CI runs these withuv sync --extra e2bpersrc/scripts/run_tests.sh)src/cuga/sdk_core/tests/(live LLM integration)ruff check(touched files)ruff format --check(touched files)--type committed --base main)Aggregate: 211 passing across the unit-test subset relevant to this change, with the only failures being env/network issues that also fail on
main.Summary by CodeRabbit
Bug Fixes
Tests