fix: Handle normalized coordinates from local grounding models (e.g. InternVL)#159
fix: Handle normalized coordinates from local grounding models (e.g. InternVL)#159buiilding wants to merge 1 commit intosimular-ai:mainfrom
Conversation
WalkthroughThe Changes
Estimated code review effort🎯 2 (Simple) | ⏱️ ~10 minutes
Poem
Pre-merge checks and finishing touches❌ Failed checks (1 warning)
✅ Passed checks (2 passed)
✨ Finishing touches
🧪 Generate unit tests (beta)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
There was a problem hiding this comment.
Actionable comments posted: 2
🧹 Nitpick comments (2)
gui_agents/s3/agents/grounding.py (2)
244-249: Regex pattern may miss decimals without leading digits.The pattern
r"\d+\.?\d*"correctly captures most floating-point numbers but will not match numbers starting with a decimal point (e.g., ".456"). While most models output "0.456" format, consider usingr"\d+\.?\d*|\.\d+"for complete coverage.🔎 More robust regex pattern
- # Regex to find floating point numbers (0.xxxx) or integers - numericals = re.findall(r"\d+\.?\d*", response) + # Regex to find floating point numbers (0.xxxx, .xxxx) or integers + numericals = re.findall(r"\d+\.?\d*|\.\d+", response)
242-242: Consider using logger instead of print for debugging.The code uses
print()for debugging output. For consistency with the rest of the codebase (which useslogging.getLogger("desktopenv.agent")at line 16), consider using the logger instead.🔎 Proposed change
- print("RAW GROUNDING MODEL RESPONSE:", response) + logger.debug(f"RAW GROUNDING MODEL RESPONSE: {response}")
| # If coordinates are normalized (0-1), scale them up | ||
| if x <= 1.0 and y <= 1.0: | ||
| x = int(x * self.engine_params_for_grounding["grounding_width"]) | ||
| y = int(y * self.engine_params_for_grounding["grounding_height"]) | ||
| else: | ||
| x = int(x) | ||
| y = int(y) |
There was a problem hiding this comment.
Add defensive checks for required engine parameters.
The code directly accesses self.engine_params_for_grounding["grounding_width"] and ["grounding_height"] without verifying these keys exist. If these parameters are missing from the configuration, the code will raise a KeyError and crash the agent.
🔎 Proposed fix with defensive checks
# If coordinates are normalized (0-1), scale them up
if x <= 1.0 and y <= 1.0:
- x = int(x * self.engine_params_for_grounding["grounding_width"])
- y = int(y * self.engine_params_for_grounding["grounding_height"])
+ grounding_width = self.engine_params_for_grounding.get("grounding_width")
+ grounding_height = self.engine_params_for_grounding.get("grounding_height")
+ if grounding_width is None or grounding_height is None:
+ raise ValueError("grounding_width and grounding_height must be specified in engine_params_for_grounding when using normalized coordinates")
+ x = int(x * grounding_width)
+ y = int(y * grounding_height)
else:
x = int(x)
y = int(y)📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
| # If coordinates are normalized (0-1), scale them up | |
| if x <= 1.0 and y <= 1.0: | |
| x = int(x * self.engine_params_for_grounding["grounding_width"]) | |
| y = int(y * self.engine_params_for_grounding["grounding_height"]) | |
| else: | |
| x = int(x) | |
| y = int(y) | |
| # If coordinates are normalized (0-1), scale them up | |
| if x <= 1.0 and y <= 1.0: | |
| grounding_width = self.engine_params_for_grounding.get("grounding_width") | |
| grounding_height = self.engine_params_for_grounding.get("grounding_height") | |
| if grounding_width is None or grounding_height is None: | |
| raise ValueError("grounding_width and grounding_height must be specified in engine_params_for_grounding when using normalized coordinates") | |
| x = int(x * grounding_width) | |
| y = int(y * grounding_height) | |
| else: | |
| x = int(x) | |
| y = int(y) |
🤖 Prompt for AI Agents
In gui_agents/s3/agents/grounding.py around lines 251 to 257, the code accesses
self.engine_params_for_grounding["grounding_width"] and ["grounding_height"]
directly which can raise KeyError; add defensive checks: verify the keys exist
and their values are numeric and >0 before using them (e.g., use dict.get with a
sensible default or explicitly raise a clear ValueError if missing/invalid),
convert to int only after validation, and ensure the fallback behavior for
normalized coordinates still makes sense when defaults are applied.
Consider edge cases in normalized coordinate detection.
The condition x <= 1.0 and y <= 1.0 may misidentify absolute coordinates as normalized in edge cases:
- Absolute coordinates like (0, 0), (1, 1), (0, 500), or (1, 768) where one or both values are ≤ 1.0 would trigger incorrect scaling
- Mixed coordinates (one normalized, one absolute) like (0.5, 500) would be treated as absolute, potentially causing incorrect clicks
While these edge cases are unlikely in typical usage, consider adding a more robust heuristic, such as checking if both coordinates are in the range [0.0, 1.0] and the model is known to output normalized coordinates (via a configuration flag).
🔎 Suggested improvement with configuration flag
Add a configuration parameter to explicitly indicate coordinate format:
# In __init__ or configuration
self.uses_normalized_coords = engine_params_for_grounding.get("normalized_coordinates", False)Then update the detection logic:
- # If coordinates are normalized (0-1), scale them up
- if x <= 1.0 and y <= 1.0:
+ # If coordinates are normalized (0-1), scale them up
+ should_scale = (x <= 1.0 and y <= 1.0) if self.uses_normalized_coords else (x < 1.0 and y < 1.0)
+ if should_scale:
x = int(x * self.engine_params_for_grounding["grounding_width"])
y = int(y * self.engine_params_for_grounding["grounding_height"])
else:Committable suggestion skipped: line range outside the PR's diff.
🤖 Prompt for AI Agents
In gui_agents/s3/agents/grounding.py around lines 251-257, the current check (x
<= 1.0 and y <= 1.0) can misclassify absolute coordinates as normalized; add a
configuration flag self.uses_normalized_coords =
engine_params_for_grounding.get("normalized_coordinates", False) (initialized in
__init__) and change the detection to only scale when
self.uses_normalized_coords is True and both x and y are within [0.0, 1.0];
otherwise treat as absolute coordinates (convert to int) and optionally emit a
debug/warning if one coordinate looks normalized and the other absolute to aid
debugging.
| y = float(numericals[1]) | ||
|
|
||
| # If coordinates are normalized (0-1), scale them up | ||
| if x <= 1.0 and y <= 1.0: |
There was a problem hiding this comment.
Thanks for the suggestion. Would you mind adding additional check of if both
grounding_width and grounding_height key exists in engine_params_for_grounding?
This avoids the edge case of either y or x == 1.0
Problem
When using local grounding models like
OpenGVLab/InternVL3_5-4Bvia vLLM, the model may return normalized coordinates (0.0 - 1.0) instead of absolute pixel coordinates.The previous implementation only extracted integers (
re.findall(r"\d+", response)), which caused two issues:0.4518became0and4518).ycoordinate often became a large integer (e.g.,4518), causingpyautoguito trigger aFailSafeExceptionby hitting the screen edge.Solution
Updated
gui_agents/s3/agents/grounding.pyto:re.findall(r"\d+\.?\d*", response).grounding_widthandgrounding_heightto get correct screen pixel values.Testing
Tested with
agent_susingvllmprovider andInternVL3_5-4B. The agent now correctly clicks on target elements instead of crashing withpyautogui.FailSafeException.Summary by CodeRabbit
✏️ Tip: You can customize this high-level summary in your review settings.