fix: improve coordinate precision with multi-sample voting#177
fix: improve coordinate precision with multi-sample voting#177Chase-Xuu wants to merge 5 commits intosimular-ai:mainfrom
Conversation
|
Note Reviews pausedIt looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the Use the following commands to manage reviews:
Use the checkboxes below for quick actions:
📝 WalkthroughWalkthroughThe grounding agent now adds structured coordinate parsing and normalization helpers, imports numpy, and updates Changes
Sequence Diagram(s)sequenceDiagram
participant Caller as Caller
participant Agent as OSWorldACI
participant Model as GroundingModel
participant Image as Image/Normalizer
Note over Agent,Model: n_samples >= 1
Caller->>Agent: generate_coords(ref_expr, obs, n_samples)
Agent->>Model: reset()/send_prompt(ref_expr, obs)
Model-->>Agent: raw_response (text with coords)
loop per sample (if n_samples > 1)
Agent->>Agent: _parse_coordinates(raw_response)
Agent->>Agent: collect valid (x,y)
end
Agent->>Agent: aggregate samples (numpy.median) if multiple
Agent->>Image: _normalize_coordinates(x,y)
Image-->>Agent: (x_pixel,y_pixel)
Agent-->>Caller: [x_pixel, y_pixel]
Estimated code review effort🎯 3 (Moderate) | ⏱️ ~20 minutes Poem
🚥 Pre-merge checks | ✅ 4 | ❌ 1❌ Failed checks (1 warning)
✅ Passed checks (4 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing touches
🧪 Generate unit tests (beta)
No actionable comments were generated in the recent review. 🎉 🧹 Recent nitpick comments
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
There was a problem hiding this comment.
Actionable comments posted: 2
🤖 Fix all issues with AI agents
In `@gui_agents/s3/agents/grounding.py`:
- Line 237: The regex used to extract numbers in the assignment to variable
"numericals" (re.findall(...)) won't match decimals missing a leading zero like
".45"; change the pattern to one that accepts optional leading digits and
optional sign and requires at least one digit overall (for example a pattern
equivalent to r"[-+]?(?:\d*\.\d+|\d+)") so that re.findall used in that line
will capture integers, decimals with or without a leading zero, and optional
signs.
- Around line 252-257: The coordinate-scaling block that checks "if x <= 1.0 and
y <= 1.0:" should be made stricter and the rounding simplified: change the
heuristic to "if x < 1.0 and y < 1.0:" to avoid treating tiny pixel coordinates
like (1,0) as normalized, scale with grounding_width/grounding_height when that
condition holds, and return rounded integer coordinates using round(x) and
round(y) (e.g., return (round(x), round(y))). Update references to the variables
x, y, grounding_width, grounding_height and the final return in the grounding
coordinate conversion code; optionally consider wiring an explicit
"is_normalized" flag from model/config later instead of relying on this
heuristic.
🧹 Nitpick comments (4)
gui_agents/s3/agents/grounding.py (4)
277-325: Code duplication between single-sample and multi-sample paths.The single-sample branch (lines 277–293) duplicates the prompt construction, model call, parsing, and normalization from the multi-sample branch (lines 298–311). The multi-sample path already handles
n_samples=1correctly (median of one sample = that sample), so the special case is unnecessary and increases maintenance burden.Suggested consolidation
- if n_samples <= 1: - # Original single-sample behavior for backward compatibility - self.grounding_model.reset() - prompt = f"Query:{ref_expr}\nOutput only the coordinate of one point in your response.\n" - self.grounding_model.add_message( - text_content=prompt, image_content=obs["screenshot"], put_text_last=True - ) - - response = call_llm_safe(self.grounding_model) - print("RAW GROUNDING MODEL RESPONSE:", response) - - coords = self._parse_coordinates(response) - if coords is None: - raise ValueError(f"Failed to parse coordinates from response: {response}") - - x, y = self._normalize_coordinates(coords[0], coords[1]) - return [x, y] - - # Multi-sample voting for improved accuracy + n_samples = max(1, n_samples) samples = []
286-286: Useloggerinstead ofprint()for debug output.The rest of the file uses the
loggerinstance (line 17). Theseprint()calls should uselogger.debug()orlogger.info()for consistency and to allow callers to control verbosity.Also applies to: 306-306, 323-323
440-440: Existing callers ofgenerate_coordsdon't passn_samples—the feature is opt-in only.All call sites (
click,type,drag_and_drop,scroll) use the defaultn_samples=1. This is fine for backward compatibility, but means no caller currently benefits from the new feature. Consider whether a class-level default (e.g.,self.default_n_samplesset in__init__) would be more practical, letting users configure it once rather than modifying every action method.Also applies to: 519-519, 563-564, 692-692
6-6: Replacenp.medianwithstatistics.medianto avoid unnecessary import.The file only uses
numpyfornp.median()at lines 320-321. Python's stdlibstatistics.medianhandles this identically and removes the import:from statistics import median final_x = int(round(median(x_coords))) final_y = int(round(median(y_coords)))While numpy is a direct project dependency, this change improves code clarity by removing an unnecessary import from this specific module.
There was a problem hiding this comment.
Actionable comments posted: 4
🤖 Fix all issues with AI agents
In `@ANALYSIS.md`:
- Around line 1-147: The ANALYSIS.md planning/analysis document should not be
committed to the repo root; remove the file from the PR (or revert the commit
that added it) and move its content into the PR description or a linked GitHub
issue instead, then update the branch/commit to reflect the removal; also fix
the fenced code block inside that document by adding the appropriate language
specifier (e.g., ```python or ```text) so markdownlint MD040 is resolved.
In `@grounding_improved.py`:
- Around line 1-190: This prototype file should be removed from the PR: delete
grounding_improved.py because it duplicates functionality already merged into
gui_agents/s3/agents/grounding.py and contains multiple defects (unbound free
functions generate_coords and generate_coords_with_ocr_fallback that take self,
use of undefined call_llm_safe, a deficient regex in _parse_coordinates that
misses “.45” style numbers, and an unused unpacked variable ocr_table); remove
the file from the commit/PR (or move only a cleaned, vetted subset into the
canonical grounding.py), or if you intend to keep any logic, integrate/fix it
inside gui_agents/s3/agents/grounding.py by binding methods to the agent class,
importing/using call_llm_safe, updating the regex to include |\.\d+, and
removing the unused ocr_table variable.
In `@gui_agents/s3/agents/grounding.py`:
- Line 260: The generate_coords(ref_expr: str, obs: Dict, n_samples: int = 1)
function’s n_samples argument is never exercised because all callers (click,
type, drag_and_drop, scroll) call it with only ref_expr and obs; fix by wiring
n_samples through: either add an instance attribute on OSWorldACI (e.g.,
self.n_samples set from CLI/config) and have each caller call
generate_coords(..., n_samples=self.n_samples), or add an explicit n_samples
parameter to the caller methods (click, type, drag_and_drop, scroll) and forward
that value into generate_coords; update the OSWorldACI constructor or call sites
to source the configured sample count so multi-sample voting is actually used.
In `@README_CONTRIBUTION.md`:
- Around line 1-101: README_CONTRIBUTION.md contains personal notes and PII and
must be removed from the PR; delete the file from the branch (git rm
README_CONTRIBUTION.md && commit) so it is not included in the PR, ensure no
other commits reintroduce it, and if you need local notes keep them outside the
repo or add the filename to .gitignore/history cleanup if necessary; confirm the
PR diff no longer lists README_CONTRIBUTION.md.
🧹 Nitpick comments (1)
gui_agents/s3/agents/grounding.py (1)
286-286: Useloggerinstead ofprint()for grounding debug output.The file already sets up
logger = logging.getLogger("desktopenv.agent")at Line 17, but the new code uses rawprint()calls for debug output (Lines 286, 306, 323). This is inconsistent with the rest of the codebase and makes it impossible to control log verbosity.Proposed fix
- print("RAW GROUNDING MODEL RESPONSE:", response) + logger.debug("RAW GROUNDING MODEL RESPONSE: %s", response)- print(f"GROUNDING SAMPLE {i+1}/{n_samples}: {response}") + logger.debug("GROUNDING SAMPLE %d/%d: %s", i + 1, n_samples, response)- print(f"MULTI-SAMPLE RESULT: samples={samples}, median=({final_x}, {final_y})") + logger.debug("MULTI-SAMPLE RESULT: samples=%s, median=(%d, %d)", samples, final_x, final_y)Also applies to: 306-306, 323-323
| assert len(numericals) >= 2 | ||
| return [int(numericals[0]), int(numericals[1])] | ||
| # Given the state and worker's referring expression, use the grounding model to generate (x,y) | ||
| def generate_coords(self, ref_expr: str, obs: Dict, n_samples: int = 1) -> List[int]: |
There was a problem hiding this comment.
🧩 Analysis chain
🏁 Script executed:
cd gui_agents/s3/agents && rg -n 'generate_coords\(' --type=py -A 1Repository: simular-ai/Agent-S
Length of output: 914
🏁 Script executed:
cd gui_agents/s3/agents && rg -n 'def (click|type|drag_and_drop|scroll)\(' grounding.pyRepository: simular-ai/Agent-S
Length of output: 211
🏁 Script executed:
cd gui_agents && rg -n 'generate_coords\(' --type=pyRepository: simular-ai/Agent-S
Length of output: 1403
n_samples parameter is unreachable—all callers use the default value.
All five callers of generate_coords() in this file (click at line 440, type at line 519, drag_and_drop at lines 563–564, and scroll at line 692) pass only ref_expr and obs arguments, always defaulting to n_samples=1. The multi-sample voting feature cannot be exercised without wiring the parameter to a configurable source.
Consider either:
- Storing
n_samplesas an instance attribute onOSWorldACI(set from CLI args / config) and using it in each caller, or - Passing
n_samplesthrough from the callers.
🤖 Prompt for AI Agents
In `@gui_agents/s3/agents/grounding.py` at line 260, The generate_coords(ref_expr:
str, obs: Dict, n_samples: int = 1) function’s n_samples argument is never
exercised because all callers (click, type, drag_and_drop, scroll) call it with
only ref_expr and obs; fix by wiring n_samples through: either add an instance
attribute on OSWorldACI (e.g., self.n_samples set from CLI/config) and have each
caller call generate_coords(..., n_samples=self.n_samples), or add an explicit
n_samples parameter to the caller methods (click, type, drag_and_drop, scroll)
and forward that value into generate_coords; update the OSWorldACI constructor
or call sites to source the configured sample count so multi-sample voting is
actually used.
Addresses simular-ai#175 - coordinate offset problem for small buttons. Changes: - Add n_samples parameter to generate_coords() for multi-sample voting - Extract coordinate parsing into _parse_coordinates() helper - Handle both integer and normalized (0-1) coordinate formats - Use median for robustness to outliers - Fully backward compatible (n_samples=1 by default) The multi-sample approach reduces variance by running the grounding model multiple times and taking the median, which significantly improves click accuracy for small UI elements.
Address review feedback: regex now matches numbers like '.45' in addition to '0.45' and integers.
These files were accidentally committed - they're personal notes and prototype code that don't belong in the repository.
ce98677 to
638a56a
Compare
There was a problem hiding this comment.
Actionable comments posted: 1
🤖 Fix all issues with AI agents
In `@gui_agents/s3/agents/grounding.py`:
- Around line 298-311: The multi-sample loop in the grounding routine is
ineffective because call_llm_safe defaults to temperature=0.0; update the loop
that uses self.grounding_model and call_llm_safe so multi-sampling actually
varies outputs by supplying a non-zero temperature (e.g., 0.3–0.7) when invoking
the model, or alternatively call the grounding model's own get_response /
inference method with a non-zero temperature; keep the rest of the logic that
parses coordinates via _parse_coordinates and normalizes them with
_normalize_coordinates and appends to samples unchanged.
🧹 Nitpick comments (4)
gui_agents/s3/agents/grounding.py (4)
277-325: Significant code duplication between single-sample and multi-sample paths.The single-sample path (lines 277–293) is essentially the body of the multi-sample loop (lines 298–311). Consider unifying them — the single-sample case is just
n_samples=1with the median of one sample being itself.Sketch of unified approach
- if n_samples <= 1: - # Original single-sample behavior for backward compatibility - self.grounding_model.reset() - prompt = f"Query:{ref_expr}\nOutput only the coordinate of one point in your response.\n" - self.grounding_model.add_message( - text_content=prompt, image_content=obs["screenshot"], put_text_last=True - ) - - response = call_llm_safe(self.grounding_model) - print("RAW GROUNDING MODEL RESPONSE:", response) - - coords = self._parse_coordinates(response) - if coords is None: - raise ValueError(f"Failed to parse coordinates from response: {response}") - - x, y = self._normalize_coordinates(coords[0], coords[1]) - return [x, y] - - # Multi-sample voting for improved accuracy + n_samples = max(1, n_samples) samples = [] - + for i in range(n_samples): self.grounding_model.reset() prompt = f"Query:{ref_expr}\nOutput only the coordinate of one point in your response.\n" ... - + coords = self._parse_coordinates(response) if coords is not None: x, y = self._normalize_coordinates(coords[0], coords[1]) samples.append((x, y)) - + if not samples: raise ValueError(f"Failed to get valid coordinates after {n_samples} attempts") - + x_coords = [s[0] for s in samples] y_coords = [s[1] for s in samples] - - final_x = int(round(np.median(x_coords))) - final_y = int(round(np.median(y_coords))) + final_x = round(float(np.median(x_coords))) + final_y = round(float(np.median(y_coords))) return [final_x, final_y]
6-6: Consider usingstatistics.medianinstead of adding anumpydependency.
numpyis a heavy dependency imported here solely fornp.medianon a tiny list (3–5 elements). Python's stdlibstatistics.mediandoes the same thing without the extra dependency.-import numpy as np +from statistics import medianThen replace
np.median(x_coords)→median(x_coords).
320-321:int(round(...))is redundant —round()already returnsintin Python 3.Same static analysis hint (RUF046) applies here.
round(np.median(x_coords))suffices, thoughnp.medianreturns anumpy.float64— you may wantround(float(np.median(...)))for a clean Pythonint.
286-286: Uselogger.debug()instead ofprint()for debug output.The file already configures a logger (line 17). The
print()calls on lines 286, 306, and 323 should uselogger.debug()for consistency and to allow log-level control in production.Also applies to: 306-306, 323-323
- Use non-zero temperature (0.3) for multi-sample voting to produce variance across samples (fixes CodeRabbit critical feedback) - Remove redundant int() wrapper around round() in _normalize_coordinates since round() returns int in Python 3
There was a problem hiding this comment.
Actionable comments posted: 2
🤖 Fix all issues with AI agents
In `@gui_agents/s3/agents/grounding.py`:
- Around line 229-243: The _parse_coordinates method currently picks the first
two numeric tokens which breaks when the model prefixes answers (e.g., "Point 1:
(523, 847)"); update _parse_coordinates to extract coordinates by preferring the
last two numeric tokens (or, better, by first searching for a parenthesized pair
like "(x, y)" and falling back to the last two numbers) so that functions like
_parse_coordinates use the final numeric pair instead of the first two; adjust
the parsing logic around the numericals extraction and the assignment to x,y
accordingly.
- Around line 295-312: The loop is normalizing (rounding) each sample via
_normalize_coordinates before aggregation, losing sub-pixel precision; change it
to append the raw float coords returned by _parse_coordinates into samples
(avoid calling _normalize_coordinates inside the loop), then after computing the
median of samples call _normalize_coordinates once on the median (or final
aggregated float coords) to produce the rounded output; update references in the
loop that currently call _normalize_coordinates and ensure downstream code still
expects the normalized result.
🧹 Nitpick comments (2)
gui_agents/s3/agents/grounding.py (2)
286-286: Useloggerinstead ofprint()for debug output.Lines 286, 307, and 324 use
print()while the rest of the file uses theloggerinstance (line 17). Usingprint()bypasses log-level filtering and log routing.Proposed fix
- print("RAW GROUNDING MODEL RESPONSE:", response) + logger.debug("RAW GROUNDING MODEL RESPONSE: %s", response)- print(f"GROUNDING SAMPLE {i+1}/{n_samples}: {response}") + logger.debug("GROUNDING SAMPLE %d/%d: %s", i + 1, n_samples, response)- print(f"MULTI-SAMPLE RESULT: samples={samples}, median=({final_x}, {final_y})") + logger.debug("MULTI-SAMPLE RESULT: samples=%s, median=(%d, %d)", samples, final_x, final_y)Also applies to: 307-307, 324-324
321-322:int(round(...))is redundant —round()already returnsintin Python 3.Static analysis (Ruff RUF046) flags this.
np.medianreturns a float, butround()on it yields anint.Proposed fix
- final_x = int(round(np.median(x_coords))) - final_y = int(round(np.median(y_coords))) + final_x = round(np.median(x_coords)) + final_y = round(np.median(y_coords))
- Parse coords using last two numbers (or parenthesized pair) to handle model prefixes like 'Point 1: (523, 847)' - Use < 1.0 heuristic instead of <= 1.0 to avoid treating tiny pixel coords as normalized - Preserve sub-pixel precision: aggregate raw floats in multi-sample loop, normalize once at end - Replace print() with logger.debug() for consistent logging
|
Hi @coderabbitai! Thanks for the detailed review. I wanted to clarify that the latest commit (234d60f) already addresses the issues you raised:
Let me know if you'd like any additional changes! |
|
Tip For best results, initiate chat on the files or code changes. 🧩 Analysis chain🏁 Script executed: Length of output: 32330
The implementation looks solid! The multi-sample voting approach with median aggregation should provide robust coordinate prediction for small UI elements while maintaining backward compatibility with the default |
Summary
Fixes #175 - Improves coordinate accuracy for small UI elements by implementing multi-sample voting.
Problem
When using Agent-S to click on small UI elements (icons, checkboxes, small buttons), the grounding model's coordinate predictions can be slightly off, causing click failures. This is due to:
Solution
This PR introduces multi-sample voting for coordinate generation:
n_samplesparameter togenerate_coords()(default=1 for backward compatibility)n_samples > 1, runs the grounding model multiple times and takes the median of predictionsChanges
gui_agents/s3/agents/grounding.py:_parse_coordinates()helper for robust coordinate extraction_normalize_coordinates()to handle 0-1 normalized coordsgenerate_coords()with optionaln_samplesparameterUsage
Trade-offs
Testing
Related
Summary by CodeRabbit
New Features
Improvements