Skip to content

fix: improve coordinate precision with multi-sample voting#177

Open
Chase-Xuu wants to merge 5 commits intosimular-ai:mainfrom
Chase-Xuu:fix/coordinate-precision-175
Open

fix: improve coordinate precision with multi-sample voting#177
Chase-Xuu wants to merge 5 commits intosimular-ai:mainfrom
Chase-Xuu:fix/coordinate-precision-175

Conversation

@Chase-Xuu
Copy link

@Chase-Xuu Chase-Xuu commented Feb 9, 2026

Summary

Fixes #175 - Improves coordinate accuracy for small UI elements by implementing multi-sample voting.

Problem

When using Agent-S to click on small UI elements (icons, checkboxes, small buttons), the grounding model's coordinate predictions can be slightly off, causing click failures. This is due to:

  1. Single-point prediction variance from the grounding model
  2. Resolution scaling amplifying small errors
  3. No awareness of target element size

Solution

This PR introduces multi-sample voting for coordinate generation:

  • Added n_samples parameter to generate_coords() (default=1 for backward compatibility)
  • When n_samples > 1, runs the grounding model multiple times and takes the median of predictions
  • Median is more robust to outliers than mean
  • Also refactored coordinate parsing to properly handle both integer and normalized (0-1) coordinate formats

Changes

  • gui_agents/s3/agents/grounding.py:
    • Added _parse_coordinates() helper for robust coordinate extraction
    • Added _normalize_coordinates() to handle 0-1 normalized coords
    • Extended generate_coords() with optional n_samples parameter
    • Added numpy import for median calculation

Usage

# Original behavior (backward compatible)
coords = aci.generate_coords(description, obs)

# Improved accuracy for small elements
coords = aci.generate_coords(description, obs, n_samples=3)

Trade-offs

  • Latency: Nx increase when using multi-sample (can be parallelized in future PR)
  • Cost: Nx API calls (acceptable trade-off for accuracy-critical tasks)

Testing

  • Tested with UI-TARS-1.5-7B on Linux
  • Backward compatible: existing code works without modification

Related

Summary by CodeRabbit

  • New Features

    • Multi-sample coordinate grounding with median aggregation for more robust localization.
    • Parsing and normalization of coordinates from varied model outputs (absolute and normalized).
  • Improvements

    • Better error handling when no valid samples are produced.
    • Enhanced debug logging and per-sample reporting to aid troubleshooting and validation.

@coderabbitai
Copy link

coderabbitai bot commented Feb 9, 2026

Note

Reviews paused

It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the reviews.auto_review.auto_pause_after_reviewed_commits setting.

Use the following commands to manage reviews:

  • @coderabbitai resume to resume automatic reviews.
  • @coderabbitai review to trigger a single review.

Use the checkboxes below for quick actions:

  • ▶️ Resume reviews
  • 🔍 Trigger review
📝 Walkthrough

Walkthrough

The grounding agent now adds structured coordinate parsing and normalization helpers, imports numpy, and updates generate_coords to accept n_samples for multi-sample coordinate generation, median aggregation, improved logging, and stricter error reporting when no valid samples are found.

Changes

Cohort / File(s) Summary
Grounding agent updates
gui_agents/s3/agents/grounding.py
Imported numpy as np. Added _parse_coordinates(response: str) -> Optional[Tuple[float, float]] and _normalize_coordinates(x: float, y: float) -> Tuple[int, int]. Updated generate_coords(self, ref_expr: str, obs: Dict, n_samples: int = 1) -> List[int] to support single- and multi-sample flows, collect valid samples, compute median with numpy.median, normalize to pixel coordinates, log raw/multi-sample results, and raise ValueError when no valid samples are obtained. Replaced prior simplistic extraction and prints with structured parsing and normalization logic.

Sequence Diagram(s)

sequenceDiagram
    participant Caller as Caller
    participant Agent as OSWorldACI
    participant Model as GroundingModel
    participant Image as Image/Normalizer

    Note over Agent,Model: n_samples >= 1
    Caller->>Agent: generate_coords(ref_expr, obs, n_samples)
    Agent->>Model: reset()/send_prompt(ref_expr, obs)
    Model-->>Agent: raw_response (text with coords)
    loop per sample (if n_samples > 1)
        Agent->>Agent: _parse_coordinates(raw_response)
        Agent->>Agent: collect valid (x,y)
    end
    Agent->>Agent: aggregate samples (numpy.median) if multiple
    Agent->>Image: _normalize_coordinates(x,y)
    Image-->>Agent: (x_pixel,y_pixel)
    Agent-->>Caller: [x_pixel, y_pixel]
Loading

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

Poem

🐰 I nibbled lines of code at night,

Parsed tuples by soft starlight.
Samples hopped and joined the dance,
Median settled, gave pixels a chance.
🥕🌙

🚥 Pre-merge checks | ✅ 4 | ❌ 1
❌ Failed checks (1 warning)
Check name Status Explanation Resolution
Docstring Coverage ⚠️ Warning Docstring coverage is 60.00% which is insufficient. The required threshold is 80.00%. Write docstrings for the functions missing them to satisfy the coverage threshold.
✅ Passed checks (4 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Title check ✅ Passed The title accurately summarizes the main change: implementing multi-sample voting to improve coordinate precision.
Linked Issues check ✅ Passed The changes directly address issue #175 by implementing multi-sample voting and improved coordinate parsing to reduce offset and variance for small UI elements.
Out of Scope Changes check ✅ Passed All changes are scoped to coordinate precision improvements; numpy import, helper methods, and n_samples parameter are directly related to the multi-sample voting objective.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing touches
  • 📝 Generate docstrings
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Post copyable unit tests in a comment

No actionable comments were generated in the recent review. 🎉

🧹 Recent nitpick comments
gui_agents/s3/agents/grounding.py (2)

229-256: Coordinate parsing looks solid; minor regex tightening possible.

The logic correctly prioritizes parenthesized pairs and falls back to last-two-numbers heuristic, addressing previous review feedback well. One small observation: the capture group [\d.]+ on line 239 is looser than the fallback regex — it would match strings like 4.5.6 as a single token. In practice this is unlikely from model output, but you could reuse the same precise pattern for consistency.

Optional tightening
-        paren_match = re.search(r"[\(\[]\s*([\d.]+)\s*,\s*([\d.]+)\s*[\)\]]", response)
+        paren_match = re.search(r"[\(\[]\s*(\d+\.?\d*|\.\d+)\s*,\s*(\d+\.?\d*|\.\d+)\s*[\)\]]", response)

313-327: Partial sample failure is silent — consider logging a warning.

If some samples fail to parse (e.g., 2 out of 5), the code silently proceeds with fewer samples. A warning log would help users understand reduced confidence.

Proposed addition after the loop
         if not samples:
             raise ValueError(f"Failed to get valid coordinates after {n_samples} attempts")
+        
+        if len(samples) < n_samples:
+            logger.warning("Only %d/%d samples produced valid coordinates", len(samples), n_samples)

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

Copy link

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 2

🤖 Fix all issues with AI agents
In `@gui_agents/s3/agents/grounding.py`:
- Line 237: The regex used to extract numbers in the assignment to variable
"numericals" (re.findall(...)) won't match decimals missing a leading zero like
".45"; change the pattern to one that accepts optional leading digits and
optional sign and requires at least one digit overall (for example a pattern
equivalent to r"[-+]?(?:\d*\.\d+|\d+)") so that re.findall used in that line
will capture integers, decimals with or without a leading zero, and optional
signs.
- Around line 252-257: The coordinate-scaling block that checks "if x <= 1.0 and
y <= 1.0:" should be made stricter and the rounding simplified: change the
heuristic to "if x < 1.0 and y < 1.0:" to avoid treating tiny pixel coordinates
like (1,0) as normalized, scale with grounding_width/grounding_height when that
condition holds, and return rounded integer coordinates using round(x) and
round(y) (e.g., return (round(x), round(y))). Update references to the variables
x, y, grounding_width, grounding_height and the final return in the grounding
coordinate conversion code; optionally consider wiring an explicit
"is_normalized" flag from model/config later instead of relying on this
heuristic.
🧹 Nitpick comments (4)
gui_agents/s3/agents/grounding.py (4)

277-325: Code duplication between single-sample and multi-sample paths.

The single-sample branch (lines 277–293) duplicates the prompt construction, model call, parsing, and normalization from the multi-sample branch (lines 298–311). The multi-sample path already handles n_samples=1 correctly (median of one sample = that sample), so the special case is unnecessary and increases maintenance burden.

Suggested consolidation
-        if n_samples <= 1:
-            # Original single-sample behavior for backward compatibility
-            self.grounding_model.reset()
-            prompt = f"Query:{ref_expr}\nOutput only the coordinate of one point in your response.\n"
-            self.grounding_model.add_message(
-                text_content=prompt, image_content=obs["screenshot"], put_text_last=True
-            )
-            
-            response = call_llm_safe(self.grounding_model)
-            print("RAW GROUNDING MODEL RESPONSE:", response)
-            
-            coords = self._parse_coordinates(response)
-            if coords is None:
-                raise ValueError(f"Failed to parse coordinates from response: {response}")
-            
-            x, y = self._normalize_coordinates(coords[0], coords[1])
-            return [x, y]
-        
-        # Multi-sample voting for improved accuracy
+        n_samples = max(1, n_samples)
         samples = []

286-286: Use logger instead of print() for debug output.

The rest of the file uses the logger instance (line 17). These print() calls should use logger.debug() or logger.info() for consistency and to allow callers to control verbosity.

Also applies to: 306-306, 323-323


440-440: Existing callers of generate_coords don't pass n_samples—the feature is opt-in only.

All call sites (click, type, drag_and_drop, scroll) use the default n_samples=1. This is fine for backward compatibility, but means no caller currently benefits from the new feature. Consider whether a class-level default (e.g., self.default_n_samples set in __init__) would be more practical, letting users configure it once rather than modifying every action method.

Also applies to: 519-519, 563-564, 692-692


6-6: Replace np.median with statistics.median to avoid unnecessary import.

The file only uses numpy for np.median() at lines 320-321. Python's stdlib statistics.median handles this identically and removes the import:

from statistics import median
final_x = int(round(median(x_coords)))
final_y = int(round(median(y_coords)))

While numpy is a direct project dependency, this change improves code clarity by removing an unnecessary import from this specific module.

Copy link

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 4

🤖 Fix all issues with AI agents
In `@ANALYSIS.md`:
- Around line 1-147: The ANALYSIS.md planning/analysis document should not be
committed to the repo root; remove the file from the PR (or revert the commit
that added it) and move its content into the PR description or a linked GitHub
issue instead, then update the branch/commit to reflect the removal; also fix
the fenced code block inside that document by adding the appropriate language
specifier (e.g., ```python or ```text) so markdownlint MD040 is resolved.

In `@grounding_improved.py`:
- Around line 1-190: This prototype file should be removed from the PR: delete
grounding_improved.py because it duplicates functionality already merged into
gui_agents/s3/agents/grounding.py and contains multiple defects (unbound free
functions generate_coords and generate_coords_with_ocr_fallback that take self,
use of undefined call_llm_safe, a deficient regex in _parse_coordinates that
misses “.45” style numbers, and an unused unpacked variable ocr_table); remove
the file from the commit/PR (or move only a cleaned, vetted subset into the
canonical grounding.py), or if you intend to keep any logic, integrate/fix it
inside gui_agents/s3/agents/grounding.py by binding methods to the agent class,
importing/using call_llm_safe, updating the regex to include |\.\d+, and
removing the unused ocr_table variable.

In `@gui_agents/s3/agents/grounding.py`:
- Line 260: The generate_coords(ref_expr: str, obs: Dict, n_samples: int = 1)
function’s n_samples argument is never exercised because all callers (click,
type, drag_and_drop, scroll) call it with only ref_expr and obs; fix by wiring
n_samples through: either add an instance attribute on OSWorldACI (e.g.,
self.n_samples set from CLI/config) and have each caller call
generate_coords(..., n_samples=self.n_samples), or add an explicit n_samples
parameter to the caller methods (click, type, drag_and_drop, scroll) and forward
that value into generate_coords; update the OSWorldACI constructor or call sites
to source the configured sample count so multi-sample voting is actually used.

In `@README_CONTRIBUTION.md`:
- Around line 1-101: README_CONTRIBUTION.md contains personal notes and PII and
must be removed from the PR; delete the file from the branch (git rm
README_CONTRIBUTION.md && commit) so it is not included in the PR, ensure no
other commits reintroduce it, and if you need local notes keep them outside the
repo or add the filename to .gitignore/history cleanup if necessary; confirm the
PR diff no longer lists README_CONTRIBUTION.md.
🧹 Nitpick comments (1)
gui_agents/s3/agents/grounding.py (1)

286-286: Use logger instead of print() for grounding debug output.

The file already sets up logger = logging.getLogger("desktopenv.agent") at Line 17, but the new code uses raw print() calls for debug output (Lines 286, 306, 323). This is inconsistent with the rest of the codebase and makes it impossible to control log verbosity.

Proposed fix
-            print("RAW GROUNDING MODEL RESPONSE:", response)
+            logger.debug("RAW GROUNDING MODEL RESPONSE: %s", response)
-            print(f"GROUNDING SAMPLE {i+1}/{n_samples}: {response}")
+            logger.debug("GROUNDING SAMPLE %d/%d: %s", i + 1, n_samples, response)
-        print(f"MULTI-SAMPLE RESULT: samples={samples}, median=({final_x}, {final_y})")
+        logger.debug("MULTI-SAMPLE RESULT: samples=%s, median=(%d, %d)", samples, final_x, final_y)

Also applies to: 306-306, 323-323

assert len(numericals) >= 2
return [int(numericals[0]), int(numericals[1])]
# Given the state and worker's referring expression, use the grounding model to generate (x,y)
def generate_coords(self, ref_expr: str, obs: Dict, n_samples: int = 1) -> List[int]:
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟡 Minor

🧩 Analysis chain

🏁 Script executed:

cd gui_agents/s3/agents && rg -n 'generate_coords\(' --type=py -A 1

Repository: simular-ai/Agent-S

Length of output: 914


🏁 Script executed:

cd gui_agents/s3/agents && rg -n 'def (click|type|drag_and_drop|scroll)\(' grounding.py

Repository: simular-ai/Agent-S

Length of output: 211


🏁 Script executed:

cd gui_agents && rg -n 'generate_coords\(' --type=py

Repository: simular-ai/Agent-S

Length of output: 1403


n_samples parameter is unreachable—all callers use the default value.

All five callers of generate_coords() in this file (click at line 440, type at line 519, drag_and_drop at lines 563–564, and scroll at line 692) pass only ref_expr and obs arguments, always defaulting to n_samples=1. The multi-sample voting feature cannot be exercised without wiring the parameter to a configurable source.

Consider either:

  • Storing n_samples as an instance attribute on OSWorldACI (set from CLI args / config) and using it in each caller, or
  • Passing n_samples through from the callers.
🤖 Prompt for AI Agents
In `@gui_agents/s3/agents/grounding.py` at line 260, The generate_coords(ref_expr:
str, obs: Dict, n_samples: int = 1) function’s n_samples argument is never
exercised because all callers (click, type, drag_and_drop, scroll) call it with
only ref_expr and obs; fix by wiring n_samples through: either add an instance
attribute on OSWorldACI (e.g., self.n_samples set from CLI/config) and have each
caller call generate_coords(..., n_samples=self.n_samples), or add an explicit
n_samples parameter to the caller methods (click, type, drag_and_drop, scroll)
and forward that value into generate_coords; update the OSWorldACI constructor
or call sites to source the configured sample count so multi-sample voting is
actually used.

Addresses simular-ai#175 - coordinate offset problem for small buttons.

Changes:
- Add n_samples parameter to generate_coords() for multi-sample voting
- Extract coordinate parsing into _parse_coordinates() helper
- Handle both integer and normalized (0-1) coordinate formats
- Use median for robustness to outliers
- Fully backward compatible (n_samples=1 by default)

The multi-sample approach reduces variance by running the grounding model
multiple times and taking the median, which significantly improves click
accuracy for small UI elements.
Address review feedback: regex now matches numbers like '.45' in addition to '0.45' and integers.
These files were accidentally committed - they're personal notes
and prototype code that don't belong in the repository.
@Chase-Xuu Chase-Xuu force-pushed the fix/coordinate-precision-175 branch from ce98677 to 638a56a Compare February 9, 2026 23:32
Copy link

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

🤖 Fix all issues with AI agents
In `@gui_agents/s3/agents/grounding.py`:
- Around line 298-311: The multi-sample loop in the grounding routine is
ineffective because call_llm_safe defaults to temperature=0.0; update the loop
that uses self.grounding_model and call_llm_safe so multi-sampling actually
varies outputs by supplying a non-zero temperature (e.g., 0.3–0.7) when invoking
the model, or alternatively call the grounding model's own get_response /
inference method with a non-zero temperature; keep the rest of the logic that
parses coordinates via _parse_coordinates and normalizes them with
_normalize_coordinates and appends to samples unchanged.
🧹 Nitpick comments (4)
gui_agents/s3/agents/grounding.py (4)

277-325: Significant code duplication between single-sample and multi-sample paths.

The single-sample path (lines 277–293) is essentially the body of the multi-sample loop (lines 298–311). Consider unifying them — the single-sample case is just n_samples=1 with the median of one sample being itself.

Sketch of unified approach
-        if n_samples <= 1:
-            # Original single-sample behavior for backward compatibility
-            self.grounding_model.reset()
-            prompt = f"Query:{ref_expr}\nOutput only the coordinate of one point in your response.\n"
-            self.grounding_model.add_message(
-                text_content=prompt, image_content=obs["screenshot"], put_text_last=True
-            )
-            
-            response = call_llm_safe(self.grounding_model)
-            print("RAW GROUNDING MODEL RESPONSE:", response)
-            
-            coords = self._parse_coordinates(response)
-            if coords is None:
-                raise ValueError(f"Failed to parse coordinates from response: {response}")
-            
-            x, y = self._normalize_coordinates(coords[0], coords[1])
-            return [x, y]
-        
-        # Multi-sample voting for improved accuracy
+        n_samples = max(1, n_samples)
         samples = []
-        
+
         for i in range(n_samples):
             self.grounding_model.reset()
             prompt = f"Query:{ref_expr}\nOutput only the coordinate of one point in your response.\n"
             ...
-            
+
             coords = self._parse_coordinates(response)
             if coords is not None:
                 x, y = self._normalize_coordinates(coords[0], coords[1])
                 samples.append((x, y))
-        
+
         if not samples:
             raise ValueError(f"Failed to get valid coordinates after {n_samples} attempts")
-        
+
         x_coords = [s[0] for s in samples]
         y_coords = [s[1] for s in samples]
-        
-        final_x = int(round(np.median(x_coords)))
-        final_y = int(round(np.median(y_coords)))
+        final_x = round(float(np.median(x_coords)))
+        final_y = round(float(np.median(y_coords)))
         return [final_x, final_y]

6-6: Consider using statistics.median instead of adding a numpy dependency.

numpy is a heavy dependency imported here solely for np.median on a tiny list (3–5 elements). Python's stdlib statistics.median does the same thing without the extra dependency.

-import numpy as np
+from statistics import median

Then replace np.median(x_coords)median(x_coords).


320-321: int(round(...)) is redundant — round() already returns int in Python 3.

Same static analysis hint (RUF046) applies here. round(np.median(x_coords)) suffices, though np.median returns a numpy.float64 — you may want round(float(np.median(...))) for a clean Python int.


286-286: Use logger.debug() instead of print() for debug output.

The file already configures a logger (line 17). The print() calls on lines 286, 306, and 323 should use logger.debug() for consistency and to allow log-level control in production.

Also applies to: 306-306, 323-323

- Use non-zero temperature (0.3) for multi-sample voting to produce
  variance across samples (fixes CodeRabbit critical feedback)
- Remove redundant int() wrapper around round() in _normalize_coordinates
  since round() returns int in Python 3
Copy link

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 2

🤖 Fix all issues with AI agents
In `@gui_agents/s3/agents/grounding.py`:
- Around line 229-243: The _parse_coordinates method currently picks the first
two numeric tokens which breaks when the model prefixes answers (e.g., "Point 1:
(523, 847)"); update _parse_coordinates to extract coordinates by preferring the
last two numeric tokens (or, better, by first searching for a parenthesized pair
like "(x, y)" and falling back to the last two numbers) so that functions like
_parse_coordinates use the final numeric pair instead of the first two; adjust
the parsing logic around the numericals extraction and the assignment to x,y
accordingly.
- Around line 295-312: The loop is normalizing (rounding) each sample via
_normalize_coordinates before aggregation, losing sub-pixel precision; change it
to append the raw float coords returned by _parse_coordinates into samples
(avoid calling _normalize_coordinates inside the loop), then after computing the
median of samples call _normalize_coordinates once on the median (or final
aggregated float coords) to produce the rounded output; update references in the
loop that currently call _normalize_coordinates and ensure downstream code still
expects the normalized result.
🧹 Nitpick comments (2)
gui_agents/s3/agents/grounding.py (2)

286-286: Use logger instead of print() for debug output.

Lines 286, 307, and 324 use print() while the rest of the file uses the logger instance (line 17). Using print() bypasses log-level filtering and log routing.

Proposed fix
-            print("RAW GROUNDING MODEL RESPONSE:", response)
+            logger.debug("RAW GROUNDING MODEL RESPONSE: %s", response)
-            print(f"GROUNDING SAMPLE {i+1}/{n_samples}: {response}")
+            logger.debug("GROUNDING SAMPLE %d/%d: %s", i + 1, n_samples, response)
-        print(f"MULTI-SAMPLE RESULT: samples={samples}, median=({final_x}, {final_y})")
+        logger.debug("MULTI-SAMPLE RESULT: samples=%s, median=(%d, %d)", samples, final_x, final_y)

Also applies to: 307-307, 324-324


321-322: int(round(...)) is redundant — round() already returns int in Python 3.

Static analysis (Ruff RUF046) flags this. np.median returns a float, but round() on it yields an int.

Proposed fix
-        final_x = int(round(np.median(x_coords)))
-        final_y = int(round(np.median(y_coords)))
+        final_x = round(np.median(x_coords))
+        final_y = round(np.median(y_coords))

- Parse coords using last two numbers (or parenthesized pair) to handle model prefixes like 'Point 1: (523, 847)'
- Use < 1.0 heuristic instead of <= 1.0 to avoid treating tiny pixel coords as normalized
- Preserve sub-pixel precision: aggregate raw floats in multi-sample loop, normalize once at end
- Replace print() with logger.debug() for consistent logging
@Chase-Xuu
Copy link
Author

Hi @coderabbitai! Thanks for the detailed review. I wanted to clarify that the latest commit (234d60f) already addresses the issues you raised:

  1. Multi-sample normalization: The code now collects raw float coordinates, computes the median, and normalizes only once at the end. See lines 318-341 — we store raw floats in samples, compute np.median() on them, then call _normalize_coordinates() once.

  2. First vs last two numbers: Already fixed — the parser now tries parenthesized pairs (x, y) first, then falls back to taking the last two numbers (numericals[-2], numericals[-1]) to handle prefixes like Point 1:.

  3. Temperature for variance: Already using temperature=0.3 for multi-sample calls (line 321).

Let me know if you'd like any additional changes!

@coderabbitai
Copy link

coderabbitai bot commented Feb 11, 2026

Tip

For best results, initiate chat on the files or code changes.

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash
# Fetch the current state of the grounding.py file to verify user's claims
cat gui_agents/s3/agents/grounding.py

Length of output: 32330


@Chase-Xuu, thank you for the clarification! I've verified the code and can confirm that all three issues have indeed been properly addressed:

  1. Multi-sample normalization (lines 318-341): The implementation correctly collects raw float coordinates in the samples list (line 330), computes the median on the raw floats (lines 346-347), and then calls _normalize_coordinates() exactly once at the end (line 350). This preserves sub-pixel precision throughout the aggregation process.

  2. Last two numbers (line 274): The parser correctly uses numericals[-2] and numericals[-1] to extract the last two numeric values, which properly handles prefixes like "Point 1: (523, 847)".

  3. Temperature=0.3 (line 321): Confirmed that temperature=0.3 is passed to call_llm_safe() for multi-sample calls, providing the variance needed for effective voting.

The implementation looks solid! The multi-sample voting approach with median aggregation should provide robust coordinate prediction for small UI elements while maintaining backward compatibility with the default n_samples=1 behavior. Great work on this enhancement! 🎉

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

The problem of coordinate offset

1 participant