`.agents/skills/cg-perf/SKILL.md` (73 additions, 0 deletions)

[…]

…reports `min/p50/p95/p99/MAX` plus per-stage breakdown and settle cost.

| Scenario | Config | Notes |
| ----------- | ----------------------------------- | ----- |
| `frameloop` | 16/50/80/120/200/300/500ms interval | **Real FrameLoop path** — the only bench that captures stable-frame jank during panning (see below) |
| `resize` | alternating viewport sizes | `--resize` flag. Measures `resize()` + `redraw()` cost per cycle (layout rebuild + cache invalidation + repaint) |

**SurfaceUI overlay measurement (`--overlay`):**

By default, benchmarks measure content rendering only — the SurfaceUI
overlay (frame titles, node badges, hit regions) is **not** included.
Pass `--overlay` to include overlay drawing after each content flush,
matching the real `Application::frame()` pipeline where
`draw_and_flush_devtools_overlay()` runs after `Renderer::flush()`.

```sh
# A/B test: content only vs content + overlay (MUST be sequential, never parallel)
cargo run -p grida-dev --release -- bench ./fixtures/test-grida/L0.grida --scene 22 --frames 200 && \
cargo run -p grida-dev --release -- bench ./fixtures/test-grida/L0.grida --scene 22 --frames 200 --overlay

# Bulk report with overlay
cargo run -p grida-dev --release -- bench-report ./fixtures/ --frames 100 --overlay --output overlay.json
```

The overlay cost is opt-in because it is a devtools feature, not user
content. Overlay cost scales with visible labeled nodes — viewport
culling skips off-screen labels, so zoomed-in views are nearly free.
At fit-zoom on large scenes (yrr-main, 437 labels visible), overlay
adds ~1.8ms per frame (paragraph layout dominates). At typical editing
zoom, the cost drops to ~190µs or less.

| Question | Flag |
| ---------------------------------------- | -------------- |
| Is content rendering fast enough? | (no flag) |
| Is the overlay adding visible latency? | `--overlay` |
| Compare content-only vs content+overlay? | Run both, diff |
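
For the last row, one way to "run both, diff", a sketch reusing only the flags shown above (output filenames are illustrative):

```sh
# sequential A/B, then compare the two reports
cargo run -p grida-dev --release -- bench-report ./fixtures/ --frames 100 --output content.json && \
cargo run -p grida-dev --release -- bench-report ./fixtures/ --frames 100 --overlay --output overlay.json
diff content.json overlay.json
```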

The `realtime` scenarios use actual `thread::sleep()` between frames
and simulate the native viewer's 240Hz tick thread + settle countdown.
These produce frame timings that match what users actually see, …

[…]

…of scenes, configs, and operations. The naming convention is …

| Question | How to check |
| -------------------------------------------- | ---------------------------------------- |
| Are there frame drops during gestures? | Check `p99` and `MAX` in scenario stats |
| Is slow panning janky (stable frame spikes)? | `frameloop` scenarios (real FrameLoop path) |
| Is resize janky? | Single-scene GPU bench with `--resize` |
| Is the SurfaceUI overlay causing slowdowns? | A/B with `--overlay` flag on GPU bench |

---

## The Verification Workflow

**Every performance change follows this sequence. No exceptions.**

**Critical: all GPU benchmarks must run sequentially.** Never run two
bench processes at the same time — GPU contention, CPU cache thrashing,
and memory bandwidth competition produce unreliable numbers. Chain A/B
runs with `&&`.

### Step 1: Baseline

Run the bulk benchmark report BEFORE any changes, and save the JSON
output for later comparison.
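
A minimal baseline sketch, reusing the flags from the examples above (`baseline.json` is an illustrative filename):

```sh
cargo run -p grida-dev --release -- bench-report ./fixtures/ --frames 100 --output baseline.json
```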
Expand Down Expand Up @@ -418,6 +454,29 @@ loading a scene.
These are failure modes learned from experience. Each one has caused
real bugs or wasted time.

### Never run GPU benchmarks in parallel

GPU benchmarks must run **sequentially**, one at a time. Running two
bench processes simultaneously on the same machine causes GPU pipeline
contention, CPU cache thrashing, and memory bandwidth competition —
all of which distort timing data. When doing A/B comparisons (e.g.
with/without `--overlay`, before/after an optimization), always chain
the runs with `&&`:

```sh
# CORRECT — sequential
cargo run -p grida-dev --release -- bench file.grida --scene 0 --frames 100 && \
cargo run -p grida-dev --release -- bench file.grida --scene 0 --frames 100 --overlay

# WRONG — parallel (results are unreliable)
cargo run ... --frames 100 &
cargo run ... --frames 100 --overlay &
```

This applies to all GPU benchmark invocations: `bench`, `bench-report`,
and any combination thereof. Criterion (CPU raster) is less sensitive
but should still be run alone for best accuracy.
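
A sketch of running the Criterion benches on their own; the package name is an assumption based on the repo layout:

```sh
# CPU raster benches, with no GPU bench process running at the same time
cargo bench -p grida-canvas
```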

### GPU and raster backends behave differently

An optimization that helps on GPU may hurt on raster, and vice versa.
Expand Down Expand Up @@ -482,6 +541,20 @@ frame gets a cache hit. Without recapture, every frame after settle
is also a full draw, producing 7fps instead of 100+fps. The capture
guard should be `if self.backend.is_gpu()` — NOT `if !plan.stable`.

### SurfaceUI overlay is not free

The SurfaceUI overlay (frame titles, node badges, hit regions) runs
after content `flush()` and requires a second GPU flush. The overlay
cost is dominated by Skia paragraph creation (one per visible label) —
viewport culling skips off-screen labels, and style objects are hoisted
out of the per-label loop. On scenes with many labeled nodes at
fit-zoom (e.g. yrr-main with 437 labels), the overlay adds ~1.8ms per
frame. At typical editing zoom, most labels are culled and cost drops
to ~190µs. Standard benchmarks exclude overlay by default — use
`--overlay` to include it. If the app feels slower after adding new
overlay features (badges, labels, decorations), use the A/B overlay
benchmark to quantify the regression before optimizing.

### Layout is the cold-start bottleneck, not rendering

For large documents (100K+ nodes), `load_scene` dominates cold start …

[…]

`.agents/skills/vision/SKILL.md` (171 additions, 0 deletions, new file)

---
name: vision
description: >
Query images with a local Ollama vision model without loading the image
into the main agent context. Use when you need to describe a screenshot,
check whether rendered content is present, detect overlapping elements, or
ask any visual question about a PNG/JPEG/WebP file. Requires Ollama running
locally with a vision-capable model (qwen3.5, gemma3, llava, etc.).
Script: .agents/skills/vision/scripts/ask.py.
Trigger phrases: "describe image", "what does this screenshot show",
"does the canvas contain content", "check screenshot visually",
"look at this image", "any overlapping elements", "vision query".
---

# Vision — Local Image Querying via Ollama

Ask natural-language questions about images without passing them to the main
agent as visual input. Useful for verifying screenshots, annotating assets,
or building automated checks around visual output.

## When to Use This Skill

- Describing a screenshot for a PR description or user-facing document
- Checking whether an automated browser run produced visible canvas content
- Asking "do any elements overlap?" on a rendered output
- Any question where the answer is in the pixels but you don't want to use
vision tokens in the main context

---

## Quick Reference

All commands use `uv run` — dependencies are installed automatically.

```sh
SCRIPT=.agents/skills/vision/scripts/ask.py

# health check (fast, no image, confirms Ollama + model respond)
uv run $SCRIPT --ping

# system info — memory, storage, installed models
uv run $SCRIPT --info
uv run $SCRIPT --memory
uv run $SCRIPT --storage

# describe an image (default prompt)
uv run $SCRIPT path/to/image.png

# explicit shortcut
uv run $SCRIPT path/to/image.png describe

# custom question
uv run $SCRIPT path/to/image.png \
--prompt "Do you see any overlapping UI elements?"

uv run $SCRIPT canvas.png \
--prompt "Does this canvas contain any designed content, or is it empty?"

# use a specific model
uv run $SCRIPT image.png --model gemma3

# list available vision-capable models
uv run $SCRIPT --list-models
```

---

## Prerequisites

Ollama must be running locally. The script connects to `http://localhost:11434`
and fails immediately if it cannot reach it.

```sh
# start Ollama (if not already running)
ollama serve

# install a vision model (first time, pick one)
ollama pull qwen3.5 # recommended — best low-cost vision as of 2026-03
ollama pull gemma3 # alternative
ollama pull llava # widely available fallback
```

The script **does not install models**. If no vision model is available it
prints the list of installed models and a `pull` suggestion, then exits.
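
Because failures exit non-zero, a simple guard works in automation (the fallback message is illustrative):

```sh
uv run $SCRIPT --ping || echo "vision unavailable; skipping visual checks"
```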

**`uv` is required** to run the script (handles dependency installation
automatically). No `requirements.txt` or manual `pip install` needed.

---

## Model Selection

When `--model` is not specified the script picks the first installed model
from this preference list (updated periodically):

| Priority | Model | Notes |
| -------- | ----------------- | ---------------------------------------- |
| 1 | `qwen3.5` | Best low-cost vision model as of 2026-03 |
| 2 | `qwen2.5vl` | Previous generation, still strong |
| 3 | `gemma3` | Strong alternative, multimodal |
| 4 | `llama3.2-vision` | Meta vision variant |
| 5 | `llava` | Widely installed fallback |

A model is considered vision-capable when its name contains: `qwen`, `gemma3`,
`vl`, `vision`, `llava`, `moondream`, `minicpm-v`, or similar fragments.
Non-vision text models are filtered out automatically.
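
A rough shell approximation of that filter, assuming standard `ollama list` output (the script itself does this in Python):

```sh
ollama list | grep -iE 'qwen|gemma3|vl|vision|llava|moondream|minicpm-v'
```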

---

## System Info

Before running a heavy query, check whether the machine has enough resources.
This is optional — the script does not enforce limits — but useful context
for deciding whether to proceed or skip.

```sh
uv run $SCRIPT --info # memory + storage + model list
uv run $SCRIPT --memory # just memory
uv run $SCRIPT --storage # just storage
```

Tip: on machines with ≤8 GB RAM, large vision models may cause swapping or
OOM. Consider using a smaller model variant or skipping the query.
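
For example, forcing a small variant explicitly (tag taken from the troubleshooting table below):

```sh
uv run $SCRIPT image.png --model qwen3.5:3b
```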

---

## Behavior

- **Fails fast** if Ollama is unreachable or no vision model is installed.
Exit code is non-zero; the error message includes a `hint` or `pull` command.
- **Sequential only** — Ollama is a single-worker process. Never call `ask.py`
  in parallel (e.g. two concurrent tool calls). Queue calls one at a time
  (see the sketch after this list).
- **No side effects** beyond the local Ollama process.
- **Auto-installs deps** via `uv` inline script metadata (PEP 723). Only
dependency is the `ollama` Python package.
- Supported formats: `.png`, `.jpg`, `.jpeg`, `.webp`, `.gif`, `.bmp`.
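
A minimal sketch of queued calls, chained with `&&` so they never overlap (file names and prompt are illustrative):

```sh
uv run $SCRIPT before.png --prompt "Answer YES or NO: is the canvas empty?" && \
uv run $SCRIPT after.png --prompt "Answer YES or NO: is the canvas empty?"
```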

---

## Typical Agent Workflow

1. A tool (browser automation, screenshot capture, golden renderer) writes
an image to disk.
2. Call `ask.py` with a targeted prompt suited to the task.
3. Parse the text response to decide the next action.

```sh
# Quick sanity check first
uv run $SCRIPT --ping

# Verify a browser screenshot has content before including it in a doc
uv run $SCRIPT /tmp/preview.png \
--prompt "Answer with YES or NO: does this screenshot show any visible UI content, shapes, or text?"

# Describe a golden render for a PR description
uv run $SCRIPT crates/grida-canvas/goldens/progressive_blur.png \
--prompt "Describe what visual effect is shown. Be specific about blur, colors, and shapes."
```

---

## Troubleshooting

| Symptom | Cause | Fix |
| -------------------------------- | -------------------------------- | ----------------------------------------- |
| `cannot reach Ollama` | Ollama not running | `ollama serve` |
| `no vision-capable models found` | Only text models installed | `ollama pull qwen3.5` |
| `model 'X' is not available` | Model name typo or not installed | `--list-models` to see what's installed |
| Slow response | Large model on CPU | Use a smaller variant (e.g. `qwen3.5:3b`) |
| Vague or wrong answer | Generic prompt | Write a more specific `--prompt` |
| `'ollama' package not found` | Not using `uv run` | Run with `uv run ask.py` instead |