Structured browser context capture for AI coding tools
Talk, draw, and click in your browser. PointDev captures the technical context and your intent, then compiles it into a structured prompt any AI agent can act on.
Getting Started · How It Works · Features · Roadmap · Contributing
You spot a problem in your browser. You switch to Claude Code and type "the hero font is too small." The agent has to guess which file, which component, what the current size is, and what "too small" means. The DOM path, component name, computed styles, and your spatial intent are all lost.
Browser automation gives AI agents eyes. PointDev gives humans a voice.
PointDev captures technical context (element selector, DOM subtree, computed styles, React component name, device metadata) and human context (timestamped voice narration, canvas annotations, cursor dwell behavior) simultaneously, then compiles everything into structured output any downstream tool can consume.
Key highlights:
- Talk, draw, click all at the same time during a single capture session
- Five annotation tools — circle, arrow, freehand, rectangle, and element selection
- Temporal correlation links what you're pointing at with what you're saying
- Smart screenshots auto-captured when you talk, dwell, or scroll — powered by real-time frame differencing
- Console + network capture catches errors and failed requests alongside your feedback
- Works on any web page, with opportunistic React component detection
- Copy and paste the structured output into any AI coding tool
This is actual output from PointDev, captured on a live site:
```
## Context
- URL: https://almostalab.io/
- Page title: Almost A Lab — Building the Future of AdTech
- Viewport: 1677 x 1145px
- Captured at: 2026-03-15 15:12:20
## User Intent (voice transcript)
[00:04] "I think the"
[00:06] "main hero"
[00:09] "is far too"
[00:11] "large"
[00:15] "and we need to adjust the line breaking"
[00:19] "two or three lines at Max"
[00:23] "and there is a problem at the bottom here"
[00:26] "where you scroll CTA"
[00:32] "overlapping with the"
[00:35] "subtitle of the page"
## Annotations
1. [00:40] Circle around .lg\:min-h-\[100svh\] at (42, 1019), radius 131px
## Cursor Behavior
- [00:07-00:32] Dwelled 25.5s over h1.font-display.text-hero (during: "main hero")
- [00:25-00:32] Dwelled 7.3s over div.absolute.bottom-10 (during: "scroll CTA")
- [00:29-00:36] Dwelled 6.2s over div.absolute.bottom-10 (during: "overlapping with the")
## Screenshots
1. [00:11] Multiple signals — "large"
Signals: visual change: 12%, dwell: h1.font-display.text-hero (4.2s), voice: "large" [score: 0.9]
2. [00:26] Multiple signals — "where you scroll CTA"
Signals: visual change: 31%, dwell: div.absolute.bottom-10 (7.3s), voice: "where you scroll CTA" [score: 0.8]
3. [00:40] Circle around .lg\:min-h-\[100svh\] — "subtitle of the page"
Signals: voice: "subtitle of the page" [score: 1.00]
```
We pasted PointDev output into a Claude Code session managing the live website. Without any prior context, the agent:
- Identified three UI issues from the voice transcript, annotations, and cursor dwell data
- Mapped each issue to specific elements (the `h1.font-display` hero heading, the `div.absolute.bottom-10` overlap, the padding mismatch)
- Offered to fix all three immediately, asking only "Want me to look at the Hero component and fix these alignment/spacing issues?"
The agent also described what would make the output even more actionable:
"The ideal output for an agent is: screenshot + voice intent + source file:line + computed styles on each annotation target. That's a one-shot fix with no exploration needed."
"Loom for humans, annotated screenshots + structured metadata for agents. Same capture session, two output formats."
That feedback is now driving our roadmap.
Status: Proof of Concept. Working demo, not a production release.
Requires Bun and Chrome.
```sh
git clone https://github.com/BraedenBDev/pointdev.git
cd pointdev
bun install
bun build
```

- Open `chrome://extensions/` and enable Developer Mode
- Click **Load unpacked** and select the `dist/` folder
- Open any web page, click the PointDev icon to open the sidepanel
- Click **Setup Microphone** if you want voice narration (one-time; the tab auto-closes)
- Chrome will prompt to approve host permissions on first use (needed for smart screenshots)
```mermaid
flowchart LR
subgraph Browser Tab
CS[Content Script]
MW[Main World<br/>Console/Network]
end
subgraph Extension
SP[Sidepanel<br/>React UI + Speech]
SI[Screenshot<br/>Intelligence]
SW[Service Worker<br/>State + Capture]
end
SP -- START_CAPTURE --> SW
SW -- INJECT_CAPTURE --> CS
CS -- element, annotation,<br/>cursor --> SW
MW -- CustomEvent --> CS
CS -- CONSOLE_BATCH --> SW
SI -- SNAPSHOT_REQUEST --> SW
SW -- DWELL_UPDATE --> SI
SP -. voice signal .-> SI
SI -- SMART_SCREENSHOT --> SW
SW -- SESSION_UPDATED --> SP
```
Sidepanel (React): Capture controls, live feedback, voice transcription (Web Speech API runs here directly), screenshot thumbnails with copy-to-clipboard, compiled output display.
Service Worker: Coordinates state between sidepanel and content script. Holds the CaptureSession, routes messages, captures screenshots via captureVisibleTab (both full-resolution for storage and low-quality JPEG for frame differencing), runs real-time dwell detection, injects main-world console/network capture script.
Content Script: Injected into the active page. Handles element selection (with Alt+scroll ancestry cycling), canvas annotation overlay (circle, arrow, freehand, rectangle), cursor tracking, React component detection, and CSS variable discovery.
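The Alt+scroll ancestry cycling can be sketched as a small state machine: scrolling up walks selection to the parent element, scrolling down steps back into the child you came from. This is a hypothetical illustration (the `TreeNode` shape stands in for a DOM element), not the extension's actual implementation:

```typescript
// Hypothetical sketch of Alt+scroll ancestry cycling. `TreeNode` is a
// minimal stand-in for a DOM element with a parent pointer.
interface TreeNode {
  tag: string;
  parent: TreeNode | null;
}

class AncestryCycler {
  private trail: TreeNode[] = []; // children we walked out of, for descending again

  constructor(private current: TreeNode) {}

  get selected(): TreeNode {
    return this.current;
  }

  // Alt+scroll up: move selection to the parent, remembering where we were.
  up(): TreeNode {
    if (this.current.parent) {
      this.trail.push(this.current);
      this.current = this.current.parent;
    }
    return this.current;
  }

  // Alt+scroll down: descend back into the most recently visited child.
  down(): TreeNode {
    const child = this.trail.pop();
    if (child) this.current = child;
    return this.current;
  }
}
```

Keeping a trail of visited children (rather than guessing among siblings) makes the down direction deterministic: you always return along the path you climbed.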
Screenshot Intelligence (Sidepanel): Requests low-quality JPEG snapshots from the service worker every 2 seconds and compares them at 160x90 resolution via sparse pixel differencing. Combines frame diff results with cursor dwell and voice activity signals to produce a weighted interest score. Screenshots are captured when the score exceeds a threshold, and always on annotations (with a render delay so the canvas overlay is visible). Voice context is retained for 5 seconds after speech ends to ensure nearby captures carry relevant narration.
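The scoring idea reduces to a few pure functions. A minimal sketch, where the weights, sampling stride, tolerance, and threshold are illustrative assumptions rather than PointDev's actual tuning: sample every Nth pixel of two downscaled frames, compute the fraction that changed, and blend that with dwell and voice signals into one interest score.

```typescript
// Illustrative sketch of multi-signal screenshot scoring. All constants
// (stride, tolerance, weights, threshold) are assumptions for demonstration.
type Frame = Uint8ClampedArray; // RGBA pixels of a downscaled (e.g. 160x90) frame

// Fraction of sampled pixels whose R/G/B delta exceeds a tolerance.
function sparseFrameDiff(a: Frame, b: Frame, stride = 16, tolerance = 24): number {
  let sampled = 0;
  let changed = 0;
  for (let i = 0; i < Math.min(a.length, b.length); i += 4 * stride) {
    sampled++;
    if (
      Math.abs(a[i] - b[i]) > tolerance ||         // R
      Math.abs(a[i + 1] - b[i + 1]) > tolerance || // G
      Math.abs(a[i + 2] - b[i + 2]) > tolerance    // B
    ) {
      changed++;
    }
  }
  return sampled === 0 ? 0 : changed / sampled;
}

// Blend frame diff with normalized dwell and voice-activity signals.
function interestScore(frameDiff: number, dwellSeconds: number, voiceActive: boolean): number {
  const dwell = Math.min(dwellSeconds / 5, 1); // saturate at 5s (assumed)
  return 0.5 * frameDiff + 0.3 * dwell + 0.2 * (voiceActive ? 1 : 0);
}

const CAPTURE_THRESHOLD = 0.4; // assumed

function shouldCapture(frameDiff: number, dwellSeconds: number, voiceActive: boolean): boolean {
  return interestScore(frameDiff, dwellSeconds, voiceActive) >= CAPTURE_THRESHOLD;
}
```

Sparse sampling is the point: comparing every 16th pixel of a 160x90 frame touches ~900 pixels, cheap enough to run every couple of seconds without noticeable overhead.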
Main World Script: Injected into the page's JavaScript world via chrome.scripting.executeScript({ world: 'MAIN' }). Monkey-patches console.error/warn, fetch, and XMLHttpRequest to capture errors and failed requests. Bridges data back to the content script via CustomEvent.
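The monkey-patching pattern looks roughly like this. A simplified, hypothetical sketch wrapping `console.error` and `fetch` only (the real script also patches `console.warn` and `XMLHttpRequest`, and bridges each entry back to the content script via `CustomEvent` rather than a callback):

```typescript
// Simplified sketch of main-world capture via monkey-patching.
// `sink` stands in for the CustomEvent bridge to the content script.
interface CapturedEntry {
  kind: "console.error" | "network";
  detail: string;
}

function installCapture(sink: (entry: CapturedEntry) => void): void {
  // Wrap console.error, preserving the original behavior.
  const originalError = console.error.bind(console);
  console.error = (...args: unknown[]) => {
    sink({ kind: "console.error", detail: args.map(String).join(" ") });
    originalError(...args);
  };

  // Wrap fetch to record non-OK responses, passing the result through untouched.
  const originalFetch = globalThis.fetch?.bind(globalThis);
  if (originalFetch) {
    globalThis.fetch = async (...args: Parameters<typeof fetch>) => {
      const response = await originalFetch(...args);
      if (!response.ok) {
        sink({ kind: "network", detail: `${response.status} ${String(args[0])}` });
      }
      return response;
    };
  }
}
```

Binding and calling through the originals is what keeps the patch transparent: page code sees identical behavior, and the capture is purely additive.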
All capture data flows into a single CaptureSession object with timestamps relative to recording start. A template formatter compiles this into structured output.
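The shape of that session object and the formatter can be sketched as follows. Field names here are illustrative assumptions, not the real types in `src/shared/`:

```typescript
// Hypothetical sketch of a capture session and its template formatter.
// Timestamps are milliseconds relative to recording start.
interface VoiceSegment { t: number; text: string; }
interface Annotation { t: number; tool: "circle" | "arrow" | "freehand" | "rectangle"; selector: string; }

interface CaptureSession {
  url: string;
  title: string;
  voice: VoiceSegment[];
  annotations: Annotation[];
}

// Format milliseconds-from-start as [mm:ss].
const mmss = (ms: number): string => {
  const s = Math.floor(ms / 1000);
  return `${String(Math.floor(s / 60)).padStart(2, "0")}:${String(s % 60).padStart(2, "0")}`;
};

// Compile a session into the structured text a downstream agent consumes.
function compile(session: CaptureSession): string {
  const lines = [
    "## Context",
    `- URL: ${session.url}`,
    `- Page title: ${session.title}`,
    "## User Intent (voice transcript)",
    ...session.voice.map((v) => `[${mmss(v.t)}] "${v.text}"`),
    "## Annotations",
    ...session.annotations.map((a, i) => `${i + 1}. [${mmss(a.t)}] ${a.tool} around ${a.selector}`),
  ];
  return lines.join("\n");
}
```

Keeping all timestamps relative to a single recording-start epoch is what makes the later correlation steps trivial: every signal lives on the same timeline.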
Technical context (captured automatically):
| Feature | Description |
|---|---|
| CSS selector + DOM subtree | Click any element to capture its selector and surrounding HTML |
| Computed styles + box model | font-size, color, spacing, display, position, content/padding/border/margin dimensions |
| CSS custom properties | Discovers --variable declarations from matching stylesheet rules |
| React component detection | Resolves component name via __reactFiber$ internals |
| Console errors + network failures | Captures console.error/warn, failed fetch/XHR, uncaught exceptions, unhandled rejections |
| Page + device metadata | URL, title, viewport, browser, OS, screen size, pixel ratio, touch, color scheme |
| Cursor dwell tracking | Records which elements you hover over and for how long |
| Smart screenshots | Auto-captured by multi-signal intelligence: frame diff (CV), cursor dwell, voice activity, annotations |
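The React detection in the table above works because React DOM attaches a fiber reference to each host element under a randomized property key. A hedged sketch of the idea (the `__reactFiber$` prefix is an undocumented React internal and can change between versions):

```typescript
// Sketch of resolving a React component name from fiber internals.
// Relies on undocumented React DOM behavior: each host element carries a
// property named `__reactFiber$<random>` pointing into the fiber tree.
interface FiberLike {
  type: unknown;            // string for host elements, function/class for components
  return: FiberLike | null; // parent fiber
}

function findFiber(el: Record<string, unknown>): FiberLike | null {
  const key = Object.keys(el).find((k) => k.startsWith("__reactFiber$"));
  return key ? (el[key] as FiberLike) : null;
}

// Walk up the fiber tree until we hit a function/class component with a name.
function componentName(el: Record<string, unknown>): string | null {
  for (let fiber = findFiber(el); fiber; fiber = fiber.return) {
    const t: any = fiber.type;
    if (typeof t === "function") {
      return t.displayName ?? t.name ?? null;
    }
  }
  return null;
}
```

This is why the README calls the detection "opportunistic": if the key is absent (non-React page, or changed internals), the walk simply returns `null` and everything else still works.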
Human context (your input):
| Feature | Description |
|---|---|
| Voice narration | Speak naturally; transcription runs live with timestamps |
| Visual annotations | Circle, arrow, freehand, and rectangle tools |
| Element selection | Click to select, Alt+scroll to cycle through parent/child elements |
Everything is temporally correlated. The cursor dwell data shows which element you were pointing at when you said each phrase. Annotations are timestamped to align with your voice. Screenshots are enriched with the voice context from the moment you drew the annotation.
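The correlation itself reduces to interval overlap. A minimal sketch, assuming dwell records and voice segments both carry start/end offsets (in seconds) from recording start:

```typescript
// Sketch of temporal correlation: for each dwell interval, collect the voice
// segments whose time range overlaps it. Data shapes are illustrative.
interface Dwell { selector: string; start: number; end: number; }
interface Voice { text: string; start: number; end: number; }

// Two half-open intervals overlap iff each starts before the other ends.
function overlaps(a: { start: number; end: number }, b: { start: number; end: number }): boolean {
  return a.start < b.end && b.start < a.end;
}

// Attach to each dwell the phrases spoken while the cursor sat on that element.
function correlate(dwells: Dwell[], voice: Voice[]): Array<Dwell & { during: string[] }> {
  return dwells.map((d) => ({
    ...d,
    during: voice.filter((v) => overlaps(d, v)).map((v) => v.text),
  }));
}
```

This is how a dwell on `h1.font-display.text-hero` ends up annotated with `(during: "main hero")` in the compiled output: the phrase's timestamp range falls inside the dwell window.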
| Layer | Technology |
|---|---|
| Extension | Chrome Manifest V3 |
| UI | React 18, TypeScript |
| Build | Vite + CRXJS |
| Runtime | Bun |
| Canvas | HTML5 Canvas API (annotation overlay) |
| Voice | Web Speech API |
| Testing | Vitest (1188 tests) |
```
pointdev/
├── src/
│   ├── background/   # Service worker, message handler, session store
│   ├── content/      # Element selector, canvas overlay, cursor tracker,
│   │                 # React inspector, device metadata, console/network capture
│   ├── shared/       # Types, message definitions, template formatter,
│   │                 # dwell computation
│   └── sidepanel/    # React UI: App, hooks, components, screenshot intelligence
├── public/           # Mic-permission page, icons
├── tests/            # Vitest unit tests (mirrors src/ structure)
├── docs/
│   ├── design/       # MVP spec, implementation plan, library research
│   ├── superpowers/  # Feature specs and implementation plans
│   └── genai-disclosure/  # AI-assisted development log
├── CLAUDE.md         # AI agent guidance for this codebase
├── CONTRIBUTING.md   # Dev setup, testing, commit conventions
└── README.md
```
PointDev requests minimal Chrome permissions:
| Permission | Why |
|---|---|
| `activeTab` | Access the current tab when you start a capture |
| `scripting` | Inject content script + main-world console/network capture |
| `sidePanel` | The extension UI |
| `storage` | Persist capture session and mic permission state |
| `<all_urls>` (host) | Required for periodic screenshot capture during active sessions (`captureVisibleTab` from sidepanel context needs host permission; `activeTab` alone doesn't work from sidepanel timers) |
Screenshots are only captured during active capture sessions you explicitly start. No background access to your browsing. Voice transcription offers two modes: "Fast (Google)" uses Web Speech API (audio sent to Google), "Private (On-device)" uses Whisper via WASM (zero network traffic, all inference local).
- Element selection with CSS selector, computed styles, box model, DOM subtree
- React component detection via fiber internals
- CSS custom property discovery on selected elements
- Canvas annotation overlay (circle, arrow, freehand, rectangle) with scroll anchoring
- Element ancestry cycling (Alt+scroll to select parent/child)
- Voice transcription with timestamped segments (sidepanel-native)
- Cursor dwell tracking with temporal correlation
- Device metadata capture
- Smart screenshots via multi-signal intelligence (frame diff + dwell + voice + annotations)
- Console errors + failed network request capture (main-world injection)
- Compiled structured output with copy-to-clipboard
- Local speech-to-text via Whisper — on-device, zero cloud dependency (#7)
- Pluggable output formats: Text, JSON, Markdown (#10)
- Bridge server for AI tool delivery via WebSocket + MCP (#12)
- Source file path resolution from selectors (#20)
- Firefox WebExtensions port (#34)
- Tab video recording for session replay (#33)
- Accessibility capture (ARIA roles, names) (#23)
- Multi-element selection (#13)
- Vue and Svelte component detection (#9)
See all open issues for the full backlog.
Contributions are welcome! See CONTRIBUTING.md for development setup, testing, coding standards, and commit conventions.
Look for issues labeled good first issue and help wanted.
This project is built using AI coding tools (Claude Code, Cursor) as the primary implementation workflow. The developer architects solutions, defines acceptance criteria, and reviews all code. AI agents execute implementation tasks under human direction.
Commits from AI agents use Co-Authored-By tags so they are distinguishable from human-authored commits. All code is reviewed, tested, and validated by the maintainer before merging. Architectural decisions are made by the human lead.
This is directly relevant to PointDev's mission: we're building a tool that improves the input side of human-to-AI-coder communication, and we're building it with those same tools.
Full development log: docs/genai-disclosure/development-log.md
MIT. See LICENSE.
github.com/BraedenBDev/pointdev · Built by Braeden Bihag at Almost a Lab